archivesunleashed / aut

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
https://aut.docs.archivesunleashed.org/
Apache License 2.0

Add a number of additional app extractors. #451

Closed ruebot closed 4 years ago

ruebot commented 4 years ago

GitHub issue(s): #447

What does this Pull Request do?

Add a number of additional app extractors.

How should this be tested?

bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor AudioInformationExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/447-test/AudioInformationExtractor
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor ImageInformationExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/447-test/ImageInformationExtractor
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor PDFInformationExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/447-test/PDFInformationExtractor
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor PresentationProgramInformationExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/447-test/PresentationProgramInformationExtractor
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor SpreadsheetInformationExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/447-test/SpreadsheetInformationExtractor
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor TextFilesInformationExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/447-test/TextFilesInformationExtractor
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor VideoInformationExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/447-test/VideoInformationExtractor
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor WordProcessorInformationExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/447-test/WordProcessorInformationExtractor
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor WebGraphInformationExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/447-test/WebGraphInformationExtractor
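
The nine commands above differ only in the extractor name, so they could be generated from a list. The following is a minimal dry-run sketch, not part of the PR: it only prints the commands. `AUT_JAR`, `INPUT` and `OUTPUT_DIR` are hypothetical variables with placeholder relative paths, and the last entry uses `WebGraphExtractor`, the name the error log later in this thread lists as supported.

```shell
#!/bin/sh
# Dry-run sketch: prints one spark-submit command per extractor; it does not run Spark.
# AUT_JAR, INPUT and OUTPUT_DIR are hypothetical; point them at your own paths.
AUT_JAR="${AUT_JAR:-target/aut-0.60.1-SNAPSHOT-fatjar.jar}"
INPUT="${INPUT:-sample-data/geocities/*}"
OUTPUT_DIR="${OUTPUT_DIR:-sample-data/447-test}"

# Extractor names taken from the test commands in this PR description.
EXTRACTORS="AudioInformationExtractor
ImageInformationExtractor
PDFInformationExtractor
PresentationProgramInformationExtractor
SpreadsheetInformationExtractor
TextFilesInformationExtractor
VideoInformationExtractor
WordProcessorInformationExtractor
WebGraphExtractor"

build_cmd() {
  # $1 is the extractor name; the input glob is printed unexpanded on purpose.
  printf '%s\n' "bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner ${AUT_JAR} --extractor $1 --input ${INPUT} --output ${OUTPUT_DIR}/$1"
}

# Each extractor writes to its own output directory, named after the extractor.
for extractor in $EXTRACTORS; do
  build_cmd "$extractor"
done
```

Printing rather than executing keeps the sketch safe to try; reviewing the printed commands before running them from the Spark directory is probably the sensible order.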

Additional Notes:

  1. I added WebGraphExtractor as an additional option, since its output differs slightly from the csv output of DomainGraphExtractor.
  2. I tweaked WebPagesExtractor to produce output similar to, but richer than, that of PlainTextExtractor. We might want to consider removing PlainTextExtractor in the future.
  3. For all the binary extractors, I only added the binary information extractor. Before we add the binary extractor, or binary + binary information (the full DataFrame), we should talk it out a bit more and do some testing with csv output.
ruebot commented 4 years ago

I'll get an associated documentation PR opened up later today.

codecov[bot] commented 4 years ago

Codecov Report

Merging #451 into master will increase coverage by 2.17%. The diff coverage is 98.58%.

@@            Coverage Diff             @@
##           master     #451      +/-   ##
==========================================
+ Coverage   74.55%   76.72%   +2.17%     
==========================================
  Files          40       49       +9     
  Lines        1285     1422     +137     
  Branches      246      264      +18     
==========================================
+ Hits          958     1091     +133     
- Misses        211      215       +4     
  Partials      116      116              
ruebot commented 4 years ago

Documentation PR: https://github.com/archivesunleashed/aut-docs/pull/57

ruebot commented 4 years ago

Oh, sorry. That was copypasta on my part.

ianmilligan1 commented 4 years ago

Heh, no worries @ruebot. It was actually good to see the robust error messages.

20/04/21 16:28:11 ERROR CommandLineApp: WebGraphInformationExtractor not supported. The following extractors are supported:
20/04/21 16:28:11 ERROR CommandLineApp: PDFInformationExtractor
20/04/21 16:28:11 ERROR CommandLineApp: TextFilesInformationExtractor
20/04/21 16:28:11 ERROR CommandLineApp: ImageGraphExtractor
20/04/21 16:28:11 ERROR CommandLineApp: WebPagesExtractor
20/04/21 16:28:11 ERROR CommandLineApp: ImageInformationExtractor
20/04/21 16:28:11 ERROR CommandLineApp: WordProcessorInformationExtractor
20/04/21 16:28:11 ERROR CommandLineApp: SpreadsheetInformationExtractor
20/04/21 16:28:11 ERROR CommandLineApp: VideoInformationExtractor
20/04/21 16:28:11 ERROR CommandLineApp: WebGraphExtractor
20/04/21 16:28:11 ERROR CommandLineApp: AudioInformationExtractor
20/04/21 16:28:11 ERROR CommandLineApp: PresentationProgramInformationExtractor
20/04/21 16:28:11 ERROR CommandLineApp: DomainGraphExtractor
20/04/21 16:28:11 ERROR CommandLineApp: DomainFrequencyExtractor
20/04/21 16:28:11 ERROR CommandLineApp: PlainTextExtractor
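
A name could also be checked against that list before submitting a job. This is a rough sketch under the assumption that the supported set is exactly the fourteen names in the log above; `is_supported` is a hypothetical helper, not something the toolkit provides, and other builds may support a different set.

```shell
#!/bin/sh
# Names copied from the CommandLineApp log above; other builds may differ.
SUPPORTED="PDFInformationExtractor TextFilesInformationExtractor ImageGraphExtractor
WebPagesExtractor ImageInformationExtractor WordProcessorInformationExtractor
SpreadsheetInformationExtractor VideoInformationExtractor WebGraphExtractor
AudioInformationExtractor PresentationProgramInformationExtractor
DomainGraphExtractor DomainFrequencyExtractor PlainTextExtractor"

# Hypothetical helper: succeeds only if $1 exactly matches a listed name.
is_supported() {
  for name in $SUPPORTED; do
    [ "$name" = "$1" ] && return 0
  done
  return 1
}

for candidate in WebGraphInformationExtractor WebGraphExtractor; do
  if is_supported "$candidate"; then
    echo "$candidate: supported"
  else
    echo "$candidate: not supported"
  fi
done
```

With the list above, the first candidate is reported as not supported and the second as supported, which matches the log.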