archivesunleashed / aut

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
https://aut.docs.archivesunleashed.org/
Apache License 2.0

Remove RDD option in app; DataFrame only now. #450

ruebot closed 4 years ago

ruebot commented 4 years ago

GitHub issue(s): #449

What does this Pull Request do?

Remove RDD option in app; DataFrame only now.

I'll open an associated documentation PR for this as well.

How should this be tested?

If you want to be robust, run the following:

bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor DomainGraphExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/449-test/DomainGraphExtractorGRAPHML --output-format GRAPHML
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor DomainGraphExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/449-test/DomainGraphExtractorGEXF --output-format GEXF
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor DomainGraphExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/449-test/DomainGraphExtractorTEXT
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor DomainFrequencyExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/449-test/DomainFrequencyExtractor
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor ImageGraphExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/449-test/ImageGraphExtractor
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor PlainTextExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/449-test/PlainTextExtractor
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor WebPagesExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/449-test/WebPagesExtractor
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor DomainFrequencyExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/449-test/DomainFrequencyExtractorSingle --partition 1
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor ImageGraphExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/449-test/ImageGraphExtractorSingle --partition 1
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor PlainTextExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/449-test/PlainTextExtractorSingle --partition 1
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor WebPagesExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/449-test/WebPagesExtractorSingle --partition 1
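
The eleven invocations above differ only in the extractor, the output directory name, and the optional `--output-format` or `--partition` flags, so they could be generated by a small wrapper. The following is a minimal sketch and not part of this PR: the variables `AUT_JAR`, `INPUT`, `OUT`, the `DRY_RUN` switch, and the helper `aut_cmd` are hypothetical names, and the default paths are placeholders. Only the flags themselves are copied from the commands above.

```shell
#!/bin/sh
# Hypothetical locations (assumed names, not from the PR); adjust to your checkout.
AUT_JAR="${AUT_JAR:-target/aut-0.60.1-SNAPSHOT-fatjar.jar}"
INPUT="${INPUT:-sample-data/geocities/*}"
OUT="${OUT:-sample-data/449-test}"

# aut_cmd EXTRACTOR OUTPUT_NAME [EXTRA FLAGS...]
# Prints the spark-submit command when DRY_RUN=1 (the default in this sketch),
# otherwise runs it.
aut_cmd() {
  extractor="$1"; name="$2"; shift 2
  # $INPUT is deliberately unquoted so the shell expands the glob, as in the
  # original commands; this breaks on paths that contain spaces.
  set -- bin/spark-submit \
    --class io.archivesunleashed.app.CommandLineAppRunner "$AUT_JAR" \
    --extractor "$extractor" --input $INPUT --output "$OUT/$name" "$@"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# The three DomainGraphExtractor variants.
aut_cmd DomainGraphExtractor DomainGraphExtractorGRAPHML --output-format GRAPHML
aut_cmd DomainGraphExtractor DomainGraphExtractorGEXF --output-format GEXF
aut_cmd DomainGraphExtractor DomainGraphExtractorTEXT

# The remaining extractors, each with default partitioning and with --partition 1.
for e in DomainFrequencyExtractor ImageGraphExtractor PlainTextExtractor WebPagesExtractor; do
  aut_cmd "$e" "$e"
  aut_cmd "$e" "${e}Single" --partition 1
done
```

With `DRY_RUN=1` the script only prints the command lines, which makes it easy to compare them against the list above before setting `DRY_RUN=0` to run them from the Spark directory.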

Should produce the following:

[nruest@bomba:449-test]$ tree .
.
├── DomainFrequencyExtractor
│   ├── part-00000-c347d2b0-ea6d-4256-b43f-ca6180db78e1-c000.csv
│   ├── part-00001-c347d2b0-ea6d-4256-b43f-ca6180db78e1-c000.csv
│   ├── part-00002-c347d2b0-ea6d-4256-b43f-ca6180db78e1-c000.csv
│   ├── part-00003-c347d2b0-ea6d-4256-b43f-ca6180db78e1-c000.csv
│   ├── part-00004-c347d2b0-ea6d-4256-b43f-ca6180db78e1-c000.csv
│   ├── part-00005-c347d2b0-ea6d-4256-b43f-ca6180db78e1-c000.csv
│   ├── part-00006-c347d2b0-ea6d-4256-b43f-ca6180db78e1-c000.csv
│   ├── part-00007-c347d2b0-ea6d-4256-b43f-ca6180db78e1-c000.csv
│   ├── part-00008-c347d2b0-ea6d-4256-b43f-ca6180db78e1-c000.csv
│   ├── part-00009-c347d2b0-ea6d-4256-b43f-ca6180db78e1-c000.csv
│   ├── part-00010-c347d2b0-ea6d-4256-b43f-ca6180db78e1-c000.csv
│   ├── part-00011-c347d2b0-ea6d-4256-b43f-ca6180db78e1-c000.csv
│   ├── part-00012-c347d2b0-ea6d-4256-b43f-ca6180db78e1-c000.csv
│   ├── part-00013-c347d2b0-ea6d-4256-b43f-ca6180db78e1-c000.csv
│   ├── part-00014-c347d2b0-ea6d-4256-b43f-ca6180db78e1-c000.csv
│   ├── part-00015-c347d2b0-ea6d-4256-b43f-ca6180db78e1-c000.csv
│   ├── part-00016-c347d2b0-ea6d-4256-b43f-ca6180db78e1-c000.csv
│   ├── part-00017-c347d2b0-ea6d-4256-b43f-ca6180db78e1-c000.csv
│   ├── part-00018-c347d2b0-ea6d-4256-b43f-ca6180db78e1-c000.csv
│   ├── part-00019-c347d2b0-ea6d-4256-b43f-ca6180db78e1-c000.csv
│   ├── part-00020-c347d2b0-ea6d-4256-b43f-ca6180db78e1-c000.csv
│   └── _SUCCESS
├── DomainFrequencyExtractorSingle
│   ├── part-00000-804dacc4-932c-44ea-b10e-66430f8f3a45-c000.csv
│   └── _SUCCESS
├── DomainGraphExtractorGEXF
│   └── GEXF.gexf
├── DomainGraphExtractorGRAPHML
│   └── GRAPHML.graphml
├── DomainGraphExtractorTEXT
│   ├── part-00000-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00001-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00002-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00003-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00004-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00005-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00006-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00007-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00008-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00009-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00010-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00011-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00012-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00013-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00014-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00015-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00016-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00017-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00018-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00019-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00020-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00021-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00022-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00023-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00024-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00025-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00026-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00027-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00028-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00029-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00030-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00031-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00032-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00033-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00034-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00035-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00036-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00037-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00038-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00039-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00040-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00041-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00042-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00043-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00044-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00045-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00046-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00047-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00048-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00049-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00050-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00051-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00052-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00053-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00054-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00055-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00056-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00057-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00058-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00059-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00060-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00061-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00062-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00063-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   ├── part-00064-426dfb8a-2bf5-42c1-97fc-a8e4c257e342-c000.csv
│   └── _SUCCESS
├── ImageGraphExtractor
│   ├── part-00000-75c3fe4c-d5eb-4ed2-a4bf-04e37d4889d1-c000.csv
│   ├── part-00001-75c3fe4c-d5eb-4ed2-a4bf-04e37d4889d1-c000.csv
│   ├── part-00002-75c3fe4c-d5eb-4ed2-a4bf-04e37d4889d1-c000.csv
│   ├── part-00003-75c3fe4c-d5eb-4ed2-a4bf-04e37d4889d1-c000.csv
│   ├── part-00004-75c3fe4c-d5eb-4ed2-a4bf-04e37d4889d1-c000.csv
│   ├── part-00005-75c3fe4c-d5eb-4ed2-a4bf-04e37d4889d1-c000.csv
│   ├── part-00006-75c3fe4c-d5eb-4ed2-a4bf-04e37d4889d1-c000.csv
│   ├── part-00007-75c3fe4c-d5eb-4ed2-a4bf-04e37d4889d1-c000.csv
│   ├── part-00008-75c3fe4c-d5eb-4ed2-a4bf-04e37d4889d1-c000.csv
│   ├── part-00009-75c3fe4c-d5eb-4ed2-a4bf-04e37d4889d1-c000.csv
│   └── _SUCCESS
├── ImageGraphExtractorSingle
│   ├── part-00000-005dd922-dc35-46ca-b9d3-c3184637e1db-c000.csv
│   └── _SUCCESS
├── PlainTextExtractor
│   ├── part-00000-2c580b7e-9025-4112-89ce-5d4e76e06977-c000.csv
│   ├── part-00001-2c580b7e-9025-4112-89ce-5d4e76e06977-c000.csv
│   ├── part-00002-2c580b7e-9025-4112-89ce-5d4e76e06977-c000.csv
│   ├── part-00003-2c580b7e-9025-4112-89ce-5d4e76e06977-c000.csv
│   ├── part-00004-2c580b7e-9025-4112-89ce-5d4e76e06977-c000.csv
│   ├── part-00005-2c580b7e-9025-4112-89ce-5d4e76e06977-c000.csv
│   ├── part-00006-2c580b7e-9025-4112-89ce-5d4e76e06977-c000.csv
│   ├── part-00007-2c580b7e-9025-4112-89ce-5d4e76e06977-c000.csv
│   ├── part-00008-2c580b7e-9025-4112-89ce-5d4e76e06977-c000.csv
│   ├── part-00009-2c580b7e-9025-4112-89ce-5d4e76e06977-c000.csv
│   └── _SUCCESS
├── PlainTextExtractorSingle
│   ├── part-00000-e982ea04-0176-4070-a739-6532aef2edba-c000.csv
│   └── _SUCCESS
├── WebPagesExtractor
│   ├── part-00000-a5be4790-082c-44f6-8f1c-4d1cd219ae10-c000.csv
│   ├── part-00001-a5be4790-082c-44f6-8f1c-4d1cd219ae10-c000.csv
│   ├── part-00002-a5be4790-082c-44f6-8f1c-4d1cd219ae10-c000.csv
│   ├── part-00003-a5be4790-082c-44f6-8f1c-4d1cd219ae10-c000.csv
│   ├── part-00004-a5be4790-082c-44f6-8f1c-4d1cd219ae10-c000.csv
│   ├── part-00005-a5be4790-082c-44f6-8f1c-4d1cd219ae10-c000.csv
│   ├── part-00006-a5be4790-082c-44f6-8f1c-4d1cd219ae10-c000.csv
│   ├── part-00007-a5be4790-082c-44f6-8f1c-4d1cd219ae10-c000.csv
│   ├── part-00008-a5be4790-082c-44f6-8f1c-4d1cd219ae10-c000.csv
│   ├── part-00009-a5be4790-082c-44f6-8f1c-4d1cd219ae10-c000.csv
│   └── _SUCCESS
└── WebPagesExtractorSingle
    ├── part-00000-db87b0bc-5761-4f8c-bd79-92dbaf41d0fd-c000.csv
    └── _SUCCESS

11 directories, 131 files
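
A listing like the one above can also be checked mechanically. The sketch below confirms that every `*Single` directory (the `--partition 1` runs) contains a `_SUCCESS` marker and exactly one `part-*.csv` file. The function names `count_parts` and `check_outputs` are hypothetical, and the check is only an illustration of the expected layout, not a test shipped with this PR.

```shell
#!/bin/sh
# count_parts DIR: print the number of part-*.csv files directly inside DIR.
count_parts() {
  n=0
  for f in "$1"/part-*.csv; do
    # An unmatched glob stays literal, so only count entries that exist.
    [ -e "$f" ] && n=$((n + 1))
  done
  echo "$n"
}

# check_outputs ROOT: return 0 if every *Single directory under ROOT has a
# _SUCCESS marker and exactly one part file; report offenders and return 1 otherwise.
check_outputs() {
  status=0
  for d in "$1"/*Single; do
    [ -d "$d" ] || continue
    parts=$(count_parts "$d")
    if [ ! -e "$d/_SUCCESS" ] || [ "$parts" -ne 1 ]; then
      echo "unexpected layout: $d ($parts part files)"
      status=1
    fi
  done
  return "$status"
}
```

For example, `check_outputs /path/to/449-test && echo "layout OK"` would report success for the tree shown above. Note that a root with no `*Single` directories also passes, so the check is only meaningful after the runs have completed.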

codecov[bot] commented 4 years ago

Codecov Report

Merging #450 into master will decrease coverage by 1.00%. The diff coverage is 88.88%.

@@            Coverage Diff             @@
##           master     #450      +/-   ##
==========================================
- Coverage   75.55%   74.55%   -1.01%     
==========================================
  Files          40       40              
  Lines        1395     1285     -110     
  Branches      265      246      -19     
==========================================
- Hits         1054      958      -96     
+ Misses        218      211       -7     
+ Partials      123      116       -7     
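
The coverage percentages in the table can be reproduced from its Hits and Lines rows. The sketch below uses shell integer arithmetic in hundredths of a percent. The helper name `coverage_bp` is hypothetical, and the truncating division is an assumption that happens to match the two-decimal figures shown, not a statement about how Codecov computes them.

```shell
#!/bin/sh
# coverage_bp HITS LINES: coverage in hundredths of a percent.
# Shell integer division truncates, so 7555 means 75.55%.
coverage_bp() {
  echo $(( $1 * 10000 / $2 ))
}

before=$(coverage_bp 1054 1395)   # master: hits / lines
after=$(coverage_bp 958 1285)     # this PR: hits / lines
echo "before=$before after=$after delta=$((after - before))"
# prints: before=7555 after=7455 delta=-100
```

The result corresponds to 75.55% and 74.55%, a drop of 1.00 percentage points, which agrees with the summary sentence of the report. The `-1.01%` shown in the diff column is not reproduced by this sketch.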

ruebot commented 4 years ago

Documentation PR: https://github.com/archivesunleashed/aut-docs/pull/56