archivesunleashed / aut

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
https://aut.docs.archivesunleashed.org/
Apache License 2.0

DomainGraphExtractor produces different output in RDD vs DF #436

Closed ruebot closed 4 years ago

ruebot commented 4 years ago

To Reproduce

Steps to reproduce the behavior:

  1. bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.50.1-SNAPSHOT-fatjar.jar --extractor DomainGraphExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/app-output/DomainGraphText --output-format TEXT

  2. bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.50.1-SNAPSHOT-fatjar.jar --extractor DomainGraphExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/app-output/DomainGraphText --df --output-format TEXT

  3. Concatenate (cat) the part files from each run into a single file per run.

  4. Compare the line counts:
    $ wc -l DomainGraphText.txt DomainGraphDFtext.csv
    4935 DomainGraphText.txt
    70368 DomainGraphDFtext.csv
    75303 total
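The concatenate-and-count steps can be sketched as a small shell helper. This is only an illustration: `combine_parts` is a hypothetical function (not part of aut), the paths in the usage comment are placeholders, and it assumes Spark wrote its output as `part-*` files directly inside the output directory.

```shell
# Hypothetical helper (not part of aut): concatenate the Spark part files
# in an output directory into one file, then print that file's line count.
combine_parts() {
  out_dir=$1    # directory Spark wrote to (assumed to contain part-* files)
  combined=$2   # path of the single combined file to create
  cat "$out_dir"/part-* > "$combined" || return 1
  # Read from stdin so wc prints only the number; strip any padding.
  wc -l < "$combined" | tr -d ' '
}

# Example usage (placeholder paths):
#   combine_parts /path/to/rdd-output DomainGraphText.txt
#   combine_parts /path/to/df-output  DomainGraphDFtext.csv
```

Running the helper once per output directory gives the two counts that `wc -l` reports above.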

Expected behavior

The RDD and DataFrame runs should produce the same output, so the two combined files should be the same.
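Because Spark does not guarantee the order of part files or of rows within them, a line count alone cannot confirm that two outputs match. A minimal sketch of an order-insensitive check follows. `same_lines` is a hypothetical helper, and it assumes both runs write rows in the same textual format, which this issue does not confirm.

```shell
# Hypothetical helper: succeed (exit 0) when two files contain the same
# lines, ignoring line order; exit 1 when they differ.
same_lines() {
  tmp_a=$(mktemp) || return 2
  tmp_b=$(mktemp) || { rm -f "$tmp_a"; return 2; }
  sort "$1" > "$tmp_a"
  sort "$2" > "$tmp_b"
  # cmp -s compares byte for byte and prints nothing.
  if cmp -s "$tmp_a" "$tmp_b"; then status=0; else status=1; fi
  rm -f "$tmp_a" "$tmp_b"
  return "$status"
}

# Example usage:
#   same_lines DomainGraphText.txt DomainGraphDFtext.csv && echo "outputs match"
```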

Additional context

Blocks #435