archivesunleashed / aut-docs

AUT documentation
https://aut.docs.archivesunleashed.org/

Document command line app #14

Closed (ruebot closed 4 years ago)

ruebot commented 4 years ago

We have little to no documentation (other than doc comments) for the command line app: https://github.com/archivesunleashed/aut/tree/master/src/main/scala/io/archivesunleashed/app

ianmilligan1 commented 4 years ago

FYI from AUT PR #236:

DataFrame implementation (--df flag)

./aut_runtree/spark-2.3.0-bin-hadoop2.7/bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner ./aut_self/aut/target/aut-0.16.1-SNAPSHOT-fatjar.jar --extractor DomainGraphExtractor --input ./aut_self/aut/src/test/resources/warc/example.warc.gz ./aut_self/aut/src/test/resources/arc/example.arc.gz  --output output1 --df 

Partition (combining all files together) (--partition flag)

./aut_runtree/spark-2.3.0-bin-hadoop2.7/bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner ./aut_self/aut/target/aut-0.16.1-SNAPSHOT-fatjar.jar --extractor DomainFrequencyExtractor --input ./aut_self/aut/src/test/resources/warc/example.warc.gz ./aut_self/aut/src/test/resources/arc/example.arc.gz  --output output2 --df  --partition 1

Output will be a single file rather than PART-0000, PART-0001, etc.

Each W/ARC to its own directory (--split flag)

./aut_runtree/spark-2.3.0-bin-hadoop2.7/bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner ./aut_self/aut/target/aut-0.16.1-SNAPSHOT-fatjar.jar --extractor DomainFrequencyExtractor --input ./aut_self/aut/src/test/resources/warc/example.warc.gz ./aut_self/aut/src/test/resources/arc/example.arc.gz  --output output3 --df  --split
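The three invocations above differ only in the extractor, the output directory, and the trailing flags, so for documentation purposes they could be wrapped in a small dry-run helper. This is only a sketch: the function name `aut_cmd` and the variables `SPARK_SUBMIT`, `AUT_JAR`, and `AUT_INPUT` are hypothetical, their defaults are simply the paths from the commands above, and the helper prints the command rather than running it.

```shell
# Dry-run helper: prints the spark-submit command instead of running it.
# SPARK_SUBMIT, AUT_JAR and AUT_INPUT are hypothetical names; the defaults
# are the paths used in the commands above and will differ per machine.
SPARK_SUBMIT="${SPARK_SUBMIT:-./aut_runtree/spark-2.3.0-bin-hadoop2.7/bin/spark-submit}"
AUT_JAR="${AUT_JAR:-./aut_self/aut/target/aut-0.16.1-SNAPSHOT-fatjar.jar}"
AUT_INPUT="${AUT_INPUT:-./aut_self/aut/src/test/resources/warc/example.warc.gz ./aut_self/aut/src/test/resources/arc/example.arc.gz}"

# Usage: aut_cmd EXTRACTOR OUTPUT_DIR [extra flags, e.g. --df --partition 1]
aut_cmd() {
  extractor="$1"
  output="$2"
  shift 2
  # AUT_INPUT is left unquoted on purpose so it splits into one path per word.
  set -- "$SPARK_SUBMIT" --class io.archivesunleashed.app.CommandLineAppRunner \
    "$AUT_JAR" --extractor "$extractor" --input $AUT_INPUT --output "$output" "$@"
  # Print the assembled command on one line.
  printf '%s\n' "$*"
}

# The three examples from this comment, printed rather than executed.
aut_cmd DomainGraphExtractor output1 --df
aut_cmd DomainFrequencyExtractor output2 --df --partition 1
aut_cmd DomainFrequencyExtractor output3 --df --split
```

To run the command instead of printing it, the `printf` line could be replaced with `"$@"`.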

I can't completely remember the context of why this was done vs. just loading in scripts.

ruebot commented 4 years ago

It was this one: https://github.com/archivesunleashed/aut/issues/195. Makes it a lot easier to use spark-submit.

I wasn't paying much attention at the time since I was heads down on auk, so I don't recall ever really putting anything through its paces with spark-submit.