Closed ruebot closed 4 years ago
FYI from AUT PR #236:
--df flag:
./aut_runtree/spark-2.3.0-bin-hadoop2.7/bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner ./aut_self/aut/target/aut-0.16.1-SNAPSHOT-fatjar.jar --extractor DomainGraphExtractor --input ./aut_self/aut/src/test/resources/warc/example.warc.gz ./aut_self/aut/src/test/resources/arc/example.arc.gz --output output1 --df
--partition flag:
./aut_runtree/spark-2.3.0-bin-hadoop2.7/bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner ./aut_self/aut/target/aut-0.16.1-SNAPSHOT-fatjar.jar --extractor DomainFrequencyExtractor --input ./aut_self/aut/src/test/resources/warc/example.warc.gz ./aut_self/aut/src/test/resources/arc/example.arc.gz --output output2 --df --partition 1
Output will be a single file rather than PART-0000, PART-0001, etc.
--split flag:
./aut_runtree/spark-2.3.0-bin-hadoop2.7/bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner ./aut_self/aut/target/aut-0.16.1-SNAPSHOT-fatjar.jar --extractor DomainFrequencyExtractor --input ./aut_self/aut/src/test/resources/warc/example.warc.gz ./aut_self/aut/src/test/resources/arc/example.arc.gz --output output3 --df --split
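The three invocations above differ only in the extractor, the output directory, and the trailing flags, so they could be generated from one small helper. This is just a sketch: `aut_cmd` and the variable names are made up for illustration and are not part of aut, and the paths are copied from the commands above and will need adjusting for a different checkout. The helper only prints each command line; it does not run Spark.

```shell
# Paths copied from the commands above; adjust them for your own checkout.
SPARK_SUBMIT=./aut_runtree/spark-2.3.0-bin-hadoop2.7/bin/spark-submit
AUT_JAR=./aut_self/aut/target/aut-0.16.1-SNAPSHOT-fatjar.jar
RES=./aut_self/aut/src/test/resources

# aut_cmd is a hypothetical helper (not part of aut). It prints the
# spark-submit command line for the given extractor and output directory,
# followed by any extra flags passed after those two arguments.
aut_cmd() {
  extractor=$1
  output=$2
  shift 2
  echo "$SPARK_SUBMIT" \
    --class io.archivesunleashed.app.CommandLineAppRunner "$AUT_JAR" \
    --extractor "$extractor" \
    --input "$RES/warc/example.warc.gz" "$RES/arc/example.arc.gz" \
    --output "$output" "$@"
}

# Print the three command lines from the list above.
aut_cmd DomainGraphExtractor output1 --df
aut_cmd DomainFrequencyExtractor output2 --df --partition 1
aut_cmd DomainFrequencyExtractor output3 --df --split
```

To actually run one of them, pipe a single call to a shell, e.g. `aut_cmd DomainGraphExtractor output1 --df | sh`.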
I can't completely remember the context of why this was done vs. just loading in scripts?
It was this one: https://github.com/archivesunleashed/aut/issues/195. It makes it a lot easier to use spark-submit.
I wasn't paying much attention at the time since I was heads down on auk, so I don't recall ever really putting anything through its paces with spark-submit.
We have little to no documentation (other than doc comments) for the command line app: https://github.com/archivesunleashed/aut/tree/master/src/main/scala/io/archivesunleashed/app