commoncrawl / cc-index-table

Index Common Crawl archives in tabular format
Apache License 2.0

Downloading the relevant jar file? #27

Closed bbrancar closed 1 year ago

bbrancar commented 1 year ago

Hi,

I have successfully downloaded a CSV via Amazon Athena and would like to bulk-download the listed WARC files. After cloning the GitHub repository and setting my $SPARK_HOME to my download of pyspark in my virtual environment, I ran:

> $SPARK_HOME/bin/spark-submit --class org.commoncrawl.spark.examples.CCIndexWarcExport $APPJAR \
>     --csv xyx ...

This returned the error: Failed to find Spark jars directory (xyz). Do you have any suggestions on how I can resolve this issue?

Thank you

sebastian-nagel commented 1 year ago

The Spark jars directory is $SPARK_HOME/jars/.
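
A quick way to confirm this before calling spark-submit is sketched below. It is only an illustration, not part of cc-index-table: the function name `check_spark_home` is made up here, and it assumes that a usable Spark installation has a `jars/` directory containing `.jar` files and a `bin/spark-submit` script under $SPARK_HOME.

```python
import os
from pathlib import Path


def check_spark_home(spark_home=None):
    """Return (ok, message) for a Spark home directory.

    Hypothetical helper: it only inspects the directory layout and
    does not start Spark. Falls back to the SPARK_HOME environment
    variable when no path is passed in.
    """
    spark_home = spark_home or os.environ.get("SPARK_HOME")
    if not spark_home:
        return False, "SPARK_HOME is not set"

    home = Path(spark_home)
    jars_dir = home / "jars"

    # This is the directory named in the comment above.
    if not jars_dir.is_dir():
        return False, f"missing jars directory: {jars_dir}"

    # An empty jars/ directory suggests an incomplete download.
    if not any(jars_dir.glob("*.jar")):
        return False, f"no .jar files in {jars_dir}"

    # The command in the question runs $SPARK_HOME/bin/spark-submit.
    if not (home / "bin" / "spark-submit").is_file():
        return False, f"missing launcher: {home / 'bin' / 'spark-submit'}"

    return True, f"found Spark jars in {jars_dir}"


if __name__ == "__main__":
    ok, message = check_spark_home()
    print(("OK: " if ok else "Problem: ") + message)
```

If the check reports a missing or empty `jars/` directory, the Spark download is probably incomplete or $SPARK_HOME points at the wrong directory.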

bbrancar commented 1 year ago

This was helpful. I had failed to download Spark correctly. Thank you

sebastian-nagel commented 1 year ago

Thanks for the feedback. I'll add a link to the Spark installation instructions in the README. Hope it helps future users.