commoncrawl / cc-pyspark

Process Common Crawl data with Python and Spark

Commands to execute python files? #12

Closed calee88 closed 4 years ago

calee88 commented 4 years ago

It would have been helpful if there were some command examples for each .py file. Or am I just not finding them? For now, I need to read every line of code to understand the examples. Still, I appreciate the examples; it would be much harder without them.

sebastian-nagel commented 4 years ago

So far, there's only the list of examples in the README with a short description of which data is extracted. In addition, every example shows command-line help if called with the option --help.
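For example, to see the available options of the columnar-index word count job (assuming SPARK_HOME points to your Spark installation):

# print the argparse help of one of the example jobs
$SPARK_HOME/bin/spark-submit ./cc_index_word_count.py --help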

What exactly do you need?

Let me know what you need! You may also ask for help and support on the Common Crawl forum.

calee88 commented 4 years ago

@sebastian-nagel Thank you for the reply.

The first one you mentioned is what I had in mind when I wrote the issue. The second option would be great, if you have time. The third option seems like too much for this repository.

Here is my story of struggle, which is still ongoing (you may skip this part). I am using Ubuntu 18.04.3 LTS, and what I want to achieve is to extract monolingual text from Common Crawl. I started from the command-line help of cc_index_word_count.py. I had to search for the path to the Common Crawl index table, and I also figured out that the "optional" query argument is not optional. I also needed to change the default Java version. Those were fine.

Then I got an error about s3: "No FileSystem for scheme: s3". I searched the internet and found that extra packages are needed, so I added --packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.2 to the command and changed the path to s3a://... Next it complained about AWS credentials, even though I had run "aws configure". My solution was to export the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.

So, this is the current bash script:

#!/bin/bash
# AWS credentials (redacted) - "aws configure" alone was not picked up
export AWS_ACCESS_KEY_ID=***
export AWS_SECRET_ACCESS_KEY=***

# SQL query against the columnar URL index
query="SELECT url, warc_filename, warc_record_offset, warc_record_length FROM ccindex LIMIT 10"

# single executor / single core: an attempt to work around the connection pool timeout
$SPARK_HOME/bin/spark-submit \
  --conf spark.hadoop.parquet.enable.dictionary=true \
  --conf spark.hadoop.parquet.enable.summary-metadata=false \
  --conf spark.sql.hive.metastorePartitionPruning=true \
  --conf spark.sql.parquet.filterPushdown=true \
  --conf spark.sql.parquet.mergeSchema=true \
  --conf spark.dynamicAllocation.maxExecutors=1 \
  --packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.2 \
  --executor-cores 1 \
  --num-executors 1 \
  cc_index_word_count.py --query "${query}" \
  --num_input_partitions 1 \
  --num_output_partitions 1 \
  s3a://commoncrawl/cc-index/table/cc-main/warc/ word_count

And I'm getting an org.apache.http.conn.ConnectionPoolTimeoutException. I tried limiting the executors (somebody on the internet suggested it), but it doesn't work as I expected. The exception happens at the df = spark.read.load(table_path) line of sparkcc.py.

Thank you for reading!

sebastian-nagel commented 4 years ago

Hi @calee88, thanks for the careful report. I've opened #13 and #14 to improve documentation and command-line help.

When querying the columnar index (--query): the data is located in the AWS us-east-1 region (Northern Virginia). It can be accessed remotely, but this requires a reliable and fast internet connection. In case you own an AWS account, there are two options to avoid the timeouts: run the Spark job on AWS in the us-east-1 region, or run the SQL query with Amazon Athena and use the exported result (a CSV file) as input for the job.
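A minimal sketch of the Athena route, assuming the columnar index has been registered in Athena as a table ccindex (the cc-index-table project describes how to set this up); the database name, crawl ID, and result bucket below are placeholders:

# same columns as in your script, restricted to one crawl via the partition columns
query="SELECT url, warc_filename, warc_record_offset, warc_record_length FROM ccindex WHERE crawl = 'CC-MAIN-2019-51' AND subset = 'warc' LIMIT 10"

# Athena writes the result as a CSV file to the given S3 output location
aws athena start-query-execution \
  --query-string "${query}" \
  --query-execution-context Database=ccindex \
  --result-configuration OutputLocation=s3://my-athena-results/ccindex/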

Let me know whether this works for you!

calee88 commented 4 years ago

Thank you for the reply @sebastian-nagel! I'm using a reliable and fast internet connection, although I'm far from Northern Virginia, so I don't think the connection is the problem here. Have you tried accessing the data remotely using the script I posted? Were you successful? Anyway, I'm going to try Athena or AWS as you suggested.

calee88 commented 4 years ago

Hello @sebastian-nagel. I am now able to run the query with Athena and use the resulting CSV file as input for the script. I still cannot use the query argument, but let me close this, as my original issue is summarized in #13.
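In case it helps others, roughly what the working call looks like, assuming the downloaded Athena result is athena_result.csv (with the columns url, warc_filename, warc_record_offset, warc_record_length) and that your cc-pyspark version offers a --csv option as an alternative to --query (check --help for the exact option name; the table path argument is kept as in the script above, although it may be ignored when a CSV is given):

$SPARK_HOME/bin/spark-submit \
  --packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.2 \
  cc_index_word_count.py \
  --csv athena_result.csv \
  --num_output_partitions 1 \
  s3a://commoncrawl/cc-index/table/cc-main/warc/ word_count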

sebastian-nagel commented 4 years ago

Thanks, @calee88, for the feedback. #13 will be addressed soon. Yes, I'm able to run the script.

calee88 commented 4 years ago

Thank you for the reply @sebastian-nagel. Athena seems much faster, so I'll just keep using it. I hope someone finds this thread helpful.