commoncrawl / cc-pyspark

Process Common Crawl data with Python and Spark
MIT License

Common Crawl Index Table - Need for Schema Merging to be documented #18

Closed chk2817 closed 3 years ago

chk2817 commented 4 years ago

Hi,

If someone needs to load the full index table (Parquet) into PySpark with all of the latest fields, it is necessary to set the Spark property spark.sql.parquet.mergeSchema to "true" or to use the following:

df = spark.read.option("mergeSchema", "true").parquet('s3://commoncrawl/cc-index/table/cc-main/warc/')

Without this, fields that were added at a later stage, such as content_languages, are not loaded into the Spark DataFrame.
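For reference, a minimal sketch of the property-based alternative mentioned above, setting spark.sql.parquet.mergeSchema when the session is created (the application name is just a placeholder):

from pyspark.sql import SparkSession

# Enable Parquet schema merging globally, so every Parquet read
# reconciles the schemas of all files it touches.
spark = (
    SparkSession.builder
    .appName("cc-index-example")  # placeholder name
    .config("spark.sql.parquet.mergeSchema", "true")
    .getOrCreate()
)

# With the property set, the per-read option is no longer needed.
df = spark.read.parquet("s3://commoncrawl/cc-index/table/cc-main/warc/")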

Maybe we could also provide the complete schema to Spark, so that there is no need to extract the schema first from (one of) the Parquet files.
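As a sketch of that idea, an explicit schema can be passed to the reader, which skips schema inference and merging entirely. The field names below are only a small illustrative subset of the cc-index table columns, and the types are assumptions to be checked against the actual table definition:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

# Illustrative subset of the cc-index table columns; the real table
# has more fields, and the types should be verified against it.
cc_index_schema = StructType([
    StructField("url", StringType(), True),
    StructField("content_languages", StringType(), True),
    StructField("warc_filename", StringType(), True),
    StructField("warc_record_offset", IntegerType(), True),
    StructField("warc_record_length", IntegerType(), True),
])

# With an explicit schema there is no need to inspect any Parquet footers;
# files written before a column was added simply yield nulls for it.
df = spark.read.schema(cc_index_schema).parquet(
    "s3://commoncrawl/cc-index/table/cc-main/warc/")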

Thanks

sebastian-nagel commented 4 years ago

Thanks, @chk2817:

sebastian-nagel commented 3 years ago

Successfully tested #19 on a Spark cluster.