commoncrawl / cc-pyspark

Process Common Crawl data with Python and Spark
MIT License

Common Crawl Index Table - Need for Schema Merging to be documented #18

Closed chk2817 closed 3 years ago

chk2817 commented 4 years ago

Hi,

If someone needs to load the full index table (Parquet) into PySpark with all of the latest fields, it is necessary to set the Spark property spark.sql.parquet.mergeSchema to "true" or to use the following:

df = spark.read.option("mergeSchema", "true").parquet('s3://commoncrawl/cc-index/table/cc-main/warc/')

Without this, fields that were added at a later stage, such as content_languages, are not loaded into the Spark DataFrame.
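For reference, a minimal sketch of the property-based alternative mentioned above, setting spark.sql.parquet.mergeSchema when the session is created (the application name is just a placeholder):

from pyspark.sql import SparkSession

# Enable Parquet schema merging globally, so every Parquet read
# reconciles the schemas of all files it touches.
spark = (
    SparkSession.builder
    .appName("cc-index-example")  # placeholder name
    .config("spark.sql.parquet.mergeSchema", "true")
    .getOrCreate()
)

# With the property set, the per-read option is no longer needed.
df = spark.read.parquet("s3://commoncrawl/cc-index/table/cc-main/warc/")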

Maybe we could also provide the complete schema to Spark, so that there is no need to extract the schema first from (one of) the Parquet files.
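As a sketch of that idea, an explicit schema can be passed to the reader, which skips schema inference and merging entirely. The field names below are only a small illustrative subset of the cc-index table columns, and the types are assumptions to be checked against the actual table definition:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

# Illustrative subset of the cc-index table columns; the real table
# has more fields, and the types should be verified against it.
cc_index_schema = StructType([
    StructField("url", StringType(), True),
    StructField("content_languages", StringType(), True),
    StructField("warc_filename", StringType(), True),
    StructField("warc_record_offset", IntegerType(), True),
    StructField("warc_record_length", IntegerType(), True),
])

# With an explicit schema there is no need to inspect any Parquet footers;
# files written before a column was added simply yield nulls for it.
df = spark.read.schema(cc_index_schema).parquet(
    "s3://commoncrawl/cc-index/table/cc-main/warc/")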

Thanks

sebastian-nagel commented 4 years ago

Thanks, @chk2817:

sebastian-nagel commented 3 years ago

Successfully tested #19 on a Spark cluster.