Closed: chk2817 closed this issue 3 years ago
Hi,
if someone needs to load the full index table (Parquet files) into PySpark with all of the latest fields, the Spark property spark.sql.parquet.mergeSchema has to be set to "true", or the equivalent reader option used:
df = spark.read.option("mergeSchema", "true").parquet('s3://commoncrawl/cc-index/table/cc-main/warc/')
Without this, fields that were added at a later stage, such as content_languages, are not loaded into the Spark dataframe.
Maybe we could also provide the complete schema to Spark, so that there is no need to extract the schema initially from (one of) the Parquet files.
Thanks
Thanks, @chk2817:
Successfully tested #19 on a Spark cluster.