commoncrawl / cc-index-table

Index Common Crawl archives in tabular format
Apache License 2.0

Store column "fetch_time" as int64 #7

Closed sebastian-nagel closed 4 years ago

sebastian-nagel commented 4 years ago

The column "fetch_time" uses the Parquet int96 data type to store the capture time as Spark/Presto/etc. type "timestamp". Storing the timestamps as int64 would

In addition,

Setting the timestamp type is possible via spark.sql.parquet.outputTimestampType since Spark 2.3.0 (SPARK-10365).
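For reference, a minimal sketch of how this option could be set, assuming PySpark is available. The helper names `spark_submit_conf_args` and `build_session` and the application name are hypothetical and not part of cc-index-table. The Spark documentation lists `INT96`, `TIMESTAMP_MICROS` and `TIMESTAMP_MILLIS` as accepted values.

```python
# Option name as given in this issue; TIMESTAMP_MILLIS makes Spark write
# timestamps as int64 with millisecond precision.
PARQUET_TIMESTAMP_CONF = {
    "spark.sql.parquet.outputTimestampType": "TIMESTAMP_MILLIS",
}


def spark_submit_conf_args(conf=PARQUET_TIMESTAMP_CONF):
    """Render the settings as `--conf key=value` arguments for spark-submit."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


def build_session(app_name="fetch-time-int64-test"):
    """Create a SparkSession with the settings applied (hypothetical helper)."""
    # Imported lazily so the rest of the module works without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in PARQUET_TIMESTAMP_CONF.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


if __name__ == "__main__":
    # Print the flags that could be appended to a spark-submit command line.
    print(" ".join(spark_submit_conf_args()))
```

Passing the option on the command line and setting it on the session builder should be equivalent; which one fits depends on how the indexing job is launched.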

Millisecond precision should be enough. Although WARC/1.1 allows WARC-Dates with nanosecond precision, Common Crawl still follows the WARC/1.0 standard, partly because many WARC parsers fail on dates with higher precision.
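To illustrate why milliseconds are sufficient, here is a small sketch that converts a WARC/1.0-style date (seconds precision, UTC) into epoch milliseconds. The function name `warc_date_to_epoch_millis` and the sample date are made up for illustration; only the Python standard library is used.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def warc_date_to_epoch_millis(warc_date: str) -> int:
    """Parse a WARC/1.0 date such as '2020-01-17T12:00:00Z' into epoch milliseconds."""
    # WARC/1.0 dates have seconds precision and are always in UTC.
    ts = datetime.strptime(warc_date, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    # Integer division of timedeltas avoids floating-point rounding.
    return (ts - EPOCH) // timedelta(milliseconds=1)


if __name__ == "__main__":
    millis = warc_date_to_epoch_millis("2020-01-17T12:00:00Z")
    print(millis)
    # The value fits into a signed 64-bit integer ...
    print(-2**63 <= millis < 2**63)
    # ... and is a whole number of seconds, so no precision is lost.
    print(millis % 1000 == 0)
```

Note that this strict format string rejects a date with fractional seconds (it raises `ValueError`), which is the kind of parser failure on higher-precision dates mentioned above.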

It still needs to be tested whether Athena/Presto, Spark and Hive can process the int64 timestamps, and whether they allow columns with mixed data types together with schema merging.

sebastian-nagel commented 4 years ago

Generated a columnar index for testing using --conf spark.sql.parquet.outputTimestampType=TIMESTAMP_MILLIS. As expected, about 33% of the storage is saved on the fetch_time column. Testing was successful for Athena and Spark:

Users of Hive cannot read the fetch_time column for now. Reading all other columns works.

We'll switch to int64 starting with the January 2020 crawl (CC-MAIN-2020-05). The already existing parts may be updated in the future, together with further Parquet format improvements.
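The roughly 33% saving mentioned above matches the raw value widths: an int96 timestamp takes 12 bytes, an int64 timestamp 8 bytes. The sketch below encodes the same instant both ways using only the standard library. It assumes the legacy int96 layout (8 bytes nanoseconds of day followed by a 4-byte Julian day number, little-endian); the function names are hypothetical, and actual on-disk savings also depend on Parquet encoding and compression.

```python
import struct

JULIAN_DAY_OF_UNIX_EPOCH = 2440588  # Julian day number of 1970-01-01
MILLIS_PER_DAY = 86_400_000
NANOS_PER_MILLI = 1_000_000


def encode_int96(epoch_millis: int) -> bytes:
    """Legacy layout: nanoseconds of day (8 bytes) + Julian day (4 bytes)."""
    days, millis_of_day = divmod(epoch_millis, MILLIS_PER_DAY)
    return struct.pack("<qi", millis_of_day * NANOS_PER_MILLI,
                       days + JULIAN_DAY_OF_UNIX_EPOCH)


def decode_int96(raw: bytes) -> int:
    """Convert the 12-byte legacy value back to epoch milliseconds."""
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    days = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    return days * MILLIS_PER_DAY + nanos_of_day // NANOS_PER_MILLI


def encode_int64_millis(epoch_millis: int) -> bytes:
    """int64 timestamp: a single 8-byte count of milliseconds since the epoch."""
    return struct.pack("<q", epoch_millis)


def saving_fraction(old_width: int, new_width: int) -> float:
    """Fraction of raw bytes saved when going from old_width to new_width."""
    return 1 - new_width / old_width


if __name__ == "__main__":
    epoch_millis = 1579262400000  # 2020-01-17T12:00:00Z
    wide = encode_int96(epoch_millis)
    narrow = encode_int64_millis(epoch_millis)
    print(len(wide), len(narrow))
    print(decode_int96(wide) == epoch_millis)
    print(f"{saving_fraction(len(wide), len(narrow)):.1%}")
```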

sebastian-nagel commented 4 years ago

Verified that the column fetch_time in the January 2020 crawl (CC-MAIN-2020-05) is written as int64.