Store column "fetch_time" as int64

sebastian-nagel commented 4 years ago

The column "fetch_time" uses the Parquet int96 data type to store the capture time as Spark/Presto/etc. type "timestamp". Storing the timestamps as int64 would

make Parquet min/max stats available (thanks to @wumpus, @cldellow for figuring this out)
save space

In addition,

int96 timestamps have been deprecated (PARQUET-323)
int64 timestamps will become the default in Spark 3.0

Setting the timestamp type is possible via spark.sql.parquet.outputTimestampType since Spark 2.3.0 (SPARK-10365).

Milliseconds precision should be enough. Although, WARC/1.1 allows WARC-Dates with nanoseconds precision, Common Crawl still follows the WARC/1.0 standard, also because many WARC parsers fail on dates with higher precision.

Needs testing whether Athena/Presto, Spark and Hive can process the int64 timestamps and allow columns with mixed data types together with schema merging.

sebastian-nagel commented 4 years ago

Generated columnar index for testing using --conf spark.sql.parquet.outputTimestampType=TIMESTAMP_MILLIS. As expected about 33% storage are saved on the fetch_time column. Testing was successful for Athena and Spark:

Athena: int96 and int64 timestamps work both, including Presto datetime functions. Tested by a rather stupid query which selects 20 captures made on Sundays from a given domain:

SELECT url,
     fetch_time,
     day_of_week(fetch_time) as day_of_week,
     extract(hour FROM fetch_time) as hour
FROM "ccindex"."ccindex"
WHERE url_host_registered_domain = ...
AND crawl = ...
AND subset = 'warc'
AND day_of_week(fetch_time) = 7
LIMIT 20

Spark: int96 and int64 timestamps work both. Tested using cc-pyspark:
```
SELECT url, fetch_time FROM ccindex
WHERE url_host_registered_domain = ...
AND crawl = ...
AND subset = 'warc'
AND day_of_week(fetch_time) = 1
LIMIT 20
```
The column fetch time gets exported uniquely following the Spark config value of spark.sql.parquet.outputTimestampType.
Hive: reading int64 timestamps failed: Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.hive.serde2.io.TimestampWritable. See HIVE-21215.

Users of Hive cannot read the fetch_time column for now. Reading all other columns works.

We'll switch to int64 starting with the January 2020 crawl (CC-MAIN-2020-05). An update of already existing parts is eventually done in the future together with further Parquet format improvements.

sebastian-nagel commented 4 years ago

Verified that column fetch_time in January 2020 crawl (CC-MAIN-2020-05) is written as int64.

commoncrawl / cc-index-table

Store column "fetch_time" as int64 #7