Generated the columnar index for testing using --conf spark.sql.parquet.outputTimestampType=TIMESTAMP_MILLIS. As expected, about 33% of storage is saved on the fetch_time column. Testing was successful for Athena and Spark:
Athena: int96 and int64 timestamps both work, including Presto datetime functions. Tested with a rather stupid query which selects 20 captures made on Sundays from a given domain:
SELECT url,
fetch_time,
day_of_week(fetch_time) as day_of_week,
extract(hour FROM fetch_time) as hour
FROM "ccindex"."ccindex"
WHERE url_host_registered_domain = ...
AND crawl = ...
AND subset = 'warc'
AND day_of_week(fetch_time) = 7
LIMIT 20
Spark: int96 and int64 timestamps both work. Tested using cc-pyspark:
SELECT url, fetch_time FROM ccindex
WHERE url_host_registered_domain = ...
AND crawl = ...
AND subset = 'warc'
AND day_of_week(fetch_time) = 1
LIMIT 20
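For illustration, a minimal PySpark sketch of how such a query can be run against the table; the table path, domain, and crawl are placeholders, not the actual cc-pyspark job:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ccindex-timestamp-test").getOrCreate()

# Register the columnar index as a temporary view (path is a placeholder).
spark.read.parquet("/path/to/ccindex/table/").createOrReplaceTempView("ccindex")

# Same query as above, with placeholder filter values; fetch_time is read
# transparently whether it is stored as int96 or int64.
spark.sql("""
    SELECT url, fetch_time FROM ccindex
    WHERE url_host_registered_domain = 'example.com'
      AND crawl = 'CC-MAIN-2020-05'
      AND subset = 'warc'
      AND day_of_week(fetch_time) = 1
    LIMIT 20
""").show(truncate=False)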
The fetch_time column is exported solely according to the Spark config value of spark.sql.parquet.outputTimestampType.
Hive: reading int64 timestamps failed with exception: java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.hive.serde2.io.TimestampWritable. See HIVE-21215.
Users of Hive cannot read the fetch_time column for now. Reading all other columns works.
We'll switch to int64 starting with the January 2020 crawl (CC-MAIN-2020-05). Already existing parts will eventually be updated later on, together with further Parquet format improvements.
Verified that the column fetch_time in the January 2020 crawl (CC-MAIN-2020-05) is written as int64.
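One way to double-check the physical type, sketched here with pyarrow; the Parquet part file path is a placeholder:

import pyarrow.parquet as pq

# Inspect the schema of a single Parquet part file (placeholder path).
pf = pq.ParquetFile("/path/to/ccindex/part-00000.parquet")

# For int64/TIMESTAMP_MILLIS the Arrow type is reported as timestamp[ms];
# legacy int96 parts show up as timestamp[ns].
print(pf.schema_arrow.field("fetch_time").type)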
The column "fetch_time" uses the Parquet int96 data type to store the capture time as Spark/Presto/etc. type "timestamp". Storing the timestamps as int64 would
In addition, setting the timestamp type is possible via spark.sql.parquet.outputTimestampType since Spark 2.3.0 (SPARK-10365). Millisecond precision should be enough: although WARC/1.1 allows WARC-Dates with nanosecond precision, Common Crawl still follows the WARC/1.0 standard, also because many WARC parsers fail on dates with higher precision.
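For example, the option can be set when creating the Spark session. A sketch follows; app name and paths are placeholders, not the actual index-writing job:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("ccindex-write-int64")
    # write timestamp columns as Parquet int64 with millisecond precision
    # instead of the legacy int96 representation
    .config("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MILLIS")
    .getOrCreate())

# Any timestamp column (e.g. fetch_time) written from here on uses int64.
df = spark.read.parquet("/path/to/ccindex/old/")
df.write.parquet("/path/to/ccindex/int64/")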
Needs testing whether Athena/Presto, Spark, and Hive can process the int64 timestamps, and whether a column with mixed data types across files is handled when schemas are merged.
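A sketch of such a schema-merging test in PySpark, reading a mix of old (int96) and new (int64) parts from a placeholder path:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ccindex-merge-test").getOrCreate()

# mergeSchema asks Spark to reconcile per-file schemas into one table schema;
# the interesting case is fetch_time stored as int96 in some files and
# as int64 in others.
df = spark.read.option("mergeSchema", "true").parquet("/path/to/ccindex/mixed/")
df.printSchema()
df.select("fetch_time").show(5)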