commoncrawl / cc-index-table

Index Common Crawl archives in tabular format
Apache License 2.0

Replace int96 timestamps in index partitions before CC-MAIN-2020 #13

Open sebastian-nagel opened 2 years ago

sebastian-nagel commented 2 years ago

See #7 and the announcement of the January 2020 crawl.

Recent Parquet library versions (e.g. 1.12.2) complain about the int96 timestamps:

$> parquet-cli cat -c fetch_time -n 5 s3a://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2018-43/subset=warc/part-00247-f47c372a-e3d4-4f2b-b7a0-a939c04fd01e.c000.gz.parquet
Argument error: INT96 is deprecated. As interim enable READ_INT96_AS_FIXED  flag to read as byte array.

No complaints for data from 2020 and newer:

$> parquet-cli cat -c fetch_time -n 5 s3a://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2020-05/subset=warc/part-00243-2224c996-15d6-400a-8ae4-2d0740e74c18.c000.gz.parquet
1579483394000
1580078106000
1580035997000
1579264777000
1579422799000
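For reference, a minimal Python sketch of how an int96 timestamp relates to the int64 epoch-millisecond values printed above. It assumes the layout conventionally used by Impala, Hive and Spark: 12 bytes, little-endian, with the nanoseconds within the day in the first 8 bytes and the Julian day number in the last 4. The helper names `int96_to_epoch_millis` and `epoch_millis_to_int96` are hypothetical and not part of any Parquet library. Whether the older partitions follow exactly this layout should be verified against real data before relying on it.

```python
import struct
from datetime import datetime, timezone

# Julian day number of 1970-01-01, the Unix epoch.
JULIAN_DAY_OF_UNIX_EPOCH = 2440588
MILLIS_PER_DAY = 86_400_000
NANOS_PER_MILLI = 1_000_000


def int96_to_epoch_millis(raw: bytes) -> int:
    """Decode a 12-byte int96 timestamp (assumed layout, see above) to epoch millis."""
    if len(raw) != 12:
        raise ValueError(f"expected 12 bytes, got {len(raw)}")
    # "<qi": little-endian, 8-byte nanoseconds of day, then 4-byte Julian day.
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    days_since_epoch = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    return days_since_epoch * MILLIS_PER_DAY + nanos_of_day // NANOS_PER_MILLI


def epoch_millis_to_int96(millis: int) -> bytes:
    """Inverse of the decoder, used here only to build a test value."""
    days_since_epoch, millis_of_day = divmod(millis, MILLIS_PER_DAY)
    return struct.pack(
        "<qi",
        millis_of_day * NANOS_PER_MILLI,
        days_since_epoch + JULIAN_DAY_OF_UNIX_EPOCH,
    )


if __name__ == "__main__":
    # First fetch_time value from the CC-MAIN-2020-05 output above.
    millis = 1579483394000
    raw = epoch_millis_to_int96(millis)
    assert int96_to_epoch_millis(raw) == millis
    print(datetime.fromtimestamp(millis / 1000, tz=timezone.utc).isoformat())
```

Sub-millisecond precision is dropped by the integer division, which matches the millisecond resolution of the values shown for the 2020 data.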

Tasks: