commoncrawl / cc-index-table

Index Common Crawl archives in tabular format
Apache License 2.0

Upgrade to Spark 3.2.0 #10

Closed by sebastian-nagel 2 years ago

sebastian-nagel commented 2 years ago

... and increment version number (0.2 -> 0.3)

Spark 3.2.0 ships with an upgrade of parquet-mr to 1.12.1 (from 1.10.1). Notable improvements from the Parquet upgrade are:

Upgrading to Spark 3.2.0 makes it possible to explore whether these (and other) features can be used, provided that they are supported on Athena, Spark and Hive, or at least do not break use of the columnar index there.