exasol / cloud-storage-extension

Exasol Cloud Storage Extension for accessing formatted data (Avro, ORC, and Parquet) on public cloud storage systems
MIT License

Experiment with Spark as a unifying data reader #320

Open Shmuma opened 1 month ago

Shmuma commented 1 month ago

At first glance (to be confirmed by #319), we use Spark for the Delta data format, but lower-level readers for the other formats. One example is the parquet-io-java package, which we use for reading Parquet and which uses Avro under the hood: https://github.com/exasol/parquet-io-java.

However, Spark can also read Parquet and offers a unified interface (DataFrames) across formats. Using it could significantly simplify the code, and might also bring a speed-up and functionality we currently don't have (such as filter pushdown).
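To make the idea concrete, here is a minimal sketch of what a single Spark-based read path might look like. It is not the extension's current code: the `read` helper, the object name, the file path, and the column names `id` and `amount` are all made up for illustration. It assumes Spark SQL is on the classpath; `parquet` and `orc` are built-in sources, while `avro` and `delta` would additionally need the spark-avro and Delta Lake modules.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object UnifiedReaderSketch {

  // Hypothetical helper: one entry point for every format Spark can load.
  // "parquet" and "orc" work out of the box; "avro" and "delta" need extra modules.
  def read(spark: SparkSession, format: String, path: String): DataFrame =
    spark.read.format(format).load(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("unified-reader-sketch")
      .master("local[*]") // local mode, for experimenting only
      .getOrCreate()

    // Placeholder path and column names; replace with a real dataset.
    val df = read(spark, "parquet", "/tmp/example/data.parquet")
      .select("id", "amount")        // column pruning
      .filter(col("amount") > 100)   // candidate for filter pushdown

    // Prints the physical plan. For Parquet, the scan node should list the
    // predicate under PushedFilters if Spark was able to push it down.
    df.explain()

    spark.stop()
  }
}
```

Whether the pushdown actually pays off for our workloads, and how this would fit into the UDF execution model, is exactly what the experiment would need to show.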

To evaluate "Spark as a unified data reader", we need to understand the following:

Task outcome