At first glance (to be confirmed by #319), we use Spark only for the Delta data format; for the other data formats we use lower-level readers. One example is the parquet-io-java package, which handles Parquet reading and uses Avro under the hood: https://github.com/exasol/parquet-io-java.
However, Spark supports Parquet reading natively and provides a unified interface (DataFrames), which could significantly simplify the code, potentially speed up reads, and add functionality we currently lack (such as filter pushdown).
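As a rough sketch of what this unified interface could look like (the bucket path and column name are hypothetical, and `spark-sql` must be on the classpath):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object ParquetReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("parquet-read-sketch")
      .master("local[*]") // local mode, just for the experiment
      .getOrCreate()

    // Spark reads Parquet natively, no parquet-io-java / Avro layer required.
    // "s3a://my-bucket/data/" is a hypothetical location.
    val df = spark.read.parquet("s3a://my-bucket/data/")

    // The filter can be pushed down into the Parquet scan, so row groups
    // whose column statistics cannot match the predicate are skipped.
    df.filter(col("price") > 100).show()

    spark.stop()
  }
}
```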
To evaluate "Spark as a unified data reader", we need to answer the following questions:
- Which formats currently supported by cloud-storage-extension are also supported by Spark, and which are not?
- What are the performance implications of replacing our custom data readers with Spark DataFrames?
- Can we drop some extra dependencies after such a replacement, and what is the effect on the jar file size?
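For the dependency question, a `build.sbt` sketch (the version number and the parquet-io-java coordinates below are assumptions, not verified):

```scala
// Spark SQL bundles its own Parquet and Avro readers, so parquet-io-java
// (and the Avro version it pins) could potentially be dropped:
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.3.0" // assumed version
// libraryDependencies -= "com.exasol" % "parquet-io-java" % "..." // hypothetical coordinates
```

Note that spark-sql is itself a large dependency tree, so the net effect on the fat-jar size needs to be measured, not assumed.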
Task outcome:
- a list of formats supported by Spark
- an experimental implementation of one file format (Parquet) using Spark instead of the parquet-io-java package
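One possible shape for the experimental implementation (class and method names here are hypothetical; the real extension emits rows to Exasol, which is stubbed with a row iterator). The same code path would work for any DataFrame-supported format, which is the point of the experiment:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import scala.jdk.CollectionConverters._

// Hypothetical replacement for the parquet-io-java based reader.
class SparkParquetReader(spark: SparkSession) {
  def readRows(path: String): Iterator[Row] = {
    spark.read
      .parquet(path)
      // toLocalIterator streams partitions one at a time instead of
      // collecting the whole dataset into driver memory at once.
      .toLocalIterator()
      .asScala
  }
}
```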