akka / alpakka

Alpakka is a Reactive Enterprise Integration library for Java and Scala, based on Reactive Streams and Akka.
https://doc.akka.io/libraries/alpakka/current/

Delta Lake source and sink #2384

Open dlmutart opened 4 years ago

dlmutart commented 4 years ago

Short description

Delta Lake (https://delta.io/) provides a transactional storage layer on top of data lakes that could be used to stream data to and from S3 compatible storage.

Details

Currently, the AvroParquet writer provides no way to read and write Parquet, ORC, or other data formats on S3 compatible data lakes and generally assumes an HDFS file system. AvroParquetWriter cannot simply be combined with the S3 sink, as one assumes a file and the other assumes a byte stream. A preferred way to read and write streams as Parquet (or other formats such as ORC) would be to use Delta Lake. Delta Lake has a Java API that can be used to create the appropriate source and sink in Alpakka. We would implement something similar to what they have for Spark, but for Akka.

paualarco commented 4 years ago

@dlmutart you should be able to write to S3 right away using AvroParquet, without needing the S3 connector. In this link you can find the properties to set on the org.apache.hadoop.conf.Configuration that you pass to the AvroParquetWriter. I'm not saying that a Delta Lake connector would be a bad idea, though; there would probably be other useful cases for it.
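
To make the suggestion above a bit more concrete, here is a minimal sketch of the kind of settings involved. It only assembles the standard Hadoop S3A property names into a map and builds an `s3a://` path, so it runs without any Hadoop dependency. The endpoint, bucket, object key, and credentials are hypothetical placeholders, and the class and method names (`S3aParquetSettings`, `s3aProperties`, `s3aPath`) are made up for illustration. The exact set of properties needed depends on the Hadoop version and the S3-compatible store, so treat this as an assumption rather than a definitive list.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: collects Hadoop S3A settings as plain key/value pairs.
class S3aParquetSettings {

    // Property names are the standard Hadoop S3A keys; the values are caller-supplied.
    static Map<String, String> s3aProperties(String endpoint, String accessKey, String secretKey) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("fs.s3a.endpoint", endpoint);
        props.put("fs.s3a.access.key", accessKey);
        props.put("fs.s3a.secret.key", secretKey);
        // Path-style access is commonly needed for S3-compatible stores (assumption).
        props.put("fs.s3a.path.style.access", "true");
        return props;
    }

    // Builds the s3a:// URI string that a Hadoop Path would be created from.
    static String s3aPath(String bucket, String key) {
        return "s3a://" + bucket + "/" + key;
    }

    public static void main(String[] args) {
        // Placeholder values, not real credentials or hosts.
        Map<String, String> props =
            s3aProperties("http://localhost:9000", "my-access-key", "my-secret-key");

        // Print the settings, masking the secret.
        props.forEach((k, v) ->
            System.out.println(k + " = " + (k.equals("fs.s3a.secret.key") ? "****" : v)));
        System.out.println(s3aPath("my-bucket", "events/part-0001.parquet"));

        // With hadoop-aws and parquet-avro on the classpath, these would be applied
        // roughly as follows (not compiled here):
        //   Configuration conf = new Configuration();
        //   props.forEach(conf::set);
        //   AvroParquetWriter.<GenericRecord>builder(new Path(s3aPath(...)))
        //       .withSchema(schema).withConf(conf).build();
    }
}
```

The resulting `Configuration` is what would be handed to the AvroParquetWriter, so the writer targets the `s3a://` path directly instead of going through the S3 connector.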