Open dlmutart opened 4 years ago
@dlmutart you should be able to write directly to S3 using AvroParquet without needing the S3 connector.
You can find in this link the properties that you'd set on the org.apache.hadoop.conf.Configuration that you'll pass to the AvroParquetWriter.
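To illustrate the suggestion above, here is a rough sketch of writing Parquet to S3 by setting S3A filesystem properties on the Hadoop Configuration passed to AvroParquetWriter. This assumes parquet-avro and hadoop-aws are on the classpath; the bucket name, credentials, and schema are placeholders, not anything from this issue.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class S3AvroParquetSketch {
    public static void main(String[] args) throws Exception {
        // S3A properties go on the Hadoop Configuration that is handed
        // to the writer; no separate S3 connector is involved.
        Configuration conf = new Configuration();
        conf.set("fs.s3a.access.key", "<access-key>");    // placeholder
        conf.set("fs.s3a.secret.key", "<secret-key>");    // placeholder
        conf.set("fs.s3a.endpoint", "<s3-endpoint>");     // any S3-compatible endpoint

        // Minimal Avro schema for demonstration purposes only.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Doc\",\"fields\":"
            + "[{\"name\":\"id\",\"type\":\"string\"}]}");

        // The s3a:// path tells Hadoop to resolve the S3A filesystem
        // using the properties set above.
        try (ParquetWriter<GenericRecord> writer =
                 AvroParquetWriter.<GenericRecord>builder(
                         new Path("s3a://my-bucket/out/data.parquet"))
                     .withConf(conf)
                     .withSchema(schema)
                     .build()) {
            // Records would be written here, e.g. from an Alpakka
            // AvroParquet sink driving this writer.
        }
    }
}
```

This is configuration-level setup only; the actual record flow would come from the Alpakka AvroParquet stage.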
That said, having a Delta Lake connector would be a good idea; there are probably other useful cases where it would come in handy.
Short description
Delta Lake (https://delta.io/) provides a transactional storage layer on top of data lakes that could be used to stream data to and from S3 compatible storage.
Details
Currently, the AvroParquet writer provides no way to read and write Parquet, ORC, or other data formats on S3-compatible data lakes, and it generally assumes an HDFS file system. Combining AvroParquetWriter with the S3 sink is incompatible, as one assumes a file and the other assumes a byte stream. A preferred way to read and write streams of Parquet (or other formats such as ORC) would be to use Delta Lake. Delta Lake has a Java API that can be used to create the appropriate source and sink in Alpakka. We would implement something similar to what they have for Spark, but for Akka.
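To make the proposal concrete, here is a rough sketch of what the source side might build on, assuming the delta-standalone library (io.delta:delta-standalone) as the "Delta Lake Java API" mentioned above. The table path and credentials are placeholders; the Alpakka wrapping itself is hypothetical and not implemented anywhere yet.

```java
import io.delta.standalone.DeltaLog;
import io.delta.standalone.Snapshot;
import org.apache.hadoop.conf.Configuration;

public class DeltaLakeSourceSketch {
    public static void main(String[] args) {
        // S3A credentials on the Hadoop Configuration, as for the
        // AvroParquetWriter case; values are placeholders.
        Configuration conf = new Configuration();
        conf.set("fs.s3a.access.key", "<access-key>");
        conf.set("fs.s3a.secret.key", "<secret-key>");

        // DeltaLog exposes the table's transactional state; a
        // hypothetical Alpakka source could list the data files of a
        // snapshot and stream their Parquet records downstream.
        DeltaLog log = DeltaLog.forTable(conf, "s3a://my-bucket/my-table");
        Snapshot snapshot = log.snapshot();
        snapshot.getAllFiles()
                .forEach(f -> System.out.println(f.getPath()));
    }
}
```

A sink would similarly use DeltaLog.startTransaction() to commit new Parquet files atomically, which is what gives Delta Lake its transactional guarantees over plain S3 writes.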