Closed mayanksingh2298 closed 2 months ago
I couldn't find anything. The easiest route is to use the Hudi or Iceberg Kafka sink to write into S3 and then use Apache XTable to convert the table to Delta Lake.
It's very roundabout, but it seems Delta isn't investing in this area.
The documentation for writing to S3 is in the "Writing to S3" section of the README.
The short answer: set the environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`, and additionally set `AWS_S3_LOCKING_PROVIDER` to `dynamodb`. Then it should work. There are more environment variables for configuring AWS connectivity that are not covered in the README; you can find them in the object store code.
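As a sketch, the variables above can be set from the process environment before launching ingestion. The key values and the region variable here are placeholders/assumptions, not values from the README:

```python
import os

# Static credentials for the AWS SDK / object store layer (placeholder
# values; an instance profile or SSO session would make these unnecessary).
os.environ["AWS_ACCESS_KEY_ID"] = "AKIA_PLACEHOLDER"
os.environ["AWS_SECRET_ACCESS_KEY"] = "SECRET_PLACEHOLDER"
os.environ["AWS_REGION"] = "us-east-1"  # assumption: a region is also needed

# Enable the DynamoDB-based locking provider so concurrent S3 writers
# don't corrupt the Delta transaction log.
os.environ["AWS_S3_LOCKING_PROVIDER"] = "dynamodb"
```

In a real deployment these would typically come from a secrets manager or the container environment rather than being hard-coded.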
Partitioning by date is something you configure when creating the table. This project has no mechanism for creating a new Delta table; however, it will respect any settings covered by delta-rs.
Not being able to create tables is unfortunate.
Iceberg's and Hudi's Kafka sinks can create tables, insert, upsert, and perform full CRUD operations.
Delta Lake has a Kafka connector, and Spark is also an option; this project is neither of those things. Nonetheless, delta-rs is a building block that makes this possible.
I have only seen a closed-source version of the Delta Lake Kafka sink, from Confluent (https://docs.confluent.io/kafka-connectors/databricks-delta-lake-sink/current/overview.html#databricks-delta-lake-sink-connector-cp). Is that what you're referring to?
That is what I was thinking of. Given that, there is probably space for an open-source connector. I have experimented with a Kubernetes operator that does what we're talking about. I recommend raising the discussion in the Delta users Slack, as there may be more interest in the topic there. It's not a bad idea; it just wasn't the original intention of this particular project.
Where/how do I set the access keys and secret keys to enable writing data to S3?
How do I partition data by date? Is there any documentation for this project?