delta-io / kafka-delta-ingest

A highly efficient daemon for streaming data from Kafka into Delta Lake
Apache License 2.0

Documentation to write data to S3 #166

Closed mayanksingh2298 closed 2 months ago

mayanksingh2298 commented 5 months ago

Where / how do I set the access keys and secret keys to enable writing data to S3?

How do I partition data by date? Is there any documentation for this project?

alberttwong commented 3 months ago

I couldn't find anything. The easiest workaround is to use the Hudi or Iceberg Kafka sink to write into S3 and then use Apache XTable to convert it to Delta Lake.

It's very roundabout, but it seems like Delta isn't investing in this area.

mightyshazam commented 2 months ago

The documentation for writing to S3 is in the Writing to S3 section of the README. The short answer is to set the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. Additionally, set AWS_S3_LOCKING_PROVIDER to dynamodb. Then it should work. There are more environment variables for enabling AWS connectivity that are not covered in the README; you can find them in the object store code.
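Putting that together, the environment for an S3-backed run might look like the sketch below. The region, key values, and lock-table name are placeholders, and the DYNAMO_LOCK_TABLE_NAME variable name follows the delta-rs convention; verify the exact variable names against the README and object store code for your version.

```shell
# Credentials for the bucket holding the Delta table (placeholder values)
export AWS_ACCESS_KEY_ID="<your-access-key-id>"
export AWS_SECRET_ACCESS_KEY="<your-secret-access-key>"
export AWS_REGION="us-east-1"  # placeholder region

# Required so concurrent writers coordinate commits to the Delta log on S3
export AWS_S3_LOCKING_PROVIDER="dynamodb"

# Name of the DynamoDB lock table (assumed variable name from delta-rs; verify for your version)
export DYNAMO_LOCK_TABLE_NAME="delta_log_lock"
```

With these set, the daemon can be pointed at an `s3://` table URI; without the locking provider, concurrent S3 writes are unsafe.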

mightyshazam commented 2 months ago

Partitioning by date is configured when the table is created. This project has no mechanism for creating a new Delta table; however, it will respect any settings covered by delta-rs.

alberttwong commented 2 months ago

Not being able to create tables is unfortunate.

The Iceberg and Hudi Kafka sinks can create tables, insert, upsert, and do full CRUD operations.

mightyshazam commented 2 months ago

> Not being able to create tables is unfortunate.
>
> The Iceberg and Hudi Kafka sinks can create tables, insert, upsert, and do full CRUD operations.

Delta Lake has a Kafka connector. Spark is also an option. This project is neither of those things. Nonetheless, delta-rs is a building block to make this possible.

alberttwong commented 2 months ago

I have only seen a closed-source version of the Delta Lake Kafka sink, from Confluent (https://docs.confluent.io/kafka-connectors/databricks-delta-lake-sink/current/overview.html#databricks-delta-lake-sink-connector-cp). Is that what you're referring to?

mightyshazam commented 2 months ago

That is what I was thinking of. Given that, there is probably space for an open-source connector. I have experimented with a Kubernetes operator to do what we're talking about. I recommend raising the discussion in the Delta users Slack, because there may be broader interest in the topic. It's not a bad idea; it just wasn't the original intention of this particular project.