delta-io / kafka-delta-ingest

A highly efficient daemon for streaming data from Kafka into Delta Lake
Apache License 2.0
337 stars 72 forks

Enable writes to Azure #114

Closed thovoll closed 1 year ago

thovoll commented 2 years ago

@xianwill and I paired on parts of this, thanks for your help Christian!

This PR upgrades kafka-delta-ingest to a newer version of delta-rs that includes Azure support. Various libraries were upgraded as needed. A Dockerfile was added and the README updated to allow execution on Windows. The `deprecated` lint is temporarily allowed to avoid having to rewrite the clap code in the main module.

xianwill commented 2 years ago

@thovoll - oh snap. Test helper compile is failing because parquet::util has been made private between versions. Fortunately, looks like SliceableCursor is exported publicly so you should be able to fix this by changing the use.

thovoll commented 2 years ago

Thanks @xianwill, fixing this ASAP.

thovoll commented 2 years ago

@mosyp would you be able to take a look at the failing integration tests? At least the coercion and dlq tests are confirmed failing. OTOH, we have the kdi process running fine in the Debian docker image. Possibly something going on with the integration test setup?

mightyshazam commented 1 year ago

> @mosyp would you be able to take a look at the failing integration tests? At least the coercion and dlq tests are confirmed failing. OTOH, we have the kdi process running fine in the Debian docker image. Possibly something going on with the integration test setup?

The tests are failing because of the upgrades to parquet, arrow, and delta-rs. The `ArrowWriter` implementation has changed. In the old version, it flushed all of the records on every call to the `write` method. In the version in this PR, the writer buffers rows in memory until it has accumulated `max_row_group_size` of them; only then, or when the file is closed, does it flush. The code in `lib.rs` that decides whether to complete the batch based on file size looks at the size of the cursor passed to the `ArrowWriter`, and that cursor contains only a few bytes until the flush happens.

Two ways around this: set `max_row_group_size` to `max_messages_per_batch` from `IngestionOptions`, or modify the writer in arrow-rs to expose the size of the buffered data.
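To make the failure mode concrete, here is a minimal std-only sketch (not the actual arrow-rs `ArrowWriter` API; the struct and field names are invented for illustration) of a writer that buffers rows until `max_row_group_size` is reached, showing why a size check on the underlying cursor reads near zero before the flush:

```rust
// Illustrative sketch only: mimics the buffering behavior described above,
// not the real parquet/arrow writer types.
struct BufferingWriter {
    max_row_group_size: usize,
    buffered_rows: Vec<String>,
    sink: Vec<u8>, // stands in for the cursor whose size lib.rs inspects
}

impl BufferingWriter {
    fn new(max_row_group_size: usize) -> Self {
        Self {
            max_row_group_size,
            buffered_rows: Vec::new(),
            sink: Vec::new(),
        }
    }

    // Rows are only buffered here; bytes reach the sink on flush.
    fn write(&mut self, row: &str) {
        self.buffered_rows.push(row.to_string());
        if self.buffered_rows.len() >= self.max_row_group_size {
            self.flush();
        }
    }

    fn flush(&mut self) {
        for row in self.buffered_rows.drain(..) {
            self.sink.extend_from_slice(row.as_bytes());
        }
    }

    // Closing forces a final flush, like closing the parquet file.
    fn close(&mut self) -> usize {
        self.flush();
        self.sink.len()
    }
}

fn main() {
    let mut w = BufferingWriter::new(100);
    for i in 0..10 {
        w.write(&format!("record-{i}"));
    }
    // Ten rows written, but the size check still sees an empty sink
    // because no flush has happened yet.
    println!("sink bytes before flush: {}", w.sink.len()); // prints 0
    println!("sink bytes after close: {}", w.close()); // prints 80
}
```

This is why the batch-completion check in `lib.rs` never fires: the cursor it measures stays tiny until the row-group flush. The first workaround above (setting `max_row_group_size` equal to `max_messages_per_batch`) makes the flush coincide with the batch boundary, so the size check sees real bytes again.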

rtyler commented 1 year ago

Closing this in favor of https://github.com/delta-io/kafka-delta-ingest/pull/136 which @mightyshazam and I are collaborating on.