Closed thovoll closed 1 year ago
@thovoll - oh snap. Test helper compile is failing because parquet::util
has been made private between versions. Fortunately, looks like SliceableCursor
is exported publicly so you should be able to fix this by changing the use
.
Thanks @xianwill, fixing this ASAP.
@mosyp would you be able to take a look at the failing integration tests? At least the coercion and dlq tests are confirmed failing. OTOH, we have the kdi process running fine in the Debian docker image. Possibly something going on with the integration test setup?
@mosyp would you be able to take a look at the failing integration tests? At least the coercion and dlq tests are confirmed failing. OTOH, we have the kdi process running fine in the Debian docker image. Possibly something going on with the integration test setup?
The reason the tests are failing is because of the upgrades to parquet, arrow and delta-rs. The ArrowWriter implementation has changed. In the old version, it would flush all of the records when calling the write method. In the version in this PR, the writer buffers the rows in memory until it reaches max_row_group_size
buffered rows. Once that happens, or someone closes the file, it flushes.
The code in lib.rs that checks whether it should complete the batch based on file size looks at the size of the cursor passed to the arrow writer. This cursor only contains a few bytes until the flush happens.
One way around this is to set max_row_group_size
to max_messages_per_batch
from IngestionOptions
. Another is to modify the writer in the arrow-rs
to expose the size of the buffered data.
Closing this in favor of https://github.com/delta-io/kafka-delta-ingest/pull/136 which @mightyshazam and I are collaborating on.
@xianwill and I paired on parts of this, thanks for your help Christian!
This PR upgrades kafka-delta-ingest to a newer version of delta-rs that includes Azure support. Various libraries were upgraded as needed. Dockerfile added and README updated to allow execution on Windows. Deprecated temporarily allowed to avoid having to rewrite the clap code in the main module.