delta-io / kafka-delta-ingest

A highly efficient daemon for streaming data from Kafka into Delta Lake
Apache License 2.0

no support for gzip? #89

bbigras closed this issue 2 years ago

bbigras commented 2 years ago

Here's a test case with redpanda and vector.

git clone https://github.com/bbigras/test-kdi-gzip

docker-compose up -d
# start kdi
# add a couple of lines to the `log` file
# wait for one "Delta write for version x has completed in x millis"
# stop kdi

# uncomment "compression" in vector.toml
docker-compose up -d --force-recreate

# start kdi again

Run kdi with:

target/release/kafka-delta-ingest ingest my-topic ~/checkout_folder/delta \
  --checkpoints \
  -l 5 \
  --max_messages_per_batch 2 \
  --kafka 127.0.0.1:9092 \
  -K "auto.offset.reset=earliest" \
  -t \
      'date: substr(timestamp, `0`, `10`)' \
      'message: message' \
      'timestamp: timestamp' \
      'meta.kafka.offset: kafka.offset' \
      'meta.kafka.partition: kafka.partition' \
      'meta.kafka.topic: kafka.topic'

and you'll get:

[2021-10-22T20:06:43Z ERROR kafka_delta_ingest] Error getting BorrowedMessage while processing stream KafkaError (Message consumption error: NotImplemented (Local: Not implemented))

houqp commented 2 years ago

Looking at the upstream code, the libz feature should be enabled by default, so perhaps it has something to do with how we are using the Kafka client: https://github.com/fede1024/rust-rdkafka/blob/37ba1d22e24a948bc4f0c0f8a609390b24108e1f/Cargo.toml#L44

bbigras commented 2 years ago

The problem was that I didn't have zlib in my path while building kafka-delta-ingest.

I figured it out by setting compression.codec to gzip. I think this option only applies to writes (not reads), but kafka-delta-ingest then complained about the missing zlib at runtime.
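
For anyone hitting the same error, a pre-build check along these lines might catch a missing zlib before the build rather than at runtime. It is only a sketch: the `check_zlib` function name is made up for illustration, and it assumes `pkg-config` is installed and that your distribution registers zlib under the package name `zlib`. A build can also locate zlib by other means, so a failure here is a hint, not proof.

```shell
# Hypothetical helper (not part of kafka-delta-ingest): reports whether
# zlib's development files are discoverable through pkg-config.
check_zlib() {
  # `pkg-config --exists` exits 0 only when the named package is found.
  if pkg-config --exists zlib 2>/dev/null; then
    echo "zlib found: $(pkg-config --modversion zlib)"
  else
    # Covers both a missing zlib and a missing pkg-config binary.
    echo "zlib not found via pkg-config" >&2
    return 1
  fi
}
```

It could then be chained in front of the build, for example `check_zlib && cargo build --release`, assuming that is how you build the binary.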