Closed aksestok closed 1 month ago
@aksestok interesting! 90% of the code is there to avoid the situation above... but you are probably right. It seems subscribe will attach to the current offsets of the topics. It also looks like our approach will prevent the automatic load balancing with subscriber groups, but I think that is OK.
@IlyaFaer please take a look! Also, I think we are missing a test where we re-read the messages in the queue by discarding the state and trying again. Could you add it? Please test locally, since our Kafka cloud is defunct.
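The re-read test suggested above can be simulated without a live broker. Below is a minimal pure-Python sketch of the idea; all names here are hypothetical stand-ins, not the source's actual helpers: consume once, discard the incremental state, and verify the same messages come back.

```python
def consume(messages, state):
    """Return messages past the tracked offset and advance the tracker.

    `messages` stands in for a Kafka topic partition; `state` mimics the
    pipeline's persisted source state (hypothetical shape, a single offset).
    """
    start = state.get("offset", 0)
    batch = messages[start:]
    state["offset"] = len(messages)
    return batch


topic = ["msg-0", "msg-1", "msg-2"]
state = {}

first = consume(topic, state)       # initial run reads everything
assert consume(topic, state) == []  # tracked state prevents re-reads

state.clear()                       # discard the state, as the test would
second = consume(topic, state)      # messages are re-read from the start
assert first == second == topic
```

A real test would do the same through the pipeline: run, wipe the pipeline state, run again, and assert the row counts match.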
@aksestok, I wonder what destination you are using?
@IlyaFaer
We use BigQuery (also noted in the issue description).
@aksestok, oh, yeah, it is. Let's see...
dlt version
0.4.7
Source name
kafka
Describe the problem
When using the Kafka source, if the extraction is interrupted for any reason, messages consumed in the initial run are not reprocessed and are lost forever.
My understanding is that this happens because the assignment made in the offset tracker is undone by the Consumer.subscribe call immediately after. Commenting out the line in question (line 82, consumer.subscribe(topics)) seems to resolve the issue.
Expected behavior
The source tracks its own state and should not rely on offsets provided by the broker?
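The failure mode can be illustrated without Kafka at all. Here is a speculative pure-Python model (none of these names come from the source) of the two offset-resolution policies: trusting broker-committed offsets, which a plain subscribe effectively does, versus trusting the source's own tracked state.

```python
def resume_offset(tracked, broker_committed):
    """Pick the partition offset to resume from.

    Preferring the source's own tracked offset re-reads messages that were
    consumed but never loaded; preferring the broker's committed offset
    silently skips (loses) them.
    """
    return tracked if tracked is not None else broker_committed


# An interrupted run: the broker saw commits up to offset 100, but the
# pipeline only managed to load messages up to offset 40 before aborting.
broker_committed = 100
tracked_by_source = 40

safe = resume_offset(tracked_by_source, broker_committed)  # resumes at 40
lossy = resume_offset(None, broker_committed)              # resumes at 100

assert safe == 40    # messages 40..99 are re-read
assert lossy == 100  # messages 40..99 are lost
```

With the real client, the usual way to get the "safe" behavior while still subscribing is to pass an on_assign callback to subscribe and seek the assigned partitions to the tracked offsets there, rather than relying on the broker's committed positions.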
Steps to reproduce
Run a Kafka pipeline and abort during extraction. Restart the pipeline and watch all the precious messages be gone forever.
How are you using the source?
I run this source in production.
Operating system
Linux
Runtime environment
Kubernetes
Python version
3.12
dlt destination
bigquery
Additional information
No response