Closed forsberg closed 1 year ago
Inspecting the code, line 692 seems very suspicious. I guess it tries to compensate for line 644.
In _seek_offsets we see that the code handles the case where earliest_offset is 0, but it does not handle the case where the earliest offset in Kafka is > 0 and the offset variable is set to one less than that.
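To make the suspected off-by-one concrete, here is a minimal sketch. It is not the Faust code: `naive_seek_target` and `clamped_seek_target` are hypothetical helpers, and the only assumption is that a stored offset of -1 is what "one less than an earliest offset of 0" looks like. The numbers are the ones reported for partition 9 in this issue.

```python
def naive_seek_target(offset: int) -> int:
    """Hypothetical: only special-cases -1, i.e. an earliest offset of 0."""
    if offset == -1:
        return 0
    return offset


def clamped_seek_target(offset: int, earliest: int) -> int:
    """Hypothetical: never seek below the earliest offset Kafka still has."""
    return max(offset, earliest)


earliest = 18457042        # oldest offset reported by Kafka metrics
offset = earliest - 1      # the offset the worker tries to reset to

# The naive version returns a position below the earliest available offset,
# which the broker would reject as out of range.
print(naive_seek_target(offset) < earliest)
# The clamped version lands exactly on the earliest available offset.
print(clamped_seek_target(offset, earliest) == earliest)
```

Clamping covers both cases with one rule: an earliest offset of 0 with a stored -1, and an earliest offset > 0 with a stored value one below it.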
Note: The same behaviour occurs when scaling up on an empty disk, i.e. with no previous RocksDB state.
Hello, I had a similar error and described it in #176. My application is trying to restore an offset that no longer exists due to the segment time.
Have you already found a solution to this problem?
Same issue here. @forsberg did your fix solve the issue?
The code from https://github.com/forsberg/faust-streaming/commit/26cb488add47759df68c99842674a54d72f322a1 has been merged in at this point by db6a3ae28ace1112132ef5dc07f4cf50afcdc427, so I think this issue is safe to close unless someone can provide a reproducible example of it still persisting.
Checklist
master branch of Faust.
Steps to reproduce
Faust application that uses a Table with Tumbling Window.
Scaling up from 4 to 6 workers, where the two newly started workers already have a RocksDB on disk, but the offset committed in RocksDB is lower than the earliest available offset in Kafka, i.e. triggering the need to read the full changelog.
Expected behavior
Full changelog is read into RocksDB and application starts processing data.
Actual behavior
Application gets stuck doing nothing, waiting for data on the changelog partition.
In the example below, the oldest offset for partition 9 as reported by Kafka metrics is 18457042, which is exactly one more than the offset Faust is trying to reset to. This makes me suspect an off-by-one error.
Inspecting the code in aiokafka that emits the error message about the fetch offset being out of range, I'm guessing that the reset strategy is set to latest, which is the default, something that will lead to Faust not getting any messages and getting stuck.
The above message will repeat forever.
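The effect of the reset strategy can be illustrated with a small model. This is a sketch of the general Kafka reset semantics, not aiokafka's implementation: `resolve_position` and `records_to_read` are hypothetical helpers, and the log end offset of 18460000 is an invented value for illustration. In aiokafka the strategy is chosen with the `auto_offset_reset` consumer argument.

```python
def resolve_position(requested: int, earliest: int, log_end: int, policy: str) -> int:
    """Hypothetical model: where the consumer ends up after a fetch at `requested`."""
    if earliest <= requested <= log_end:
        return requested          # offset is in range, no reset needed
    if policy == "earliest":
        return earliest           # restart from the oldest retained record
    if policy == "latest":
        return log_end            # jump to the end and wait for new records
    raise ValueError(f"unsupported reset policy: {policy}")


def records_to_read(position: int, log_end: int) -> int:
    """Number of already written records the consumer will still receive."""
    return max(log_end - position, 0)


earliest = 18457042      # from the Kafka metrics in this issue
requested = 18457041     # the offset Faust tries to reset to
log_end = 18460000       # assumed value, for illustration only

for policy in ("latest", "earliest"):
    position = resolve_position(requested, earliest, log_end, policy)
    print(policy, position, records_to_read(position, log_end))
```

In this model, "latest" leaves nothing to read from the existing changelog, so recovery waits for records that may never arrive, while "earliest" would replay the full retained changelog.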
There is also a different behaviour that I think happens when one of the partitions can be fetched but not the other, in which case it gets stuck repeating the following message:
Versions