Closed lispc closed 3 years ago
I think the rollup process updates the offset incorrectly.
The message at offset 0 must be RegisterUser, so I suspect the offset is being updated incorrectly.
reproduce:
I have reproduced the issue on my side, diving in ...
Well, it was my own fault that rollup_state_manager crashed: I had put the messages triggered by two calls of tick.ts into the message queue, and the conflicting input (duplicated registration, unmatched balances, etc.) caused an assertion failure inside rollup_state_manager.
However, I failed to find any non-continuous messages or abnormal offsets. For a "valid" message queue (triggered by calling tick.ts only once), rollup_state_manager can always replay it correctly.
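To illustrate why feeding the output of two tick.ts runs into one queue produces conflicting input, here is a minimal sketch. The message shape (a dict with "type" and "user_id"), the "Deposit" type, and the helper name are assumptions made for illustration only; they are not the real message format used by rollup_state_manager.

```python
def find_duplicate_registrations(messages):
    """Return (offset, user_id) for each registration that repeats an earlier user_id."""
    seen = set()
    duplicates = []
    for offset, msg in enumerate(messages):
        if msg["type"] != "RegisterUser":
            continue
        if msg["user_id"] in seen:
            duplicates.append((offset, msg["user_id"]))
        seen.add(msg["user_id"])
    return duplicates


# Hypothetical output of a single tick.ts run (illustrative shape, not the real one).
one_run = [
    {"type": "RegisterUser", "user_id": 1},
    {"type": "RegisterUser", "user_id": 2},
    {"type": "Deposit", "user_id": 1, "amount": 100},
]

print(find_duplicate_registrations(one_run))            # single run: no conflicts
print(find_duplicate_registrations(one_run + one_run))  # two runs in one queue
```

With two runs concatenated, the second pair of registrations shows up as duplicates, which is the kind of input that would trip an assertion on the consumer side.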
My 'minimal' playground is built like this:
After rollup_state_manager has read and processed all the messages it received, I can erase some of its dump records and restart it; the program always replays the message queue correctly, starting from the offset specified in the latest dump left.
For example, when we set up rollup_state_manager and process about 600 messages in Kafka, rollup_state_manager dumps records at 20, 40, 60 and 80, each in its own directory.
Then I stop rollup_state_manager, erase some of the dump records, say 60.db and 80.db, and restart. rollup_state_manager correctly starts at the offset stored in the 40.db record and handles the remaining messages again without any errors.
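The restart experiment above boils down to "pick the newest dump that still exists and resume from the offset it recorded". A rough sketch of that selection logic follows; the dict layout, the offsets, and the function name are made up for illustration and do not reflect how rollup_state_manager actually stores its dumps.

```python
def pick_resume_point(dumps):
    """Pick the newest remaining dump.

    `dumps` maps a dump id (e.g. 40 for 40.db) to the Kafka offset recorded in it.
    Returns (dump_id, recorded_offset), or None if no dump is left and the
    whole queue has to be replayed from the start.
    """
    if not dumps:
        return None
    latest = max(dumps)
    return latest, dumps[latest]


# Illustrative values only: dump id -> offset recorded in that dump.
dumps = {20: 150, 40: 300, 60: 450, 80: 600}

# Erase 60.db and 80.db, as in the experiment described above.
for erased in (60, 80):
    del dumps[erased]

print(pick_resume_point(dumps))  # the 40.db record is now the newest one
```

Everything after the recorded offset is then replayed, which is why erasing the later dumps only costs time and should not change the final state.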
After getting rid of the other factor that also causes rollup manager to crash (see #133), I can finally reproduce the non-continuous message receiving issue.
I also found that the issue is not deterministic: simply running rollup manager again lets the program get past the discontinuous offset and then run smoothly. See the attachment: in step2_fail.log the program throws an assertion failure because it receives the message at offset 3865 right after 3863. In step3_pass.log I run the program again, and it receives the message at offset 3864 and then keeps handling messages.
It seems the issue arises when there are tons of messages lying in the Kafka topic waiting to be read and the message processing thread in rollup manager runs too fast?
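For anyone who wants to scan a log for the same symptom, the continuity condition (each offset must be exactly the previous one plus 1) can be checked offline with something like the sketch below. The function name is hypothetical, and this is not the assertion code inside rollup manager, just the same condition applied to a list of received offsets.

```python
def find_offset_gaps(offsets):
    """Return (previous, received) pairs where received != previous + 1."""
    gaps = []
    previous = None
    for offset in offsets:
        if previous is not None and offset != previous + 1:
            gaps.append((previous, offset))
        previous = offset
    return gaps


print(find_offset_gaps([3863, 3865]))  # pattern from step2_fail.log: 3864 is skipped
print(find_offset_gaps([3863, 3864]))  # pattern from step3_pass.log: continuous
```

The input list is expected to hold the offsets in the order the consumer delivered them, so a reordered message is reported as well as a skipped one.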
After adding https://github.com/Fluidex/rollup-state-manager/pull/117, we found that the messages received by the consumer are not continuous: offset 419, then offset 420, then offset 477, and so on.
There must be something wrong with the consumer.
You can use https://github.com/Fluidex/fluidex-backend as the dev env (bash run.sh).