fluidex / rollup-state-manager


bug: kafka skip msgs... #118

Closed · lispc closed this issue 3 years ago

lispc commented 3 years ago

After adding https://github.com/Fluidex/rollup-state-manager/pull/117, we found that the messages received by the consumer are not contiguous: a message at offset 419, then one at offset 420, then one at offset 477, and so on.

There must be something wrong with the consumer.

You can use https://github.com/Fluidex/fluidex-backend as the dev env. ( bash run.sh )
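
For reference, the symptom is the processing loop seeing offsets 419, 420 and then 477. Below is a minimal sketch of the kind of continuity check that catches this, using the rdkafka crate; the broker address, group id and topic name are assumptions for the dev env, not the actual rollup-state-manager code:

```rust
// Hypothetical sketch: poll a topic and assert that offsets arrive contiguously.
// Assumes a single-partition topic, as in the dev env.
use std::time::Duration;

use rdkafka::config::ClientConfig;
use rdkafka::consumer::{BaseConsumer, Consumer};
use rdkafka::Message;

fn main() {
    let consumer: BaseConsumer = ClientConfig::new()
        .set("bootstrap.servers", "127.0.0.1:9092") // assumption: local dev broker
        .set("group.id", "offset-continuity-check") // hypothetical group id
        .set("auto.offset.reset", "earliest")
        .create()
        .expect("consumer creation failed");

    consumer
        .subscribe(&["unifyevents"]) // topic name is an assumption
        .expect("subscribe failed");

    let mut last_offset: Option<i64> = None;
    loop {
        match consumer.poll(Duration::from_secs(1)) {
            Some(Ok(msg)) => {
                let offset = msg.offset();
                if let Some(prev) = last_offset {
                    // This is the invariant the logs show being violated
                    // (e.g. offset 477 arriving right after 420).
                    assert_eq!(offset, prev + 1, "offset skipped: {} -> {}", prev, offset);
                }
                last_offset = Some(offset);
            }
            Some(Err(e)) => eprintln!("kafka error: {:?}", e),
            None => {} // nothing received within the poll timeout
        }
    }
}
```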

lispc commented 3 years ago

(screenshot attachment: telegram-cloud-photo-size-5-6296523447685197248-y)

I think the rollup process updates the offset incorrectly.

The message at offset 0 must be RegisterUser, so I think the offset being resumed from must have been updated incorrectly.
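
If that guess is right, the resume logic would look roughly like the sketch below. This is purely illustrative (hypothetical function name, broker, group id and topic, using the rdkafka crate), showing how a mix-up between "last processed offset" and "offset to resume from" could make messages disappear after a restart:

```rust
// Purely illustrative: resuming a consumer from a persisted offset.
// If the value stored on disk is "offset of the last processed message" but it is
// later treated as "offset to start reading from" (or vice versa), restarts will
// silently skip or replay a message; this is one guess at the class of bug above.
use rdkafka::config::ClientConfig;
use rdkafka::consumer::{BaseConsumer, Consumer};
use rdkafka::topic_partition_list::{Offset, TopicPartitionList};

fn resume_consumer(last_processed: i64) -> BaseConsumer {
    let consumer: BaseConsumer = ClientConfig::new()
        .set("bootstrap.servers", "127.0.0.1:9092") // assumption: local dev broker
        .set("group.id", "rollup-replay")           // hypothetical group id
        .create()
        .expect("consumer creation failed");

    let mut tpl = TopicPartitionList::new();
    // Resume at last_processed + 1. Persisting an offset that is already
    // "one past" and then adding 1 again would skip a message, which would
    // match the first message after restart not being RegisterUser.
    tpl.add_partition_offset("unifyevents", 0, Offset::Offset(last_processed + 1))
        .expect("invalid offset");
    consumer.assign(&tpl).expect("assign failed");
    consumer
}
```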

lispc commented 3 years ago

To reproduce:

  1. git clone https://github.com/Fluidex/fluidex-backend and check out the submodules. Install deps: scripts/install_deps.sh
  2. set persist_every_n_block to 20 in rollup-state-manager/config.yaml
  3. run run.sh
  4. after a minute, kill the rollup state manager
  5. modify run.sh and restart the rollup state manager only
  6. the rollup state manager will crash soon
  7. check the log: less rollup-state-manager/rollup_state_manager.$DATE.log

noel2004 commented 3 years ago

I have reproduced the issue on my side, diving in ...

noel2004 commented 3 years ago

Well, the rollup_state_manager crash should be my fault: I put messages triggered by two runs of tick.ts into the message queue, and the conflicting input (duplicated registrations, unmatched balances, etc.) causes assertion failures inside rollup_state_manager.
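
To illustrate why overlapping tick.ts runs trip an assertion, here is a simplified toy model; it is not the actual rollup_state_manager state code, just the shape of the failure:

```rust
// Illustration only: replaying the registration events of two overlapping
// tick.ts runs delivers the same account twice, so a duplicate-registration
// assertion fires during replay.
use std::collections::HashMap;

#[derive(Default)]
struct ToyState {
    accounts: HashMap<u32, String>, // account_id -> L2 pubkey (simplified)
}

impl ToyState {
    fn register_user(&mut self, account_id: u32, pubkey: String) {
        assert!(
            !self.accounts.contains_key(&account_id),
            "duplicated registration for account {}",
            account_id
        );
        self.accounts.insert(account_id, pubkey);
    }
}

fn main() {
    let mut state = ToyState::default();
    state.register_user(1, "0xabc".into()); // first tick.ts run
    state.register_user(1, "0xabc".into()); // second run replays it -> panic
}
```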

However, I have failed to find non-contiguous messages or abnormal offsets. For a "valid" message queue (produced by running tick.ts only once), rollup_state_manager can always replay it correctly.

My 'minimal' playground is built like this:

  1. Only one Kafka instance and one PostgreSQL instance are used. Three databases (prover_cluster, rollup_state_manager and exchange) have been created in the single db instance;
  2. Only matchingengine and rollup_state_manager are built and run.
  3. Test data is generated by tick.ts.
  4. After rollup_state_manager has read and processed all the messages it received, I can erase some of its dumped records and restart rollup_state_manager; the program can always replay the message queue correctly, starting from the offset specified in the latest remaining dump.

    For example, with rollup_state_manager configured as in the repro above, after processing about 600 messages from Kafka, rollup_state_manager dumps records at 20, 40, 60 and 80, each in its own directory.

    Then I stop rollup_state_manager, erase some of the dumped records, say 60.db and 80.db, and restart. rollup_state_manager correctly starts from the offset in the 40.db record and handles the rest of the messages again without any errors; a rough sketch of this restart selection follows.
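
A rough sketch of that restart selection, assuming each dump ends up as a file named <block>.db under one dumps directory and records the Kafka offset it was taken at (the actual persistence layout and format may differ):

```rust
// Hypothetical sketch: pick the newest remaining dump after some were erased.
use std::fs;
use std::path::PathBuf;

fn latest_dump(dir: &str) -> Option<PathBuf> {
    let mut dumps: Vec<(u64, PathBuf)> = fs::read_dir(dir)
        .ok()?
        .filter_map(|entry| {
            let path = entry.ok()?.path();
            // "40.db" -> block 40; non-numeric file names are ignored
            let block: u64 = path.file_stem()?.to_str()?.parse().ok()?;
            Some((block, path))
        })
        .collect();
    dumps.sort_by_key(|(block, _)| *block);
    // With 60.db and 80.db erased this returns 40.db, and replay resumes from
    // the Kafka offset recorded in that dump, as described above.
    dumps.pop().map(|(_, path)| path)
}
```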

noel2004 commented 3 years ago

After getting rid of the other factor that also causes the rollup manager to crash (see #133), I can finally reproduce the non-contiguous message receiving issue.

I also found that the issue is not deterministic: simply running the rollup manager again, the program gets past the discontinuous offset and then runs smoothly. See the attachments: in step2_fail.log the program throws an assertion failure because it receives the message at offset 3865 right after 3863, while in step3_pass.log I run the program again, it receives the message at offset 3864, and then keeps handling messages.

Attachments: step2_fail.log, step3_pass.log

It seems the issue arises when there are tons of messages sitting in the Kafka topic waiting to be read and the message-processing thread in the rollup manager runs too fast?
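
If the size of that backlog matters, one cheap check when the assertion fires is to compare the next offset the processor expects with the partition's high watermark via rdkafka's fetch_watermarks. The helper below is a hypothetical sketch; the topic and partition passed in are assumptions, not Fluidex constants:

```rust
// Diagnostic sketch: how far behind the consumer is on one partition.
// A large gap at crash time would back up the "tons of messages waiting" theory.
use std::time::Duration;

use rdkafka::consumer::{BaseConsumer, Consumer};

fn log_backlog(consumer: &BaseConsumer, topic: &str, partition: i32, next_offset: i64) {
    match consumer.fetch_watermarks(topic, partition, Duration::from_secs(1)) {
        Ok((low, high)) => eprintln!(
            "partition {}: watermarks [{}, {}), next to process {}, backlog {}",
            partition,
            low,
            high,
            next_offset,
            high - next_offset
        ),
        Err(e) => eprintln!("fetch_watermarks failed: {:?}", e),
    }
}
```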