DelphinusLab / prover-node-docker

Docker environment for zkwasm prover node
10 stars 4 forks source link

zkwasm_playground::service::runner: Panic happen, restore ark indent to 0. #14

Closed kii-awesome closed 1 month ago

kii-awesome commented 2 months ago

Hi after updating my prover I got this

prover-dry-run-service-1  | [2024-08-13T18:48:02Z ERROR zkwasm_playground::service::runner] Panic happen, restore ark indent to 0.
prover-dry-run-service-1  | [2024-08-13T18:48:02Z ERROR zkwasm_playground::dry_run_service::runner] Error running dry run for task: Some(ObjectId("669d7f87a7007a2f46e6257e"))
prover-dry-run-service-1  | [2024-08-13T18:48:02Z ERROR zkwasm_playground::dry_run_service::runner] Dry run failed for task: DryRunError("Dry run failed: Panic in dry run: thread 'DryRunThread' panicked at 'Unexpected failure: get leaf fail: MerkleError { source: [188, 253, 65, 218, 173, 59, 83, 153, 248, 158, 188, 251, 140, 171, 194, 146, 49, 25, 196, 85, 92, 114, 28, 9, 130, 126, 105, 4, 197, 80, 208, 32], index: 0, code: InvalidHash }', /home/richard/.cargo/git/checkouts/zkwasm-73ae3e04549605ed/cc66e1f/crates/host/src/host/merkle_helper/merkle.rs:130:18\nnote: run with `RUST_BACKTRACE=1` environment variable to display a backtrace\n. Please check the logfile on the prover node for details.", "")
prover-dry-run-service-1  | thread 'main' panicked at 'Dry run failed for task: DryRunError("Dry run failed: Panic in dry run: thread 'DryRunThread' panicked at 'Unexpected failure: get leaf fail: MerkleError { source: [188, 253, 65, 218, 173, 59, 83, 153, 248, 158, 188, 251, 140, 171, 194, 146, 49, 25, 196, 85, 92, 114, 28, 9, 130, 126, 105, 4, 197, 80, 208, 32], index: 0, code: InvalidHash }', /home/richard/.cargo/git/checkouts/zkwasm-73ae3e04549605ed/cc66e1f/crates/host/src/host/merkle_helper/merkle.rs:130:18\nnote: run with `RUST_BACKTRACE=1` environment variable to display a backtrace\n. Please check the logfile on the prover node for details.", "")', src/dry_run_service/runner.rs:210:17
yymone commented 2 months ago

From the prover node log, it means it cannot find the any initial task id, which means the dry run never success and the db is fresh new. You mentioned you are upgrading from old version? If that is the case, it means the mahchine's local db looks like have issue and cannot be read. It may also the reason why dry run failed in merkle tree. so we suggest you can try remove both workspace volume and local db directory and retry. I assume you are all using default docker settings so:

  1. Stop all container by "docker compose down"
  2. docker volume rm prover-node-docker_workspace-volume to remove workspace volume
  3. Under the prover-node-docker folder, remove the mongo dir.
  4. docker compose up to start again.
kii-awesome commented 2 months ago

Yeap happened after upgrading the prover node. Tried deleting the image and restarting it should be fine for a while. I think the problem is also with the db but I'm not sure myself.

Just did what you asked and let's see. Will give a follow up here.

kii-awesome commented 2 months ago

Update.

still error

prover-dry-run-service-1  | [2024-08-15T02:36:45Z ERROR zkwasm_playground::service::runner] Panic happen, restore ark indent to 0.
prover-dry-run-service-1  | [2024-08-15T02:36:45Z ERROR zkwasm_playground::dry_run_service::runner] Error running dry run for task: Some(ObjectId("66b71ebdb5fc356808821450"))
...
prover-dry-run-service-1  | [2024-08-15T02:36:45Z ERROR zkwasm_playground::dry_run_service::runner] Dry run failed for task: DryRunError("Dry run failed: Panic in dry run: fetch[4099380645510411023, 4288007636041095757, 6631711623484847052, 2035010388109384141, 12884901888, 10640506508122142577, 10429521298291405618, 2370344531932765292, 120271624000914872, 8589934592, 10640506508122142577, 10429521298291405618, 2370344531932765292, 120271624000914872, 281479271677952, 4099380645510411023, 4288007636041095757, 6631711623484847052, 2035010388109384141, 281496451547136, 4099380645510411023, 4288007636041095757, 6631711623484847052, 2035010388109384141, 562975723225089]fetch[4099380645510411023, 4288007636041095757, 6631711623484847052, 2035010388109384141, 12884901888, 10640506508122142577, 10429521298291405618, 2370344531932765292, 120271624000914872, 8589934592, 10640506508122142577, 10429521298291405618, 2370344531932765292, 120271624000914872, 281479271677952, 4099380645510411023, 4288007636041095757, 6631711623484847052, 2035010388109384141, 281496451547136]fetch[4099380645510411023, 4288007636041095757, 6631711623484847052, 2035010388109384141, 12884901888, 10640506508122142577, 10429521298291405618, 2370344531932765292, 120271624000914872, 8589934592, 10640506508122142577, 10429521298291405618, 2370344531932765292, 120271624000914872, 281479271677952]fetch[4099380645510411023, 4288007636041095757, 6631711623484847052, 2035010388109384141, 12884901888, 10640506508122142577, 10429521298291405618, 2370344531932765292, 120271624000914872, 8589934592]fetch[4099380645510411023, 4288007636041095757, 6631711623484847052, 2035010388109384141, 12884901888]=-=-= dump queue =-=-=\n[1] - 2 - 6\n[0] - 1 - 5\n[0] - 1 - 1\n[0] - 0 - 2\n[0] - 0 - 3\n=-=-= end =-=-=\nroot after process [4359988806327673052, 9074689522097986408, 8048241652779801675, 48084975262841424]\n=-=-= dump queue =-=-=\n[0] - 2 - 6\n[0] - 1 - 5\n[0] - 1 - 1\n[0] - 0 - 2\n[0] - 0 - 3\n=-=-= end =-=-=\nstore object [4287989799505254989, 6631711623484847052, 4288007636041095757, 65283]\nend store object\nstore player\nend store player\nstore object [4287989799505189453, 6631711623484847052, 4288007636041095757, 65283]\nend store object\nstore player\nend store player\nstore object [10429492316355015474, 2370344531932765292, 10429521298291405618, 65283]\nend store object\nstore player\nend store player\nstore object [10429492316354949938, 2370344531932765292, 10429521298291405618, 65283]\nend store object\nstore player\nend store player\nstore object [4287989799505123917, 6631711623484847052, 4288007636041095757, 65283]\nend store object\nstore player\nend store player\nroot after process [967058694493352534, 9074863511553287875, 9175668587783503505, 795617983756447423]\n=-=-= dump queue =-=-=\n[1] - 0 - 3\n[0] - 0 - 4\n[2] - 2 - 7\n[0] - 1 - 6\n[0] - 1 - 2\n=-=-= end =-=-=\nroot after process [967058694493352534, 9074863511553287875, 9175668587783503505, 795617983756447423]\n=-=-= dump queue =-=-=\n[0] - 0 - 3\n[0] - 0 - 4\n[2] - 2 - 7\n[0] - 1 - 6\n[0] - 1 - 2\n=-=-= end =-=-=\nstore object [10429492316354949938, 2370344531932765292, 10429521298291405618, 65283]\nend store object\nstore player\nend store player\nstore object [4287989799505123917, 6631711623484847052, 4288007636041095757, 65283]\nend store object\nstore player\nend store player\nroot after process [18394279174509955234, 4358874192354805475, 7700522006547596045, 2357955983745049504]\n=-=-= dump queue =-=-=\n[1] - 2 - 7\n[0] - 1 - 6\n[0] - 1 - 2\n[0] - 0 - 5\n[2] - 0 - 4\n=-=-= end =-=-=\nroot after process [18394279174509955234, 4358874192354805475, 7700522006547596045, 2357955983745049504]\n=-=-= dump queue =-=-=\n[0] - 2 - 7\n[0] - 1 - 6\n[0] - 1 - 2\n[0] - 0 - 5\n[2] - 0 - 4\n=-=-= end =-=-=\nstore object [4287989799505254989, 6631711623484847052, 4288007636041095757, 65283]\nend store object\nstore player\nend store player\nstore object [4287989799505189453, 6631711623484847052, 4288007636041095757, 65283]\nend store object\nstore player\nend store player\nstore object [10429492316355015474, 2370344531932765292, 10429521298291405618, 65283]\nend store object\nstore player\nend store player\nstore object [4287989799505123917, 6631711623484847052, 4288007636041095757, 65283]\nend store object\nstore player\nend store player\nroot after process [408090878201851920, 12606799617918281032, 15772081462045812705, 434515526492992495]\n=-=-= dump queue =-=-=\n[1] - 0 - 4\n[0] - 2 - 0\n[0] - 1 - 3\n[2] - 1 - 7\n[0] - 0 - 6\n=-=-= end =-=-=\nroot after process [408090878201851920, 12606799617918281032, 15772081462045812705, 434515526492992495]\n=-=-= dump queue =-=-=\n[0] - 0 - 4\n[0] - 2 - 0\n[0] - 1 - 3\n[2] - 1 - 7\n[0] - 0 - 6\n=-=-= end =-=-=\nstore object [10429492316354949938, 2370344531932765292, 10429521298291405618, 65283]\nend store object\nstore player\nend store player\nstore object [4287989799505254989, 6631711623484847052, 4288007636041095757, 65283]\nend store object\nstore player\nend store player\nstore object [10429492316355015474, 2370344531932765292, 10429521298291405618, 65283]\nend store object\nstore player\nend store player\nroot after process [6480432437839597525, 3758309166926135001, 8190507772588482221, 2641619204713695443]\n=-=-= dump queue =-=-=\n[1] - 1 - 7\n[0] - 0 - 6\n[0] - 2 - 1\n[2] - 0 - 5\n[0] - 1 - 4\n=-=-= end =-=-=\nroot after process [6480432437839597525, 3758309166926135001, 8190507772588482221, 2641619204713695443]\n=-=-= dump queue =-=-=\n[0] - 1 - 7\n[0] - 0 - 6\n[0] - 2 - 1\n[2] - 0 - 5\n[0] - 1 - 4\n=-=-= end =-=-=\nstore object [4287989799505189453, 6631711623484847052, 4288007636041095757, 65283]\nthread 'DryRunThread' panicked at 'Unexpected failure: update leaf with proof fail: MerkleError { source: [115, 145, 165, 177, 30, 57, 229, 245, 65, 75, 174, 232, 192, 226, 189, 119, 185, 254, 199, 193, 155, 17, 186, 241, 76, 28, 128, 159, 114, 0, 89, 45], index: 2047, code: InvalidHash }', /home/richard/.cargo/git/checkouts/zkwasm-73ae3e04549605ed/cc66e1f/crates/host/src/host/merkle_helper/merkle.rs:116:18\nnote: run with `RUST_BACKTRACE=1` environment variable to display a backtrace\n. Please check the logfile on the prover node for details.", "fetch[4099380645510411023, 4288007636041095757, 6631711623484847052, 2035010388109384141, 12884901888, 10640506508122142577, 10429521298291405618, 2370344531932765292, 120271624000914872, 8589934592, 10640506508122142577, 10429521298291405618, 2370344531932765292, 120271624000914872, 281479271677952, 4099380645510411023, 4288007636041095757, 6631711623484847052, 2035010388109384141, 281496451547136, 4099380645510411023, 4288007636041095757, 6631711623484847052, 2035010388109384141, 562975723225089]fetch[4099380645510411023, 4288007636041095757, 6631711623484847052, 2035010388109384141, 12884901888, 10640506508122142577, 10429521298291405618, 2370344531932765292, 120271624000914872, 8589934592, 10640506508122142577, 10429521298291405618, 2370344531932765292, 120271624000914872, 281479271677952, 4099380645510411023, 4288007636041095757, 6631711623484847052, 2035010388109384141, 281496451547136]fetch[4099380645510411023, 4288007636041095757, 6631711623484847052, 2035010388109384141, 12884901888, 10640506508122142577, 10429521298291405618, 2370344531932765292, 120271624000914872, 8589934592, 10640506508122142577, 10429521298291405618, 2370344531932765292, 120271624000914872, 281479271677952]fetch[4099380645510411023, 4288007636041095757, 6631711623484847052, 2035010388109384141, 12884901888, 10640506508122142577, 10429521298291405618, 2370344531932765292, 120271624000914872, 8589934592]fetch[4099380645510411023, 4288007636041095757, 6631711623484847052, 2035010388109384141, 12884901888]=-=-= dump queue =-=-=\n[1] - 2 - 6\n[0] - 1 - 5\n[0] - 1 - 1\n[0] - 0 - 2\n[0] - 0 - 3\n=-=-= end =-=-=\nroot after process [4359988806327673052, 9074689522097986408, 8048241652779801675, 48084975262841424]\n=-=-= dump queue =-=-=\n[0] - 2 - 6\n[0] - 1 - 5\n[0] - 1 - 1\n[0] - 0 - 2\n[0] - 0 - 3\n=-=-= end =-=-=\nstore object [4287989799505254989, 6631711623484847052, 4288007636041095757, 65283]\nend store object\nstore player\nend store player\nstore object [4287989799505189453, 6631711623484847052, 4288007636041095757, 65283]\nend store object\nstore player\nend store player\nstore object [10429492316355015474, 2370344531932765292, 10429521298291405618, 65283]\nend store object\nstore player\nend store player\nstore object [10429492316354949938, 2370344531932765292, 10429521298291405618, 65283]\nend store object\nstore player\nend store player\nstore object [4287989799505123917, 6631711623484847052, 4288007636041095757, 65283]\nend store object\nstore player\nend store player\nroot after process [967058694493352534, 9074863511553287875, 9175668587783503505, 795617983756447423]\n=-=-= dump queue =-=-=\n[1] - 0 - 3\n[0] - 0 - 4\n[2] - 2 - 7\n[0] - 1 - 6\n[0] - 1 - 2\n=-=-= end =-=-=\nroot after process [967058694493352534, 9074863511553287875, 9175668587783503505, 795617983756447423]\n=-=-= dump queue =-=-=\n[0] - 0 - 3\n[0] - 0 - 4\n[2] - 2 - 7\n[0] - 1 - 6\n[0] - 1 - 2\n=-=-= end =-=-=\nstore object [10429492316354949938, 2370344531932765292, 10429521298291405618, 65283]\nend store object\nstore player\nend store player\nstore object [4287989799505123917, 6631711623484847052, 4288007636041095757, 65283]\nend store object\nstore player\nend store player\nroot after process [18394279174509955234, 4358874192354805475, 7700522006547596045, 2357955983745049504]\n=-=-= dump queue =-=-=\n[1] - 2 - 7\n[0] - 1 - 6\n[0] - 1 - 2\n[0] - 0 - 5\n[2] - 0 - 4\n=-=-= end =-=-=\nroot after process [18394279174509955234, 4358874192354805475, 7700522006547596045, 2357955983745049504]\n=-=-= dump queue =-=-=\n[0] - 2 - 7\n[0] - 1 - 6\n[0] - 1 - 2\n[0] - 0 - 5\n[2] - 0 - 4\n=-=-= end =-=-=\nstore object [4287989799505254989, 6631711623484847052, 4288007636041095757, 65283]\nend store object\nstore player\nend store player\nstore object [4287989799505189453, 6631711623484847052, 4288007636041095757, 65283]\nend store object\nstore player\nend store player\nstore object [10429492316355015474, 2370344531932765292, 10429521298291405618, 65283]\nend store object\nstore player\nend store player\nstore object [4287989799505123917, 6631711623484847052, 4288007636041095757, 65283]\nend store object\nstore player\nend store player\nroot after process [408090878201851920, 12606799617918281032, 15772081462045812705, 434515526492992495]\n=-=-= dump queue =-=-=\n[1] - 0 - 4\n[0] - 2 - 0\n[0] - 1 - 3\n[2] - 1 - 7\n[0] - 0 - 6\n=-=-= end =-=-=\nroot after process [408090878201851920, 12606799617918281032, 15772081462045812705, 434515526492992495]\n=-=-= dump queue =-=-=\n[0] - 0 - 4\n[0] - 2 - 0\n[0] - 1 - 3\n[2] - 1 - 7\n[0] - 0 - 6\n=-=-= end =-=-=\nstore object [10429492316354949938, 2370344531932765292, 10429521298291405618, 65283]\nend store object\nstore player\nend store player\nstore object [4287989799505254989, 6631711623484847052, 4288007636041095757, 65283]\nend store object\nstore player\nend store player\nstore object [10429492316355015474, 2370344531932765292, 10429521298291405618, 65283]\nend store object\nstore player\nend store player\nroot after process [6480432437839597525, 3758309166926135001, 8190507772588482221, 2641619204713695443]\n=-=-= dump queue =-=-=\n[1] - 1 - 7\n[0] - 0 - 6\n[0] - 2 - 1\n[2] - 0 - 5\n[0] - 1 - 4\n=-=-= end =-=-=\nroot after process [64804324
yymone commented 1 month ago

This is a little bit weird. I checked the 2 errors from your copied log: The first error for dry run panic happen at 669d7f87a7007a2f46e6257e, which means dry run failed relatively old task on 22/07/2024 The seconed try for dry run panic happen at 66b71ebdb5fc356808821450, which means dry run failed on a newer (later) task on 10/08/2024.

As it is not failed at the same task, I suspect it related to the local db error and the local db internally data got issue randomly after processing 100k tasks dry run. May related to local env.

I suggest to do the process again:

  1. Stop all container by "docker compose down"
  2. docker volume rm prover-node-docker_workspace-volume to remove workspace volume
  3. Under the prover-node-docker folder, remove the mongo dir to clean the db
  4. docker compose up to start again.

to clean mongodb again and retry again.

Then we can see whether it panic again and which task it failed to confirm it is randomly failed or failed on the same task id. And if it is still randomly failed, could you send the all dry run logs under the folder: /var/lib/docker/volumes/prover-node-docker_dry-run-logs-volume/_data

(Assuming you are using defualt settings for log file places and db location)

and maybe we can check whether their are any abnormal things happen before this error from mongodb.

kii-awesome commented 1 month ago

@yymone last night Sinka Gao requested access to my ssh, you should be able to see it.

Sinka Gao, [15/08/2024 20:39] might need to give me ssh to see what is going on

Kii | genznodes, [15/08/2024 20:41] sure

Sinka Gao, [15/08/2024 20:43] thank you, I will have a look

Sinka Gao, [15/08/2024 20:43] thank you for reporting

Kii | genznodes, [15/08/2024 20:43] lmk if u can access

Sinka Gao, [15/08/2024 20:43] it is about 0:am, I will study the case tomorrow morning

Sinka Gao, [15/08/2024 20:44] Will let you know if I find anything

yymone commented 1 month ago

If your machine is 91.218.113.131, we figure out the root cause.

The machine is not safe and maybe expose 27017 to public. As our internal mongodb do not have password, so some hacker auto scan all ips and find this machine and remove the dry run database and blackmail you.

I just started the mongod container and check the content inside it, the db change to : image

This is the reason why the dry run cannnot continue.

Please fix the security issue and then do the processes again:

  1. Stop all container by "docker compose down"
  2. docker volume rm prover-node-docker_workspace-volume to remove workspace volume
  3. Under the prover-node-docker folder, remove the mongo dir to clean the db
  4. docker compose up to start again.
kii-awesome commented 1 month ago

ah yes it's my bad, thanks for the response and help