codex-storage / nim-codex

Decentralized Durability Engine

Scheduling prover on another thread #808

Open elcritch opened 1 month ago

elcritch commented 1 month ago

The prover currently blocks the main Chronos thread. It should be scheduled on a separate thread, similar to how erasure coding is handled in https://github.com/codex-storage/nim-codex/pull/716.
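The general pattern being asked for (hand the blocking call to a worker and collect the result later) can be sketched as follows. This is only an illustration in Rust, the language of the prover library, and not nim-codex code: `blocking_prove` and `spawn_prove` are hypothetical names, and the arithmetic merely stands in for an expensive proof computation.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for an expensive, blocking proof computation.
/// Here it just sums the squares of 1..=input.
fn blocking_prove(input: u64) -> u64 {
    (1..=input).fold(0u64, |acc, x| acc.wrapping_add(x.wrapping_mul(x)))
}

/// Runs `blocking_prove` on a worker thread and returns immediately.
/// The caller keeps the receiver and picks up the result when it is ready,
/// so the calling thread is never blocked by the computation itself.
pub fn spawn_prove(input: u64) -> mpsc::Receiver<u64> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The input is moved into the worker, so nothing mutable is shared.
        // A send error only means the caller dropped the receiver.
        let _ = tx.send(blocking_prove(input));
    });
    rx
}
```

A caller would hold on to the returned receiver and call `recv()` (or poll with `try_recv()`) once it needs the result. In an event-loop setting, the worker would instead signal the loop when done.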

elcritch commented 1 month ago

Notes on the circom-compat prover and Nim / Rust data ownership needed to do the threading properly:

elcritch commented 1 month ago

Notes on CircomBn254Cfg:

TODO: figure out a simple way to make a CircomBn254Cfg for each worker thread?

Next, handling the CircomCompatCtx is relatively simple:

elcritch commented 1 month ago

Updates: Adding a thread using taskpools results in memory corruption and crashes in either the Rust library code or on the Nim side after calling into the Rust library.

I've tried lots of variations on the problem. Initially my thinking was that it was something fairly simple, like sharing a piece of mutable memory between threads. However, I'm fairly confident I've tracked down all the places in the inputs where this could occur.

Things I've tried which haven't resolved the issue:

The Nim side appears to work fine when running the builder and creating the CircomBn254Cfg or CircomCompatCtx objects. However, as soon as prove is called, memory starts getting corrupted. Printing the Nim inputs to the builder shows corruption in the builder keys, despite using to_owned to ensure the Rust side makes its own copy of the strings.
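For reference, an owning copy at an FFI boundary typically looks something like the sketch below. It is a minimal illustration, not the circom-compat code: `PathHolder` and `holder_from_c_path` are hypothetical names. It shows why `to_owned` should leave the Rust side independent of the caller's buffer once the call returns.

```rust
use std::ffi::CStr;
use std::os::raw::c_char;

/// Hypothetical builder-like object that keeps its own copy of a path.
pub struct PathHolder {
    path: String,
}

impl PathHolder {
    pub fn path(&self) -> &str {
        &self.path
    }
}

/// Copies the NUL-terminated string behind `ptr` into Rust-owned memory.
/// Returns `None` for a null pointer or for bytes that are not valid UTF-8.
///
/// # Safety
/// `ptr` must be null or point to a valid NUL-terminated string that stays
/// alive for the duration of this call (not beyond it).
pub unsafe fn holder_from_c_path(ptr: *const c_char) -> Option<PathHolder> {
    if ptr.is_null() {
        return None;
    }
    // `CStr::from_ptr` only borrows the caller's buffer; nothing is copied yet.
    let borrowed: &CStr = unsafe { CStr::from_ptr(ptr) };
    // `to_owned` allocates a fresh `String`, so the caller may free or reuse
    // its buffer as soon as this function returns.
    let path = borrowed.to_str().ok()?.to_owned();
    Some(PathHolder { path })
}
```

If the copy is made like this, later corruption of the stored string would have to come from something else writing over Rust's allocation, not from the caller releasing its buffer.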

elcritch commented 1 month ago

Also, returning the proof appears to work fine, as does sending the arguments to the taskpool workers. Even running only a single worker thread, which should rule out race conditions, results in corrupted memory.

My current best guess is that something in the CircomCompat or Wasmer libraries assumes it's running on the main thread, or has a thread-local resource that is instantiated internally.
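To make the thread-local hypothesis concrete, here is a small Rust sketch of how such a resource behaves. It is purely illustrative and says nothing about what CircomCompat or Wasmer actually do internally: `ENGINE`, `init_engine`, `engine_name`, and `engine_name_on_worker` are hypothetical names.

```rust
use std::cell::RefCell;
use std::thread;

thread_local! {
    // Hypothetical per-thread resource. Every thread gets its own slot,
    // and each slot starts out empty.
    static ENGINE: RefCell<Option<String>> = RefCell::new(None);
}

/// Initializes the resource for the *calling* thread only.
pub fn init_engine(name: &str) {
    ENGINE.with(|e| *e.borrow_mut() = Some(name.to_owned()));
}

/// Reads the resource as seen by the calling thread.
pub fn engine_name() -> Option<String> {
    ENGINE.with(|e| e.borrow().clone())
}

/// Reads the resource from a freshly spawned worker thread.
pub fn engine_name_on_worker() -> Option<String> {
    thread::spawn(engine_name).join().expect("worker thread panicked")
}
```

After calling `init_engine` on one thread, `engine_name()` on that thread returns the value, while `engine_name_on_worker()` returns `None` because the worker has its own, uninitialized slot. A library that assumes the slot was set up on the thread that later calls into it would misbehave in exactly this situation.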

elcritch commented 1 month ago

I'm going to set up a Docker image so I can run valgrind on my Mac and try to see what's going on.

Example parameter corruption below:

TASK: task: proof: success((a: (x: [252, 125, 205, 30, 91, 169, 208, 144, 232, 29, 242, 124, 78, 116, 186, 167, 118, 99, 221, 210, 215, 182, 33, 41, 207, 198, 113, 40, 80, 145, 99, 18], y: [173, 31, 56, 59, 77, 216, 231, 124, 23, 188, 79, 117, 200, 10, 199, 13, 116, 69, 134, 172, 56, 116, 225, 203, 14, 136, 127, 175, 192, 153, 96, 35]), b: (x: [[94, 109, 223, 96, 64, 52, 178, 201, 60, 203, 105, 119, 66, 155, 205, 209, 8, 234, 163, 208, 173, 164, 91, 83, 220, 48, 137, 72, 79, 234, 199, 29], [250, 128, 38, 89, 24, 148, 228, 232, 88, 216, 131, 10, 178, 203, 128, 242, 4, 210, 81, 200, 180, 31, 163, 127, 214, 163, 178, 66, 67, 50, 63, 45]], y: [[194, 156, 184, 152, 149, 70, 210, 6, 167, 99, 182, 182, 40, 60, 98, 158, 117, 67, 86, 45, 125, 183, 118, 193, 254, 255, 4, 210, 31, 119, 169, 6], [81, 217, 255, 215, 196, 250, 117, 10, 119, 9, 141, 91, 234, 103, 221, 104, 135, 113, 136, 254, 9, 91, 56, 152, 110, 99, 15, 199, 245, 82, 124, 40]]), c: (x: [55, 23, 205, 200, 58, 113, 165, 31, 166, 57, 161, 90, 103, 18, 239, 162, 36, 18, 25, 165, 218, 114, 234, 182, 137, 191, 139, 72, 65, 74, 51, 34], y: [108, 100, 140, 89, 122, 70, 78, 164, 197, 68, 107, 10, 252, 80, 14, 236, 73, 215, 139, 193, 93, 129, 50, 85, 108, 117, 18, 170, 79, 172, 153, 10])))
TASK: task: params POST: (slotDepth: 32, datasetDepth: 8, blkDepth: 5, cellElms: 67, numSamples: 5, r1csPath: "tests/circuits/fixtures/proof_main.r1cs", wasmPath: "tests/circuits/fixtures/proof_main.wasm", zkeyPath: "")
TASK: task: 
TASK: task: params: 0x114f0f380
TASK: task: params: (slotDepth: 32, datasetDepth: 8, blkDepth: 5, cellElms: 67, numSamples: 5, r1csPath: "tests/circuits/fixtures/proof_main.r1cs", wasmPath: "tests/circuits/fixtures/proof_main.wasm", zkeyPath: "")
TASK: task: -2001899808422488719
TASK: task spawn: params: 0x104c0e5b0
PROVE: 17
TASK: task spawn: params: 0x104c0edd0
PROVE: 18
TASK: task spawn: params: 0x104d37790
PROVE: 19
TASK: task spawn: params: 0x104c42ec0
PROVE: 20
TASK: task spawn: params: 0x104dda650
PROVE: 21
TASK: task spawn: params: 0x104c42a60
PROVE: 22
TASK: task spawn: params: 0x104c9bce0
PROVE: 23
TASK: task: proof: success((a: (x: [81, 35, 236, 68, 189, 99, 225, 10, 196, 226, 243, 68, 50, 229, 30, 152, 240, 87, 72, 199, 57, 148, 7, 211, 54, 247, 46, 101, 246, 118, 200, 42], y: [110, 79, 55, 98, 72, 199, 47, 7, 192, 80, 38, 5, 219, 105, 202, 90, 17, 243, 80, 71, 209, 92, 132, 24, 102, 191, 198, 233, 72, 220, 203, 40]), b: (x: [[228, 140, 96, 208, 2, 19, 115, 179, 216, 137, 145, 45, 135, 211, 132, 61, 237, 36, 11, 213, 234, 52, 143, 255, 111, 204, 50, 10, 76, 208, 244, 38], [99, 220, 252, 121, 17, 173, 194, 225, 147, 17, 36, 174, 113, 93, 227, 65, 147, 197, 162, 64, 105, 66, 150, 147, 132, 239, 81, 95, 24, 132, 126, 35]], y: [[82, 132, 204, 152, 211, 196, 92, 195, 40, 252, 126, 184, 31, 50, 41, 169, 69, 240, 230, 19, 95, 96, 114, 95, 41, 207, 78, 137, 21, 202, 152, 3], [153, 197, 8, 14, 251, 110, 63, 127, 55, 161, 111, 251, 170, 19, 163, 79, 113, 103, 209, 110, 35, 245, 203, 20, 84, 47, 247, 117, 184, 102, 251, 25]]), c: (x: [219, 122, 190, 26, 79, 118, 204, 95, 179, 74, 2, 48, 128, 94, 191, 208, 113, 163, 252, 107, 148, 228, 24, 230, 39, 188, 160, 187, 125, 197, 176, 33], y: [237, 41, 43, 42, 44, 61, 81, 206, 218, 6, 61, 247, 33, 229, 6, 105, 12, 216, 125, 145, 128, 144, 243, 200, 103, 100, 13, 66, 19, 154, 157, 27])))
TASK: task: params POST: (slotDepth: 32, datasetDepth: 8, blkDepth: 5, cellElms: 67, numSamples: 5, r1csPath: "tests/circuits/fixtures/proof_main.r1cs", wasmPath: "tests/circuits/fixtures/proof_main.wasm", zkeyPath: "")
TASK: task: 
TASK: task: params: 0x114f0f4c0
TASK: task: params: (slotDepth: 32, datasetDepth: 8, blkDepth: 5, cellElms: 67, numSamples: 5, r1csPath: "\x12 ��B��\x1C\x14���șo�$\'�A�d��L���\exR�U", wasmPath: "\x12 R߆;��AKV,:�D�\x13=\v���\e\x11��Q���\x15��", zkeyPath: "")