codex-storage / nim-codex

Decentralized Durability Engine
Apache License 2.0

[BUG] Clock update failure #820

Open benbierens opened 3 months ago

benbierens commented 3 months ago

[2024-06-03T09:22:57.1252625Z] Container Crash Log for 'Host4'.
[2024-06-03T09:22:57.1370914Z] DBG 2024-06-03 09:22:52.924+00:00 error updating clock: topics="contracts clock" tid=1 error="Incomplete data sent or received" count=24
[2024-06-03T09:22:57.1371843Z] ERR 2024-06-03 09:22:52.924+00:00 Codex failed to start topics="codex" tid=1 error="Transport is not initialised (missing a call to connect?)

[2024-06-03T09:23:07.0821101Z] Container Crash Log for 'Host3'.
[2024-06-03T09:23:07.0974609Z] DBG 2024-06-03 09:23:00.565+00:00 error updating clock: topics="contracts clock" tid=1 error="Incomplete data sent or received" count=24
[2024-06-03T09:23:07.0975374Z] ERR 2024-06-03 09:23:00.566+00:00 Codex failed to start topics="codex" tid=1 error="Transport is not initialised (missing a call to connect?)" count=25

[2024-06-03T09:23:07.1719014Z] Container Crash Log for 'Host5'.
[2024-06-03T09:23:07.1855021Z] DBG 2024-06-03 09:22:57.471+00:00 error updating clock: topics="contracts clock" tid=1 error="Incomplete data sent or received" count=24
[2024-06-03T09:23:07.1856625Z] ERR 2024-06-03 09:22:57.471+00:00 Codex failed to start topics="codex" tid=1 error="Transport is not initialised (missing a call to connect?)" count=25

[2024-06-03T09:23:17.2687901Z] Container Crash Log for 'Host7'.
[2024-06-03T09:23:17.2829670Z] DBG 2024-06-03 09:23:03.344+00:00 error updating clock: topics="contracts clock" tid=1 error="Incomplete data sent or received" count=24
[2024-06-03T09:23:17.2830868Z] ERR 2024-06-03 09:23:03.344+00:00 Codex failed to start topics="codex" tid=1 error="Transport is not initialised (missing a call to connect?)" count=25

In the same test (same Geth instance, same contracts deployment), Host6 did not crash.

Location: clock.nim, line 41: debug "error updating clock: ", error=error.msg

Context: I was running the Marketplace dist-test several times with the first image that contains the LevelDB datastore. The results, in order: Pass, Fail (described above), Pass, Pass, Pass, Pass, Pass.

What could cause this? Was it just a fluke? Or could this be another race condition? If so, it is very unusual that 4 of the 5 hosts would display the same failure in the same test run, and that we would then not see it again. 4 out of 5 hosts failing suggests the failure is fairly likely, but 6 successful runs suggest it's very unlikely.
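To put a rough number on that intuition, here is a back-of-the-envelope sketch. It assumes every one of the 7 runs started 5 hosts (only the failing run is known to have done so) and that, if crashes were independent per-host flukes, the 4 crashes could have landed on any 4 of the 35 host starts with equal probability. Under those assumptions it computes the chance that all 4 end up in the same run.

```python
from math import comb

RUNS = 7           # 1 failed run + 6 passing runs, as listed above
HOSTS_PER_RUN = 5  # assumption: every run started 5 hosts, like the failing one
FAILURES = 4       # Host3, Host4, Host5 and Host7


def prob_all_failures_in_one_run(runs: int, hosts_per_run: int, failures: int) -> float:
    """Chance that `failures` independent, uniformly placed crashes share one run."""
    total_starts = runs * hosts_per_run
    # Every way to pick which host starts crashed.
    all_placements = comb(total_starts, failures)
    # Placements where all crashes fall inside a single run.
    clustered = runs * comb(hosts_per_run, failures)
    return clustered / all_placements


p = prob_all_failures_in_one_run(RUNS, HOSTS_PER_RUN, FAILURES)
print(f"P(all {FAILURES} crashes in one run) = {p:.6f}")
```

With these numbers the result is 35 / 52360, or 1 in 1496, so independent per-host flukes are a poor fit. Something shared by the hosts in that one run looks more plausible.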

[2024-06-03T10:48:16.3959907Z] *** Finished: [MarketplaceExample] = Failed
[2024-06-03T10:48:16.3963936Z] System.TimeoutException : Retry 'Checking SlotFilled events' timed out after 51 tries over 4 mins, 13 secs.

markspanbroek commented 3 months ago

This indicates that the Codex nodes could not connect to the RPC endpoint of the Ethereum node: the HTTP or WebSocket connection could not be established. Did all the nodes that crashed connect to the same Geth node? If so, then the Geth node may have crashed, or perhaps the network failed.

benbierens commented 3 months ago

This seems a likely explanation. So the only issue here is that Codex is not hardened yet against a crappy RPC endpoint.
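As a rough illustration of what "hardened" could mean here, the sketch below retries a failing connection attempt with exponential backoff instead of giving up on the first error. It is not Codex code and not in Nim. `connect_with_retry`, its parameters, and the `connect` callable are all hypothetical names for illustration, and `ConnectionError` merely stands in for whatever error the real RPC transport raises.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def connect_with_retry(
    connect: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call connect() until it succeeds or `attempts` tries are used up.

    Waits base_delay, then doubles the wait after each failure, capped at
    max_delay. The last failure is re-raised so the caller still sees it.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts:
                raise  # out of retries: surface the original error
            sleep(delay)
            delay = min(delay * 2, max_delay)
    raise AssertionError("unreachable")
```

A node would wrap its initial RPC connection in something like this, so a Geth instance that is briefly unavailable delays startup rather than crashing the node. The `sleep` parameter is injectable only so the behaviour can be tested without real waiting.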