Open teor2345 opened 2 days ago
This issue still happens after the code changes in:
Runtime::shutdown_timeout on the tokio runtime after the main farmer function returns
I added some logs to see where the hang was:
2024-10-28T03:04:35.739207Z ERROR {farm_index=0}: subspace_farmer::single_disk_farm::piece_reader: Failed to read piece from sector sector_index=6 piece_offset=651 error=Invalid chunk at location 20531677 s-bucket 32771 encoded true, possible disk corruption: Invalid scalar
^C2024-10-28T03:04:39.384097Z INFO subspace_farmer::utils: Received SIGINT, shutting down farmer...
2024-10-28T03:04:39.384111Z INFO subspace_farmer::commands::farm: signal select branch
2024-10-28T03:04:39.384113Z INFO subspace_farmer::commands::farm: end of async fn farm()
2024-10-28T03:05:10.236106Z ERROR {farm_index=0}: subspace_farmer::single_disk_farm::piece_reader: Failed to read piece from sector sector_index=6 piece_offset=671 error=Invalid chunk at location 15396453 s-bucket 24576 encoded true, possible disk corruption: Invalid scalar
The farmer is blocked at the end of the farm() function. None of the logs I added in main() were printed.
Aha, it did exit, it just took 17 minutes.
Full logs (including logs that show where in the code it hung):
Setup
On macOS 13.7 on M1 Max, I ran the following commands:
This bug happens with the binaries from:
3177
Root Cause
I understand the network is partly shut down, and the farmer's storage might be corrupted. I'm not sure how the corruption happened, because I was running the devnet binaries as part of the test network. Unfortunately I don't have logs of the corruption itself, because my terminal only saves a few thousand lines.
Whatever the root cause was, the farmer shouldn't hang when it's newly started, even if its storage is corrupt.
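To make the expectation concrete: shutdown should wait for background work for a bounded time, then exit anyway. Below is a minimal std-only sketch of that idea. It is not the farmer's code and not how tokio implements Runtime::shutdown_timeout; the helper name wait_with_timeout and the durations are made up for illustration.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical helper: runs `work` on a background thread and waits at most
/// `timeout` for it. Returns true if the work finished in time, false otherwise.
fn wait_with_timeout<F>(work: F, timeout: Duration) -> bool
where
    F: FnOnce() + Send + 'static,
{
    let (done_tx, done_rx) = mpsc::channel::<()>();
    thread::spawn(move || {
        work();
        // The receiver may already be gone if the caller timed out; ignore that.
        let _ = done_tx.send(());
    });
    // Bounded wait: stop waiting once the timeout elapses.
    done_rx.recv_timeout(timeout).is_ok()
}

fn main() {
    // Stand-in for slow background work that outlives the shutdown deadline.
    let finished = wait_with_timeout(
        || thread::sleep(Duration::from_millis(500)),
        Duration::from_millis(100),
    );
    println!("background work finished in time: {}", finished);
    // Returning from main ends the process even if the thread is still running.
}
```

The thread is deliberately detached rather than joined, because joining would bring back the unbounded wait.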
Errors
There were a bunch of "invalid chunk" errors, so I pressed Ctrl-C:
But the node didn't exit after I pressed Ctrl-C, or when I used kill (SIGTERM) on its pid:
I had to use kill -KILL to get it to exit.
Where it's hanging
The farmer is hanging somewhere in the piece reading code, but I'm not sure exactly which function is blocking it exiting.
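One pattern that could produce this kind of hang is a blocking loop that never checks for a shutdown signal between reads. I haven't confirmed that this is what the piece reader does. The sketch below only illustrates the pattern and the usual fix, using std only; read_piece_blocking and read_pieces are hypothetical names, not functions from subspace_farmer.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for a slow, blocking piece read.
fn read_piece_blocking(_piece_offset: u32) {
    thread::sleep(Duration::from_millis(5));
}

/// Reads up to `total` pieces, stopping early once `shutdown` is set.
/// Returns how many pieces were read.
fn read_pieces(total: u32, shutdown: &AtomicBool) -> u32 {
    let mut read = 0;
    for piece_offset in 0..total {
        // Without this check, the loop would run to completion after Ctrl-C.
        if shutdown.load(Ordering::Relaxed) {
            break;
        }
        read_piece_blocking(piece_offset);
        read += 1;
    }
    read
}

fn main() {
    let shutdown = Arc::new(AtomicBool::new(false));
    let worker = {
        let shutdown = Arc::clone(&shutdown);
        thread::spawn(move || read_pieces(1_000, &shutdown))
    };
    // Simulate the signal handler firing shortly after startup.
    thread::sleep(Duration::from_millis(50));
    shutdown.store(true, Ordering::Relaxed);
    let read = worker.join().expect("worker thread panicked");
    println!("stopped after {} of 1000 pieces", read);
}
```

With the check in place, the worker stops within roughly one read of the flag being set, instead of finishing every queued read first.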
Here's what I got from flamegraph --pid PID after the Ctrl-C when the farmer didn't shut down:
Full Logs
The full logs are: