Open · ldeffenb opened this issue 3 months ago
I was able to reproduce the problem. While it does not happen if the node is inactive at the time of the crash, it does happen if the node crashes while you are uploading a big file.
I suspect this is the exact cause of our data corruptions.
This time it wasn't a host crash, but an external USB-3 SSD drive decided to drop to read-only. Once I ran `fsck`, I did a `db validate`, and here are the results for the 11 sepolia testnet nodes on that disk.
"time"="2024-09-15 04:18:17.003541" "level"="info" "logger"="node" "msg"="validation finished" "duration"="1h4m9.981821568s" "invalid"=107 "soc"=820971 "total"=2877575
"time"="2024-09-15 04:41:19.907673" "level"="info" "logger"="node" "msg"="validation finished" "duration"="1h26m49.780242364s" "invalid"=157 "soc"=1131450 "total"=3643401
"time"="2024-09-15 04:54:48.560921" "level"="info" "logger"="node" "msg"="validation finished" "duration"="1h40m5.61195841s" "invalid"=43 "soc"=2783885 "total"=6848446
"time"="2024-09-15 04:47:15.433710" "level"="info" "logger"="node" "msg"="validation finished" "duration"="1h32m6.103982111s" "invalid"=558 "soc"=1132018 "total"=3882796
"time"="2024-09-15 04:45:17.990865" "level"="info" "logger"="node" "msg"="validation finished" "duration"="1h29m51.167401584s" "invalid"=138 "soc"=1132112 "total"=3738894
"time"="2024-09-15 04:44:32.199148" "level"="info" "logger"="node" "msg"="validation finished" "duration"="1h28m35.409959412s" "invalid"=111 "soc"=1133881 "total"=3638833
"time"="2024-09-15 04:22:26.020211" "level"="info" "logger"="node" "msg"="validation finished" "duration"="1h6m11.076314829s" "invalid"=68 "soc"=821095 "total"=2874815
"time"="2024-09-15 04:45:55.153415" "level"="info" "logger"="node" "msg"="validation finished" "duration"="1h29m9.896151561s" "invalid"=73 "soc"=1133914 "total"=3647901
"time"="2024-09-15 04:47:58.990756" "level"="info" "logger"="node" "msg"="validation finished" "duration"="1h30m58.795960225s" "invalid"=30 "soc"=1132307 "total"=3649879
"time"="2024-09-15 04:47:15.763540" "level"="info" "logger"="node" "msg"="validation finished" "duration"="1h29m59.39903985s" "invalid"=149 "soc"=1132293 "total"=3643709
"time"="2024-09-15 04:48:36.339089" "level"="info" "logger"="node" "msg"="validation finished" "duration"="1h31m5.163448941s" "invalid"=49 "soc"=1132224 "total"=3656289
I was actively uploading a test OSM dataset to the sepolia swarm when the disk went offline.
Also, I hadn't noticed this type of validation error before. Maybe sharky was extended just before the crash and the extension got reverted by the `fsck`? But what is really concerning is that the invalid count doesn't seem to increase when this error is logged.
"time"="2024-09-15 04:47:12.510417" "level"="warning" "logger"="node" "msg"="invalid chunk" "address"="79acfe7a19aaf5ad494f4deecac6cf978fac54bf4937bcb60fb1b4b0162cbd4f" "timestamp"="2024-09-15 02:52:52 +0000 UTC" "location"="shard: 4, slot: 213768, length: 4104" "error"="read 0: EOF"
"time"="2024-09-15 04:47:13.357727" "level"="warning" "logger"="node" "msg"="invalid chunk" "address"="79bc7bd94a8a4354780b7a4caf1f031d0fcab38b7f26511fd17fe9f757e6ecbb" "timestamp"="2024-09-15 02:52:54 +0000 UTC" "location"="shard: 4, slot: 213775, length: 1080" "error"="read 0: EOF"
"time"="2024-09-15 04:47:18.822251" "level"="warning" "logger"="node" "msg"="invalid chunk" "address"="7a2eeb81caf76f66c00c35676b09982c0843a27ce4f8a0491198fe2d6fa635a1" "timestamp"="2024-09-15 02:52:54 +0000 UTC" "location"="shard: 4, slot: 213776, length: 4104" "error"="read 0: EOF"
I re-ran the `db validate` on the node where I noticed these errors, and the errors are still there but do NOT count as invalid chunks. This does not bode well for that node unless I `nuke` it. And I just realized that this is the node that was actually doing the upload. The other nodes were simply receiving the pushed chunks.
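To make the concern concrete, here is a hypothetical, self-contained Go sketch of how a validation loop can produce exactly this symptom. It is NOT bee's actual code — Location, readChunk, and contentAddressValid are invented stand-ins — but it shows how a torn slot that fails with "read 0: EOF" can be logged as an "invalid chunk" yet never reach the invalid counter:

```go
// hypothetical_validate.go — NOT bee's code. A self-contained sketch of
// how a validation loop can log "invalid chunk" for a read error without
// incrementing the invalid counter, which would explain the behavior
// observed above. Location, readChunk, and contentAddressValid are all
// invented stand-ins for sharky's real types and checks.
package main

import (
	"fmt"
	"io"
)

type Location struct{ Shard, Slot int }

// readChunk stands in for the sharky read; a torn slot yields io.EOF.
func readChunk(loc Location) ([]byte, error) {
	if loc.Slot == 213768 { // pretend this slot was truncated by fsck
		return nil, fmt.Errorf("read 0: %w", io.EOF)
	}
	return []byte("chunk data"), nil
}

// contentAddressValid stands in for the content-address check of the chunk.
func contentAddressValid(data []byte) bool { return len(data) > 0 }

func validate(locs []Location) (invalid int) {
	for _, loc := range locs {
		data, err := readChunk(loc)
		if err != nil {
			// Logged with the same "invalid chunk" message as a hash
			// mismatch, but note: invalid is NOT incremented here, so
			// the summary line undercounts torn chunks.
			fmt.Printf("warning: invalid chunk location=%+v error=%v\n", loc, err)
			continue
		}
		if !contentAddressValid(data) {
			fmt.Printf("warning: invalid chunk location=%+v\n", loc)
			invalid++ // only content mismatches reach the counter
		}
	}
	return invalid
}

func main() {
	n := validate([]Location{{Shard: 4, Slot: 213768}, {Shard: 4, Slot: 213775}})
	fmt.Println("invalid =", n) // prints "invalid = 0": the torn chunk was logged but never counted
}
```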
@Fatima-yo please share your findings here
The issue exists, but it is a bit tricky to trigger.
If the Bee node is sitting idle when the PC/OS/drive crashes, it won't get corrupted. If you are uploading a file when the PC/OS/drive crashes, it WILL get corrupted.
It is easy to reproduce if you take this into account; the sketch below suggests why the idle/active distinction matters.
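That idle-versus-active split is consistent with ordinary write buffering: writes land in the OS page cache (plus whatever volatile cache the USB bridge adds) and only reach the medium on a sync or a periodic kernel flush, so an idle node has nothing in flight to lose. A generic Go illustration, assuming nothing about bee's internals:

```go
// durability.go — a generic illustration (nothing bee-specific) of why a
// crash mid-write loses data while an idle process does not. Writes land
// in the OS page cache first; only Sync forces them to the medium, and a
// USB bridge may add its own volatile cache on top.
package main

import (
	"log"
	"os"
)

func main() {
	f, err := os.OpenFile("chunk.dat", os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if _, err := f.Write([]byte("chunk payload")); err != nil {
		log.Fatal(err)
	}
	// Without this, a power loss right now can leave chunk.dat short or
	// zero-filled: the write only reached the page cache. An idle node
	// survives because the kernel flushes dirty pages within seconds.
	if err := f.Sync(); err != nil {
		log.Fatal(err)
	}
}
```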
Context
v2.1.0 on sepolia testnet swarm
Summary
I had 13 nodes running on a single host, 2 on an internal NVME SSD and the other 11 on a USB-3-connected SATA SSD. Due to reasons outside of swarm, this host crashed without warning. On powering up, I discovered that all 11 nodes on the external SSD had sharky corruptions as indicated by `db validate`. The 2 nodes on the internal NVME SSD validated cleanly. It is also worth noting that this host crashed while it was very active pushing a new OSM hybrid-redundancy dataset into the testnet swarm. The nodes were definitely NOT idle when the crash happened.
Expected behavior
I would hope that recovery from a system crash would not cause a corrupted sharky. But I can almost understand it with a USB-connected drive.
Actual behavior
Verification results from multiple nodes on that single host.
Steps to reproduce
I guess if you have a very active node and crash the host with the data-dir on an external USB drive, you may be able to duplicate the corruption.
Possible solution
Don't run with the data-dir on an external drive?
At least it may be worth noting somewhere in the documentation that running on external drives carries a risk of data corruption.