ethersphere / bee

Bee is a Swarm client implemented in Go. It’s the basic building block for the Swarm network: a private, decentralized, and self-sustaining network for permissionless publishing and access to your (application) data.
https://www.ethswarm.org
BSD 3-Clause "New" or "Revised" License

sharky corruption after host crash #4737

Open ldeffenb opened 3 months ago

ldeffenb commented 3 months ago

Context

v2.1.0 on sepolia testnet swarm

Summary

I had 13 nodes running on a single host: 2 on an internal NVMe SSD and the other 11 on a USB-3-connected SATA SSD. For reasons unrelated to Swarm, this host crashed without warning. After powering back up, I discovered that all 11 nodes on the external SSD had sharky corruption, as indicated by db validate. The 2 nodes on the internal NVMe SSD validated cleanly.

It is also worth noting that this host crashed while it was very actively pushing a new OSM hybrid-redundancy dataset into the testnet swarm. The nodes were definitely NOT idle when the crash happened.

Expected behavior

I would hope that recovery from a system crash would not cause a corrupted sharky. But I can almost understand it with a USB-connected drive.
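
The pattern here (only the busy nodes corrupted, and only on the external drive) is what you would expect if recently written data was still sitting in the OS page cache or the drive's volatile write cache when the host died. A minimal Go sketch of that general principle, not Bee's actual write path:

```go
package main

import (
	"log"
	"os"
)

// Minimal illustration, not Bee's code: data written to a file is only
// crash-durable once it has been flushed to stable storage with Sync().
// Anything still sitting in the OS page cache or the drive's volatile
// write cache when the host dies is lost, which matches corruption showing
// up only on nodes that were actively writing at the time of the crash.
func main() {
	f, err := os.OpenFile("chunk.dat", os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if _, err := f.Write(make([]byte, 4096)); err != nil {
		log.Fatal(err)
	}
	// Without this Sync, a crash right now can leave the file short or
	// stale even though the Write above returned no error.
	if err := f.Sync(); err != nil {
		log.Fatal(err)
	}
}
```

Some USB-to-SATA bridges are also known to handle cache-flush commands less reliably than internal NVMe drives, which could explain why only the nodes on the external SSD were affected.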

Actual behavior

Validation results from the affected nodes on that single host:

"time"="2024-07-26 13:22:49.541998" "level"="info" "logger"="node" "msg"="validation finished" "duration"="40m1.635440786s" "invalid"=2 "soc"=1319448 "total"=2137804
"time"="2024-07-26 13:23:33.389734" "level"="info" "logger"="node" "msg"="validation finished" "duration"="40m29.69367306s" "invalid"=41 "soc"=1321454 "total"=2140554
"time"="2024-07-26 13:30:12.261303" "level"="info" "logger"="node" "msg"="validation finished" "duration"="49m38.373076047s" "invalid"=228 "soc"=4198475 "total"=6213034
"time"="2024-07-26 13:24:33.398450" "level"="info" "logger"="node" "msg"="validation finished" "duration"="41m12.410722607s" "invalid"=148 "soc"=1336141 "total"=2159790
"time"="2024-07-26 13:24:23.690402" "level"="info" "logger"="node" "msg"="validation finished" "duration"="40m45.640637335s" "invalid"=22 "soc"=1329395 "total"=2150760
"time"="2024-07-26 13:24:28.904050" "level"="info" "logger"="node" "msg"="validation finished" "duration"="40m36.74342581s" "invalid"=47 "soc"=1319130 "total"=2137422
"time"="2024-07-26 13:24:36.198084" "level"="info" "logger"="node" "msg"="validation finished" "duration"="40m24.78211206s" "invalid"=31 "soc"=1319296 "total"=2137618
"time"="2024-07-26 13:24:15.713920" "level"="info" "logger"="node" "msg"="validation finished" "duration"="39m40.56859582s" "invalid"=8 "soc"=1319289 "total"=2137613
"time"="2024-07-26 13:20:43.081423" "level"="info" "logger"="node" "msg"="validation finished" "duration"="35m53.235082068s" "invalid"=88 "soc"=1325899 "total"=2146762
"time"="2024-07-26 13:24:59.632308" "level"="info" "logger"="node" "msg"="validation finished" "duration"="40m5.682027361s" "invalid"=10 "soc"=1329492 "total"=2151330
"time"="2024-07-26 13:25:11.043347" "level"="info" "logger"="node" "msg"="validation finished" "duration"="40m12.737092877s" "invalid"=5 "soc"=1366998 "total"=2201914

Steps to reproduce

I guess that if you have a very active node and crash the host while the data-dir is on an external USB drive, you may be able to reproduce the corruption.

Possible solution

Don't run with data-dir on an external drive?

At the very least, it may be worth noting somewhere in the documentation that running with the data-dir on an external drive carries a risk of data corruption.

Fatima-yo commented 1 month ago

I was able to reproduce the problem. While it does not happen if the node is inactive at the time of the crash, it does happen if the node crashes while you are uploading a big file.
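
For anyone else trying to reproduce this, here is a rough helper sketch (hypothetical, not part of Bee; it assumes the default API on localhost:1633 and a usable postage batch ID) that keeps a node busy writing. The crash itself still has to be induced externally, e.g. by cutting power or disconnecting the drive, while the upload is in flight; killing just the bee process is not enough, because the OS would still flush its page cache.

```go
package main

import (
	"bytes"
	"crypto/rand"
	"fmt"
	"log"
	"net/http"
)

// Hypothetical reproduction helper: stream a large random upload to a local
// Bee node's API so it is actively writing chunks when the crash is induced.
func main() {
	const batchID = "<postage-batch-id>" // replace with a real batch ID

	data := make([]byte, 256<<20) // 256 MiB of random data
	if _, err := rand.Read(data); err != nil {
		log.Fatal(err)
	}

	req, err := http.NewRequest(http.MethodPost, "http://localhost:1633/bzz", bytes.NewReader(data))
	if err != nil {
		log.Fatal(err)
	}
	req.Header.Set("Content-Type", "application/octet-stream")
	req.Header.Set("Swarm-Postage-Batch-Id", batchID)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println("upload finished with status:", resp.Status)
}
```

After the host comes back up, running db validate against the node's data-dir shows whether sharky came back corrupted.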

tmm360 commented 1 month ago

I suspect this is the exact cause of our data corruptions.

ldeffenb commented 1 month ago

This time it wasn't a host crash; instead, an external USB-3 SSD decided to drop to read-only. After running fsck, I did a db validate, and here are the results for the 8 sepolia testnet nodes on that disk.

"time"="2024-09-15 04:18:17.003541" "level"="info" "logger"="node" "msg"="validation finished" "duration"="1h4m9.981821568s" "invalid"=107 "soc"=820971 "total"=2877575
"time"="2024-09-15 04:41:19.907673" "level"="info" "logger"="node" "msg"="validation finished" "duration"="1h26m49.780242364s" "invalid"=157 "soc"=1131450 "total"=3643401
"time"="2024-09-15 04:54:48.560921" "level"="info" "logger"="node" "msg"="validation finished" "duration"="1h40m5.61195841s" "invalid"=43 "soc"=2783885 "total"=6848446
"time"="2024-09-15 04:47:15.433710" "level"="info" "logger"="node" "msg"="validation finished" "duration"="1h32m6.103982111s" "invalid"=558 "soc"=1132018 "total"=3882796
"time"="2024-09-15 04:45:17.990865" "level"="info" "logger"="node" "msg"="validation finished" "duration"="1h29m51.167401584s" "invalid"=138 "soc"=1132112 "total"=3738894
"time"="2024-09-15 04:44:32.199148" "level"="info" "logger"="node" "msg"="validation finished" "duration"="1h28m35.409959412s" "invalid"=111 "soc"=1133881 "total"=3638833
"time"="2024-09-15 04:22:26.020211" "level"="info" "logger"="node" "msg"="validation finished" "duration"="1h6m11.076314829s" "invalid"=68 "soc"=821095 "total"=2874815
"time"="2024-09-15 04:45:55.153415" "level"="info" "logger"="node" "msg"="validation finished" "duration"="1h29m9.896151561s" "invalid"=73 "soc"=1133914 "total"=3647901
"time"="2024-09-15 04:47:58.990756" "level"="info" "logger"="node" "msg"="validation finished" "duration"="1h30m58.795960225s" "invalid"=30 "soc"=1132307 "total"=3649879
"time"="2024-09-15 04:47:15.763540" "level"="info" "logger"="node" "msg"="validation finished" "duration"="1h29m59.39903985s" "invalid"=149 "soc"=1132293 "total"=3643709
"time"="2024-09-15 04:48:36.339089" "level"="info" "logger"="node" "msg"="validation finished" "duration"="1h31m5.163448941s" "invalid"=49 "soc"=1132224 "total"=3656289

I was actively uploading a test OSM dataset to the sepolia swarm when the disk went offline.

Also, I hadn't noticed this type of validation error before. Maybe sharky was extended just before the crash and got reverted by the fsck? What is really concerning is that the invalid count doesn't seem to increase when this error is logged:

"time"="2024-09-15 04:47:12.510417" "level"="warning" "logger"="node" "msg"="invalid chunk" "address"="79acfe7a19aaf5ad494f4deecac6cf978fac54bf4937bcb60fb1b4b0162cbd4f" "timestamp"="2024-09-15 02:52:52 +0000 UTC" "location"="shard: 4, slot: 213768, length: 4104" "error"="read 0: EOF"
"time"="2024-09-15 04:47:13.357727" "level"="warning" "logger"="node" "msg"="invalid chunk" "address"="79bc7bd94a8a4354780b7a4caf1f031d0fcab38b7f26511fd17fe9f757e6ecbb" "timestamp"="2024-09-15 02:52:54 +0000 UTC" "location"="shard: 4, slot: 213775, length: 1080" "error"="read 0: EOF"
"time"="2024-09-15 04:47:18.822251" "level"="warning" "logger"="node" "msg"="invalid chunk" "address"="7a2eeb81caf76f66c00c35676b09982c0843a27ce4f8a0491198fe2d6fa635a1" "timestamp"="2024-09-15 02:52:54 +0000 UTC" "location"="shard: 4, slot: 213776, length: 4104" "error"="read 0: EOF"

I re-ran db validate on the node where I noticed these errors; they are still there but do NOT count as invalid chunks. That does not bode well for this node unless I nuke it. I also just realized that this is the node that was actually doing the upload; the other nodes were simply receiving the pushed chunks.
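
That truncation theory would also explain the shape of the error: if fsck shrank a shard file back past slots that sharky had already handed out, reading those slots hits the end of the file and returns zero bytes with EOF. A small Go sketch of that failure mode, with a hypothetical file name, slot numbering and slot size rather than sharky's actual layout:

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"os"
)

// Hypothetical check, not Bee's code: if a shard file was truncated back to
// a shorter length, any slot whose offset lies beyond the new end reads
// zero bytes and io.EOF, which is how "read 0: EOF" warnings can show up
// in the validate output.
func slotBeyondEOF(path string, slot, slotSize int64) (bool, error) {
	f, err := os.Open(path)
	if err != nil {
		return false, err
	}
	defer f.Close()

	buf := make([]byte, slotSize)
	n, err := f.ReadAt(buf, slot*slotSize)
	if n == 0 && errors.Is(err, io.EOF) {
		return true, nil // slot starts past the end of the (truncated) shard
	}
	if err != nil && !errors.Is(err, io.EOF) {
		return false, err
	}
	return false, nil
}

func main() {
	// Illustrative values only: the real shard file name, slot size and
	// on-disk layout are sharky internals this sketch does not know.
	truncated, err := slotBeyondEOF("sharky/shard_004", 213768, 4104)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("slot lies beyond end of shard:", truncated)
}
```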

nikipapadatou commented 3 weeks ago

@Fatima-yo please share your findings here

Fatima-yo commented 3 weeks ago

The issue exists, but it is a bit tricky to trigger.

If the Bee node is running idle when the PC/OS/drive crashes, it won't get corrupted. If the crash happens while you are uploading a file, it WILL get corrupted.

It is easy to reproduce if you take this into account.