btrfs / btrfs-todo

An issues-only repo to organize our TODO items

Error correction on a single drive #51

Open ErrorNoInternet opened 10 months ago

ErrorNoInternet commented 10 months ago

Could error-correcting codes be stored so that bad sectors and bit flips could be corrected transparently? This would be useful on laptops with only a single internal drive. DUP costs you half of your usable space, and RAID5 across multiple partitions on the same drive is extremely slow because the drive has to seek between the partitions. Technically par2 could be used, but it isn't suitable for things like 100+ GB virtual machine images or files that are constantly being updated.

nefelim4ag commented 9 months ago

It's much more complicated than that. In your case, if you care about the data, make backups.

HDDs/SSDs already have error-correcting codes and correct bitrot errors themselves. If that correction fails, the disk returns an error for the I/O request and nothing more can be done on the software side.

So, to handle bad sectors, we need RAID across different disks or on the same disk (DUP). If you are asking for erasure-coded sector-level correction, that would require an on-disk format change, you would still have write amplification, and it would be slow.

Bit flips are also tricky. AFAIK btrfs reserves 32 bytes per checksum in metadata, and by default only 4 of them are used for CRC32. CRC32 can only reliably detect errors up to a small number of flipped bits (roughly 4-5 bits for a 4k block, because of possible hash collisions), so any error correction built on top of it cannot work reliably.

With xxhash we use an 8-byte hash with better collision resistance, so it would be possible to do something here. 32 - 8 = 24 bytes are unused in each checksum record. Without large changes we could compute a simple XOR (RAID5-style) over every 16-byte group of the sector and store that parity word in the metadata. That costs time and CPU, but let's ignore that for now.
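
A minimal sketch of that idea (hypothetical, not actual btrfs code; it assumes the third-party `xxhash` Python package as a stand-in for the kernel's xxhash64): split the 4 KiB sector into 256 groups of 16 bytes and XOR them into one 16-byte parity word, which would fit in the 24 unused bytes of the checksum item.

```python
import xxhash  # third-party package, stand-in for the kernel's xxhash64

SECTOR_SIZE = 4096
GROUP_SIZE = 16                        # parity word fits in the 24 unused checksum bytes
GROUPS = SECTOR_SIZE // GROUP_SIZE     # 256 groups per 4 KiB sector

def csum_item(sector: bytes) -> tuple[bytes, bytes]:
    """Return (8-byte xxhash64, 16-byte XOR parity) for one sector."""
    assert len(sector) == SECTOR_SIZE
    parity = bytearray(GROUP_SIZE)
    for g in range(GROUPS):
        group = sector[g * GROUP_SIZE:(g + 1) * GROUP_SIZE]
        for i in range(GROUP_SIZE):
            parity[i] ^= group[i]
    return xxhash.xxh64(sector).digest(), bytes(parity)
```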

On a checksum error, we could simply brute-force it: reconstruct each possible candidate for the 4k sector from the parity (~256 attempts, one per 16-byte group) and recompute the checksum after each one. If the checksum matches, we have successfully corrected the error (as long as the corruption is confined to a single 16-byte group). Cool, right?
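
Continuing the hypothetical sketch above: on a mismatch, assume exactly one 16-byte group was corrupted, rebuild each candidate group from the stored parity and the other 255 groups, and accept the first candidate whose xxhash64 matches again, so at most 256 attempts per sector.

```python
def try_repair(sector: bytes, want_hash: bytes, want_parity: bytes) -> bytes | None:
    """Brute-force single-group repair: at most GROUPS hash recomputations."""
    for bad in range(GROUPS):
        # XOR of the parity word with every group except `bad`
        # reconstructs what the `bad` group should have been.
        rebuilt = bytearray(want_parity)
        for g in range(GROUPS):
            if g == bad:
                continue
            group = sector[g * GROUP_SIZE:(g + 1) * GROUP_SIZE]
            for i in range(GROUP_SIZE):
                rebuilt[i] ^= group[i]
        candidate = (sector[:bad * GROUP_SIZE] + bytes(rebuilt)
                     + sector[(bad + 1) * GROUP_SIZE:])
        if xxhash.xxh64(candidate).digest() == want_hash:
            return candidate   # checksum matches again: correction succeeded
    return None                # damage spans several groups, or the csum itself rotted
```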

Taking into account how disks behave on errors, the only case where this would help is in-memory bitrot. But even that may not work, because we simply don't know where it happened. Literally: if the checksum itself is bitrotted, there is no chance to restore anything; only if the data is bitrotted can the scheme work.

Does it make sense to pay the additional computation cost and require everyone who wants this to use xxhash plus an XOR parity alongside the checksum? I don't know.

Anyway, it will not help with a faulty disk and it will not fix sector I/O errors, so you will still lose your corrupted data and still need to replace the drive.

jelazual commented 5 months ago

I was thinking this would be a good feature some months ago as well, and I think there are good reasons why it would be worth adding.

First of all, HDD/SSD controllers are a black box and cannot be trusted to ECC/checksum with 100% reliability. If they could, we wouldn't need checksums in the filesystem; we could just write bytes. So, since we already checksum and we already RAID, single-drive bitrot detection is worth thinking about too.

Further, it should be possible to do parity-based bitrot protection at arbitrary redundancy levels, like Parchive does, using block maps. Reserving 1% of the drive, every 100 blocks of the filesystem you have 1 parity block striped into the data, which can check the integrity of the previous 100 data blocks and correct up to 1 block of them on a csum mismatch or unreadable block. It may also be better to use larger chunks, say 10 parity blocks per 1000 data blocks, to cope with adjacent bitrot, so you can survive 10 bad reads per 1000 blocks of data. Ideally you could assign an arbitrary percentage of the drive to parity.

This is good for offline/cold storage, where multiple sectors/blocks may become unreadable. We can already use Parchive to achieve this, but Parchive is extremely slow; btrfs could do this kind of parity check much more quickly, albeit less reliably and less transparently, while the data is in flight to the drive. That makes it good for "not super important, but it would be nice to have some bitrot protection" HDD shelf storage, which is presently the cheapest way to archive large quantities of data. And since the parity is computed on data in flight, much of the csum/parity work can be done in memory, saving read-head thrashing on systems with more RAM.
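
A rough sketch of that layout (purely hypothetical, nothing btrfs implements; fixed-size blocks and plain XOR parity are assumed, so only one bad block per stripe is recoverable): one parity block per 100 data blocks gives the ~1% overhead described above.

```python
BLOCK_SIZE = 4096
STRIPE = 100                 # data blocks covered by one parity block (~1% overhead)

def parity_block(data_blocks: list[bytes]) -> bytes:
    """XOR of the data blocks in one stripe, written once per STRIPE blocks."""
    parity = bytearray(BLOCK_SIZE)
    for block in data_blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)

def rebuild(bad_index: int, data_blocks: list[bytes], parity: bytes) -> bytes:
    """Reconstruct one unreadable or csum-mismatching block from the rest of the stripe."""
    survivors = [b for i, b in enumerate(data_blocks) if i != bad_index]
    return bytes(p ^ s for p, s in zip(parity, parity_block(survivors)))
```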

It could also be used on a per-subvolume basis, to ensure important personal data has some degree of recoverability in case of corruption. btrfs is very good at detecting corruption, but there are still ways to consistently end up with corrupted data where you shouldn't, such as corrupting files inside subvolume snapshots by imaging an entire drive while a file in the root subvolume is in use; e.g. disk-imaging an active drive with Firefox open will corrupt the cookies.sqlite file. Checksums plus a small amount of parity could recover from these small errors smoothly. There are also cases where a drive will write bad data because it is nearly full, and parity could help roll the overwritten data back to a previous usable state in many cases.

I'm not sure parity can actually help with the above two cases in an in-flight filesystem, since the parity may itself be overwritten with bad data, but I think a write -> check -> write-parity-if-intact process is possible for at least some of these cases, or possibly snapshotting the parity together with the subvolume to improve snapshot integrity at least. There's really no good reason a snapshot of a subvolume should suffer data corruption from writes to the source subvolume, but it can still happen.

We should have parity for sanity's sake, because we don't trust software or hardware, and because the best solution is not always the available one. I think it might already be technically possible to do something like this using mdadm and multiple btrfs partitions, but that is a bad solution with many, many more problems than an in-filesystem parity reserve would have.

Edit: though per-subvolume parity could present a lot of problems, with differing parity values across subvolume snapshots and a great deal of data amplification when you have lots of subvolumes. It might be best to inherit parity only on read-only snapshots, and to drop the parity flag when creating RW snapshots, to keep parity growth manageable. It would also likely be a problem for in-flight deduplication. Per-subvolume parity is perhaps too complicated to do in flight; maybe it should instead be an option to scan and create parity for read-only subvolumes?