jhermsmeier opened this issue 7 years ago

New issue to track suggestions made in https://github.com/resin-io/etcher/issues/735#issuecomment-274985283 to keep & expose data on blocks which failed to be written during the flashing of an image, and to see if it's feasible to get a percentage of failed blocks somehow (then we can figure out what to do with the data).
Keeping track of a percentage should be pretty straightforward, as we only need to count failed blocks against blocks written (basically adding to counters, which should be trivial with the "new" block-write-streams).
It shouldn't even be too hard to keep the blocks that failed to be written, as long as it's not a high percentage failing (or we just keep the block addresses, then even that wouldn't be much of a problem until we hit really high percentages on large drives).
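Roughly the bookkeeping I have in mind – just a sketch, not wired into block-write-streams, and all the names here are made up:

```js
// Sketch: count written vs. failed blocks and remember the failed addresses.
class FlashStats {
  constructor () {
    this.blocksWritten = 0;
    this.blocksFailed = 0;
    this.failedAddresses = [];
  }

  // Call once per block attempt; `address` is the block's byte offset on the device
  record (address, succeeded) {
    if (succeeded) {
      this.blocksWritten += 1;
    } else {
      this.blocksFailed += 1;
      this.failedAddresses.push(address);
    }
  }

  get failureRatio () {
    const total = this.blocksWritten + this.blocksFailed;
    return total === 0 ? 0 : this.blocksFailed / total;
  }
}
```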
Just to clarify – #735 wasn't talking about block-write failures (I believe @jviotti made a change so that failed writes are attempted something like 10 times before giving up), but about block-verify failures (which implies that we'd need to change the checksumming process).
> It shouldn't even be too hard to keep the blocks that failed to be written, as long as it's not a high percentage failing (or we just keep the block addresses, then even that wouldn't be much of a problem until we hit really high percentages on large drives).
Maybe keep a precise count up to a certain limit, and if that limit is surpassed, just keep and report a percentage (e.g. 25% failed)? If a lot of blocks failed, detailed information isn't very useful anymore.
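Something like this, on top of whatever counters we keep – just a sketch, the cap is arbitrary:

```js
// Sketch: keep exact failed-block addresses only up to a limit;
// beyond that, fall back to counters and report just the percentage.
const MAX_TRACKED_FAILURES = 65536; // arbitrary cap

function recordFailure (stats, address) {
  stats.blocksFailed += 1;
  if (stats.failedAddresses.length < MAX_TRACKED_FAILURES) {
    stats.failedAddresses.push(address);
  } else {
    stats.addressesTruncated = true; // only the percentage is meaningful from here on
  }
}
```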
> but about block-verify failures (which implies that we'd need to change the checksumming process).
Yeah, exactly, that is the challenge of this feature.
> but about block-verify failures (which implies that we'd need to change the checksumming process).
Oh, sorry – I somehow missed that. Well, if we want to track which blocks didn't check out while verifying, we will need to use entirely different hashing mechanisms like Merkle trees (used in BitTorrent, IPFS, and other filesystems) or rolling hashes, such as Rabin fingerprinting (used by LBFS and the dat project, and probably a better choice for what we're thinking about here). CRC32 (which is prone to collisions), MD5, and the SHA-family (or similar) won't do us much good in this case.
I'm starting to think it could actually make sense to drop the full disk CRC / MD5 / SHA / etc. checksumming entirely, and only verify the source image with those (i.e. if a file with the same basename but a `.md5` extension is present) – possibly, but not necessarily, before even starting to flash the image, so we know that the source is OK.
Then we could calculate Rabin fingerprints of the block stream while writing, and verify the flashed device with those afterwards – that would give us the ability to determine exactly which blocks were corrupted, compute a percentage, and so on.
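Very rough sketch of that write-then-verify flow – I've used plain per-block SHA-256 as a stand-in for the Rabin fingerprints just to show the bookkeeping, and the block size is only illustrative:

```js
const crypto = require('crypto');
const fs = require('fs');

const BLOCK_SIZE = 256 * 1024; // illustrative only

// Pass 1: hash every block of the image stream while it's being written
function hashBlocks (imagePath) {
  return new Promise((resolve, reject) => {
    const blocks = [];
    const stream = fs.createReadStream(imagePath, { highWaterMark: BLOCK_SIZE });
    stream.on('data', (block) => {
      blocks.push({
        length: block.length, // the last block may be shorter than BLOCK_SIZE
        digest: crypto.createHash('sha256').update(block).digest()
      });
    });
    stream.on('end', () => resolve(blocks));
    stream.on('error', reject);
  });
}

// Pass 2: read the flashed device back and compare block by block
async function verifyDevice (devicePath, blocks) {
  const handle = await fs.promises.open(devicePath, 'r');
  const buffer = Buffer.alloc(BLOCK_SIZE);
  const corrupted = [];
  try {
    let position = 0;
    for (let index = 0; index < blocks.length; index++) {
      const { length, digest } = blocks[index];
      await handle.read(buffer, 0, length, position);
      const actual = crypto.createHash('sha256').update(buffer.slice(0, length)).digest();
      if (!actual.equals(digest)) {
        corrupted.push(index);
      }
      position += length;
    }
  } finally {
    await handle.close();
  }
  return { corrupted, percentage: (corrupted.length / blocks.length) * 100 };
}
```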
Following that, we'd basically have some more options:
But presumably any rolling-hash or fingerprinting scheme would have to be a tradeoff between the blocksize used and the memory used to store the whole result for a potentially multi-gigabyte disk image, which might have been streamed from the internet?
Pinging @petrosagg as he might want to join in the continuation of the conversation from #735
I think the memory requirements should be low enough to just keep the hashes in memory – except for the smallest block sizes (which would be terribly inefficient to write anyways):
image | image size | block size | block count | hash length | memory required |
---|---|---|---|---|---|
Raspbian Jessie | 4371513344 B (4.07 GB) | 262144 B (256 KB) | 16676 | 32 B | 533632 B (~521 KB) |
Raspbian Jessie | 4371513344 B (4.07 GB) | 512 B (0.5 KB) | 8538112 | 32 B | 273219584 B (~260.6 MB) |
Random 10 GB | 10737418240 B (10 GB) | 262144 B (256 KB) | 40960 | 32 B | 1310720 B (1.25 MB) |
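The last column is just `ceil(image size / block size) × hash length`:

```js
// Memory needed to keep one 32-byte digest per block
function verificationMemory (imageSize, blockSize, hashLength = 32) {
  return Math.ceil(imageSize / blockSize) * hashLength;
}

verificationMemory(4371513344, 262144); // 533632 B (~521 KB)
verificationMemory(4371513344, 512);    // 273219584 B (~260.6 MB)
```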
So we need to support MD5 and other common checksum algorithms for the downloading phase, for images we know about that are hosted in the cloud.
When the user attempts to stream an image with extended information, we calculate the checksum it specifies as part of the downloading phase and compare the result with the expected value once the download completes.
In the meantime, etcher-image-write can calculate another type of checksum that fits this purpose (e.g. a rolling hash) and recalculate that same checksum from the drive itself.
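The download-phase part is basically just piping the stream through a hash – a minimal sketch, assuming the expected value comes from the extended image information:

```js
const crypto = require('crypto');

// Sketch: verify the advertised checksum while the image is being downloaded.
function checksumStream (downloadStream, algorithm, expected) {
  return new Promise((resolve, reject) => {
    const hash = crypto.createHash(algorithm);
    downloadStream.on('data', (chunk) => hash.update(chunk));
    downloadStream.on('error', reject);
    downloadStream.on('end', () => {
      const actual = hash.digest('hex');
      resolve({ ok: actual === expected, actual });
    });
  });
}
```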
Another way to go would be what Tizen already provides: their XML file contains checksums (usually SHA-1 or SHA-256) for every block range. Keeping that in mind, we can calculate the checksum of every X blocks and store it as we go, then recalculate and compare afterwards (so no need for rolling hashes).
Of course this means that we're not doing per-block checksums (otherwise I guess it'd be wasteful, although I'd like to see some numbers), so we can't be that precise in our results. As far as I remember, the block size we use in etcher-image-write is 1 MB, so maybe hashing every MB is not that bad (especially if we choose a fast algorithm).
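A rough sketch of that – the chunk size and algorithm here are both placeholders:

```js
const crypto = require('crypto');
const { Transform } = require('stream');

const CHUNK_SIZE = 1024 * 1024; // 1 MB, mirroring the write block size mentioned above

// Sketch: checksum every 1 MB chunk as it passes through the write pipeline,
// Tizen-bmap style, instead of keeping a hash per 512 B block.
class ChunkChecksumStream extends Transform {
  constructor () {
    super();
    this.checksums = [];
    this.hash = crypto.createHash('sha256');
    this.bytesInChunk = 0;
  }

  _transform (data, encoding, callback) {
    let offset = 0;
    while (offset < data.length) {
      const take = Math.min(CHUNK_SIZE - this.bytesInChunk, data.length - offset);
      this.hash.update(data.slice(offset, offset + take));
      this.bytesInChunk += take;
      offset += take;
      if (this.bytesInChunk === CHUNK_SIZE) {
        this.checksums.push(this.hash.digest('hex'));
        this.hash = crypto.createHash('sha256');
        this.bytesInChunk = 0;
      }
    }
    callback(null, data); // pass the data through untouched
  }

  _flush (callback) {
    if (this.bytesInChunk > 0) {
      this.checksums.push(this.hash.digest('hex')); // trailing partial chunk
    }
    callback();
  }
}
```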
> Another way to go would be what Tizen already provides: their XML file contains checksums (usually SHA-1 or SHA-256) for every block range. Keeping that in mind, we can calculate the checksum of every X blocks and store it as we go, then recalculate and compare afterwards (so no need for rolling hashes).
Indeed, I just added an issue regarding that to https://github.com/resin-io-modules/blockmap/issues/6
> So we need to support MD5 and other common checksum algorithms for the downloading phase, for images we know about that are hosted in the cloud.
Yup, I was only suggesting dropping them for the verification step, when reading back from the flashed device, but still verifying the source with them, if that makes sense?
> Of course this means that we're not doing per-block checksums (otherwise I guess it'd be wasteful, although I'd like to see some numbers), so we can't be that precise in our results. As far as I remember, the block size we use in etcher-image-write is 1 MB, so maybe hashing every MB is not that bad (especially if we choose a fast algorithm).
I think we could still do per-block rolling hashes when using bmaps (in addition to checking the bmap region checksums), as some mapped regions might be quite large, and only having to rewrite a few blocks is probably plenty faster than having to rewrite an entire mapped region.
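Something along these lines, assuming we kept the per-block hashes from the write pass (just a sketch):

```js
// Sketch: when a bmap region checksum fails, use the per-block hashes
// to rewrite only the blocks within that region that actually differ.
function blocksToRewrite (expectedBlockHashes, actualBlockHashes, regionStartBlock) {
  const mismatched = [];
  for (let index = 0; index < expectedBlockHashes.length; index++) {
    if (!expectedBlockHashes[index].equals(actualBlockHashes[index])) {
      mismatched.push(regionStartBlock + index);
    }
  }
  return mismatched;
}
```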
I've been talking nonsense here, I realised:
> CRC32 (which is prone to collisions), MD5, and the SHA-family (or similar) won't do us much good in this case.
Looking at the table above, we can just as well hash every block with MD5 or SHA or whatever floats our boat and keep the hashes in memory for the block sizes we use. Don't know why my mind was going to those complicated places with this before.
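e.g. a flat buffer of per-block digests, which is exactly the "memory required" column above – sketch only:

```js
const crypto = require('crypto');

// Sketch: pack one SHA-256 digest per block into a single preallocated Buffer.
function makeBlockHashStore (blockCount, hashLength = 32) {
  const store = Buffer.alloc(blockCount * hashLength);
  return {
    set (blockIndex, block) {
      crypto.createHash('sha256').update(block).digest().copy(store, blockIndex * hashLength);
    },
    get (blockIndex) {
      return store.slice(blockIndex * hashLength, (blockIndex + 1) * hashLength);
    }
  };
}
```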