diskfs / go-diskfs

squashfs: fix file reading #197

Closed by ncw 8 months ago

ncw commented 8 months ago

This is a series of patches to fix up the reading in squashfs.

The first introduces some new tests and then we attempt to fix the tests!

Update: all tests are working now!

ncw commented 8 months ago

Update: all the tests are fixed now - it turned out to be a very simple fix.

This could be merged now so I'll unset the draft.

deitch commented 8 months ago

Do you mind describing in prose here what the issue was? I am reading it in the code, but a description here would be very helpful.

deitch commented 8 months ago

Also, is the logic in the comments right before the changes in squashfs/file.go still correct? Or does it need updating?

ncw commented 8 months ago

Do you mind describing in prose here what the issue was? I am reading it in the code, but a description here would be very helpful.

I noticed when testing the code that it had trouble reading most files. In fact, the only files it could read reliably were very short ones stored entirely in fragments, or files whose size was an exact multiple of the block size. Those happen to be the types of files in file.sqs!

I decided to try to fix this. The first thing I did was make a new read_test.sqs with files of different sizes which I knew would exercise the different parts of the code (I updated this as I carried on with the debugging process).

This let me track down the problems with the code.

I made one commit per fix to try and keep the code reviewable.

commit e608eb191686b06cad479be259ce209517ed01f3

squashfs: fix parsing of extended file inodes

Before this change it was accidentally using the size of the basic
file inode as the offset of the blocklist.

This was just a copy-paste bug, I think: the 16 got copied from the basic inode and needed to be changed to 40.
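
For orientation, a minimal sketch of where the 16 and 40 come from, based on the squashfs on-disk format rather than the package's actual parsing code (names here are illustrative):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Field sizes from the squashfs on-disk format: the basic file inode body has
// four u32 fields (16 bytes) before its block-size list, while the extended
// file inode body has three u64 fields followed by four u32 fields (40 bytes)
// before its block-size list. The bug was reusing 16 for the extended case.
const (
	basicFileHeaderSize    = 4 * 4     // 16 bytes
	extendedFileHeaderSize = 3*8 + 4*4 // 40 bytes
)

// parseExtendedBlockSizes is illustrative only (not the package's parser):
// skip the 40-byte fixed part of the extended file inode body and decode the
// trailing little-endian u32 block sizes.
func parseExtendedBlockSizes(body []byte, nblocks int) ([]uint32, error) {
	need := extendedFileHeaderSize + 4*nblocks
	if len(body) < need {
		return nil, fmt.Errorf("inode body too short: %d < %d", len(body), need)
	}
	sizes := make([]uint32, nblocks)
	for i := range sizes {
		off := extendedFileHeaderSize + 4*i
		sizes[i] = binary.LittleEndian.Uint32(body[off : off+4])
	}
	return sizes, nil
}

func main() {
	// a fabricated inode body: 40-byte fixed part followed by two block sizes
	body := make([]byte, extendedFileHeaderSize+8)
	binary.LittleEndian.PutUint32(body[extendedFileHeaderSize:], 131072)
	binary.LittleEndian.PutUint32(body[extendedFileHeaderSize+4:], 4096)
	sizes, _ := parseExtendedBlockSizes(body, 2)
	fmt.Println(sizes, basicFileHeaderSize) // [131072 4096] 16
}
```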

commit 13ce837eaa47aa6e9c4db731f3d6130882b27a9d

squashfs: add check for unexpected block size

This commit isn't strictly necessary - it doesn't fix anything, but it makes the code return a sensible error when things aren't as expected, which was the case quite a lot of the time while debugging.
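
The check is tiny; a hedged sketch of its shape (names are illustrative, not the PR's code):

```go
package squashfs // illustrative placement only

import "fmt"

// checkBlockSize sketches the defensive check: a parsed or decompressed data
// block should never be larger than the filesystem block size. Returning a
// descriptive error here is far friendlier than the index panic that would
// otherwise show up later.
func checkBlockSize(got, blockSize int) error {
	if got > blockSize {
		return fmt.Errorf("unexpected block size %d, maximum is %d", got, blockSize)
	}
	return nil
}
```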

commit d105557d39f7bfe87418a7878516a6254761992c

squashfs: close the gz compressor to make sure we read CRC errors

This is as recommended by the Go docs. It doesn't fix anything; it was just something I noticed while reviewing the code.
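
A sketch of the pattern, assuming compress/zlib for the decompression (squashfs "gzip" streams are zlib-framed; the library's actual reader may differ). The point is simply to read to EOF and not discard the error from Close:

```go
package main

import (
	"bytes"
	"compress/zlib"
	"fmt"
	"io"
)

// decompress inflates a zlib-framed block. Reading to EOF lets the reader
// verify the stream checksum, and Close is called as the package docs ask,
// with its error checked rather than thrown away.
func decompress(in []byte) ([]byte, error) {
	zr, err := zlib.NewReader(bytes.NewReader(in))
	if err != nil {
		return nil, err
	}
	out, err := io.ReadAll(zr) // a checksum mismatch surfaces here as an error
	if cerr := zr.Close(); err == nil {
		err = cerr
	}
	if err != nil {
		return nil, err
	}
	return out, nil
}

func main() {
	// round-trip a small buffer to show the happy path
	var buf bytes.Buffer
	zw := zlib.NewWriter(&buf)
	zw.Write([]byte("hello squashfs"))
	zw.Close()
	out, err := decompress(buf.Bytes())
	fmt.Println(string(out), err) // "hello squashfs" <nil>
}
```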

commit 11f9a4c489e37add7d939001f6092f1c9989bb36

squashfs: fix file reading

- Fix array bounds errors by taking more care reading the correct data out of the buffers
- Fix potential 32/64 bit problems
- Fix off by one termination condition in loop
- Error if a fragment is expected but we don't have one (shouldn't happen)
- Error if we ever read no data (shouldn't happen)

This was the big fix. The main problem here was that this code gave array index errors if you tried to read something whose size wasn't an exact multiple of the block size. It went through quite a few iterations trying to get all the test cases to pass, but I think it is correct now :crossed_fingers:

The main fix was to be very careful with exactly how much data was read from the input block. The fragment reading needed the same logic so I factored it into a closure.

This introduces the pos variable, which keeps track of the file position while finding the correct blocks to read data from. That is what enables the blocks to be clipped correctly and fixes the index errors.
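
A rough sketch of that clipping (illustrative names, not the PR's actual code; uses Go 1.21's built-in min/max): pos is the file offset where the current block starts, and only the overlap between the block and the requested range gets copied:

```go
package main

import "fmt"

// copyFromBlock copies into dst the part of block that overlaps the requested
// read. pos is the file offset of the block's first byte; readOff is the file
// offset the caller asked to read from. It returns how many bytes were copied.
func copyFromBlock(dst, block []byte, pos, readOff int64) int {
	blockEnd := pos + int64(len(block))
	reqEnd := readOff + int64(len(dst))
	start := max(pos, readOff)
	end := min(blockEnd, reqEnd)
	if start >= end {
		return 0 // this block does not overlap the requested range
	}
	return copy(dst[start-readOff:end-readOff], block[start-pos:end-pos])
}

func main() {
	// a 16-byte block whose first byte sits at file offset 32
	block := []byte("0123456789abcdef")
	dst := make([]byte, 8)
	// read 8 bytes starting at file offset 36: expect "456789ab"
	n := copyFromBlock(dst, block, 32, 36)
	fmt.Println(n, string(dst[:n])) // 8 456789ab
}
```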

I also added some defensive programming, which was useful when trying to make this work before I realised that the blocklist parsing was actually wrong (see above) and the code was being passed corrupted data.

I fixed up some 32/64 bit problems to make sure that the code would work on architectures where int is 32 bits.

commit 5bb1d3854a579ad6645e53a9a2d1cddae85447ca

squashfs: fix block size list parsing

- make sure we don't overflow int when using large files
- fix reading 4x too few blocks
- correct inverted logic on whether a fragment exists or not
- pass the calculated block sizes into the fetching routine

This was the other very important fix.

There were two main things wrong here: the number of blocks was being calculated incorrectly (the fragment-exists logic was back to front), and the code was using byte indexing when it should have been using uint32 indexing. The fix calculates the number of blocks once and passes it down to parseFileBlockSizes rather than getting it to calculate it again (which it was also doing wrong, but in a different way!).

I also fixed up some 32/64 bit problems to make sure that the code would work on architectures where int is 32 bits.
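A sketch of the two calculations, with hedged names (parseFileBlockSizes is the real function the commit mentions, but this is not its code): the block count depends on whether the tail of the file lives in a fragment, and the size list is indexed per uint32 entry, i.e. 4 bytes each:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// blockCount returns how many data blocks a file of fileSize bytes has.
// If the tail of the file lives in a fragment, the trailing partial block is
// not counted; otherwise the count is rounded up. Arithmetic stays in int64
// so it also works where int is 32 bits.
func blockCount(fileSize, blockSize int64, hasFragment bool) int64 {
	if hasFragment {
		return fileSize / blockSize
	}
	return (fileSize + blockSize - 1) / blockSize
}

// parseBlockSizes decodes count little-endian uint32 entries: one per block,
// 4 bytes each - indexing by entry, not by byte.
func parseBlockSizes(b []byte, count int64) ([]uint32, error) {
	if int64(len(b)) < 4*count {
		return nil, fmt.Errorf("block size list too short: %d bytes for %d blocks", len(b), count)
	}
	sizes := make([]uint32, count)
	for i := int64(0); i < count; i++ {
		sizes[i] = binary.LittleEndian.Uint32(b[4*i : 4*i+4])
	}
	return sizes, nil
}

func main() {
	// 300 KiB file, 128 KiB blocks: 2 full blocks plus a 44 KiB tail
	fmt.Println(blockCount(300*1024, 128*1024, true))  // 2 - tail is in a fragment
	fmt.Println(blockCount(300*1024, 128*1024, false)) // 3 - tail is a short block
}
```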

commit 96796b1f918717fdd66daf74c8986f21710638d7

squashfs: a zero-sized chunk is to be interpreted as a full block of zeros

This is as specified in the specification.

This is mentioned in the spec: a zero-sized block should be treated as block_size worth of zeroes.

Sparse files are handled at block_size granularity. If an entire block is found to be full of zero bytes, the block isn't written to disk. Instead a size of zero is stored in the inode.
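
In code the rule is a small special case before any decompression happens; a hedged sketch (illustrative names only):

```go
package squashfs // illustrative only

// sparseBlock implements the rule quoted above: a stored size of zero means
// the block was all zeros and was never written to disk, so it must be
// materialised as blockSize bytes of zeros rather than read and decompressed.
func sparseBlock(storedSize uint32, blockSize int) ([]byte, bool) {
	if storedSize != 0 {
		return nil, false // a real block: read and decompress it as usual
	}
	return make([]byte, blockSize), true
}
```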

commit 49683234a565293f3191de6ee1c41668f9b060a6

squashfs: use the blocksize of squashfs not that supplied by the user for decompression

I noticed that it was using the wrong blocksize when using an external file. It needs to use the blocksize from the squashfs header, not the one the user passes in, otherwise everything goes wrong!
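
The fix is conceptually a one-liner; a sketch with made-up names (not go-diskfs's real types or fields):

```go
package squashfs // illustrative only

// superblock and fileSystem stand in for the real types.
type superblock struct {
	blockSize uint32 // data block size recorded in the squashfs header
}

type fileSystem struct {
	sb superblock
	// userBlockSize is whatever the caller supplied when opening the image;
	// it belongs to disk I/O, not to squashfs data blocks.
	userBlockSize int64
}

// dataBlockSize is the size to use when decompressing and assembling data
// blocks: always the superblock's value, never userBlockSize.
func (fs *fileSystem) dataBlockSize() int {
	return int(fs.sb.blockSize)
}
```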

Also, is the logic in the comments right before the changes in squashfs/file.go still correct? Or does it need updating?

Yes it still works exactly like that.

As a future improvement, it would be sensible to cache the last uncompressed block in the file handle: unless you are reading in exact block_size chunks, it decompresses the same data twice (or indeed many times).

deitch commented 8 months ago

As a future improvement, it would be sensible to cache the last uncompressed block in the file handle: unless you are reading in exact block_size chunks, it decompresses the same data twice (or indeed many times).

Yeah, that would be good... for a future one. I think this one covers enough.

Thank you for the detailed explanation.

Based on how you described it, it looks like each commit actually could stand alone? If each were a separate PR, then the mainline branch would be valid at each given commit? No, I don't want separate PRs at this point, but if they really do stand alone, no point in doing a squash-and-merge, as I usually do.

ncw commented 8 months ago

As a future improvement, it would be sensible to cache the last uncompressed block in the file handle: unless you are reading in exact block_size chunks, it decompresses the same data twice (or indeed many times).

Yeah, that would be good... for a future one. I think this one covers enough.

Yes, I'll do that as a follow-up. I think it will double file-reading performance for relatively little cost (a small buffer per open file).
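
A sketch of what that per-handle cache might look like (hypothetical names; not a commitment to the eventual implementation):

```go
package squashfs // illustrative only

// cachedBlockFile sketches the follow-up idea: remember the last block that
// was decompressed so that sequential small reads within one block do not
// decompress the same data repeatedly.
type cachedBlockFile struct {
	lastBlockIndex int64  // index of the cached block
	lastBlockData  []byte // its decompressed contents, nil if nothing cached
}

// block returns the decompressed contents of block i, using the cache when
// the same block is requested again. decompress is a stand-in for the real
// fetch-and-decompress path.
func (f *cachedBlockFile) block(i int64, decompress func(int64) ([]byte, error)) ([]byte, error) {
	if f.lastBlockData != nil && f.lastBlockIndex == i {
		return f.lastBlockData, nil
	}
	data, err := decompress(i)
	if err != nil {
		return nil, err
	}
	f.lastBlockIndex, f.lastBlockData = i, data
	return data, nil
}
```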

Thank you for the detailed explanation.

Based on how you described it, it looks like each commit actually could stand alone? If each were a separate PR, then the mainline branch would be valid at each given commit? No, I don't want separate PRs at this point, but if they really do stand alone, no point in doing a squash-and-merge, as I usually do.

Yes, each commit does stand alone. I tried to make the commits individually reviewable, as I know how hard reviewing code is. They could be individual PRs, but merging without squashing has the same effect. (I developed this patch series in a completely different order; once I was done fixing things, I separated it out into logically separate fixes!)

Took me quite a bit to understand it (but a lot less than understanding the innards of squashfs 😆 ), but I see it now.

:-) I can't say I understand the innards of squashfs completely, but I know more about it now than I did!

deitch commented 8 months ago

I can't say I understand the innards of squashfs completely, but I know more about it now than I did!

Well, if you feel like learning more about ext4 and inode trees, the help would be appreciated. The ext4 implementation has been progressing very slowly for a long time, largely blocked on some of the inode tree rebalancing; I am sure there is more to do there. One of these lifetimes I will get someone who wants it enough to pay for a few months of consulting and finish it. I have been tempted to ask GPT-4 as well.