Problem Extracting 7z Archives - Githubissues

adamhathcock / sharpcompress

SharpCompress is a fully managed C# library to deal with many compression types and formats.

MIT License

2.28k stars 480 forks source link

Problem Extracting 7z Archives #73

Open hodm opened 9 years ago

hodm commented 9 years ago

There is a problem when extracting small 7z files packed with ultra compression. Exception: Data Error StackTrace: at SharpCompress.Compressor.LZMA.Decoder.Code(Int32 dictionarySize, OutWindow outWindow, Decoder rangeDecoder) in b:\Users\Hod\Documents\GitHub\sharpcompress\SharpCompress\Compressor\LZMA\LzmaDecoder.cs:line 357 get 7z file at: http://www.filedropper.com/mg83 inside there is a program.cs file that I used for test, it uses the added option that I added in pull request for, but you can remove them.

adamhathcock commented 9 years ago

I have to confess that 7z is a drop in implementation from another project. If LZMA has changed I will unlikely be able to fix it unless another project has. If it's a matter of reading 7z headers correctly then it's possible but I don't have the time these days to devote to this.

weltkante commented 9 years ago

The link doesn't seem to work for me, if you can refresh it I can take a look.

(I'm the author of the managed-lzma library used in some branches of SharpCompress, but the stack trace seems to come from the lzma-sdk implementation, right? I can take a look anyways since I understand most of the workings of the 7z format and may be able to spot where it goes wrong.)

Anarchid commented 8 years ago

This issue seems to be responsible here:

The archive in this case is http://springrts.com/dl/buildbot/default/develop/102.0.1-19-gfbc1b5b/win32/spring_%7bdevelop%7d102.0.1-19-gfbc1b5b_win32-minimal-portable.7z

Alternative implementations unpack it without a hitch.

weltkante commented 8 years ago

Surprisingly my own library also has problems reading this particular archive. Didn't expect that since (unlike the sharpcompress version) it's a straight port of the official C/C++ code ... interesting, I'll take a look and try to figure out where the implementations diverge, since the official 7z application seems to be able to unpack it.

weltkante commented 8 years ago

There's actually two issues with this file, one being that the file attributes are linux and not windows, thats (probably) the big red 81a48020 block going on in your screenshot. I'll need to have a look how the official C/C++ codebase detects and handles linux file attributes.

Then there's the exception when actually decoding the data, which is unrelated to the linux file attributes. I did have a similar exception in my own codebase which resulted from incorrect updating of the incremental lzma decoder when decoding large files. I'll have a look at the sharpcompress codebase if they have the same issue or if it is something else.

weltkante commented 8 years ago

Updating my findings about linux file attributes: they are not part of the 7z spec and won't be written by the official 7z codebase.

Some third party implementations abused the fact that the official implementation is lazy in its error checking and started to hide linux file attributes in unused bits of the 7z metadata. Newer versions of the official 7z UI acknowledge this and try to parse the unofficial linux attributes, but it's still not part of the actual 7z archive codebase, it's only handled by the UI client.

Since it now seems to have become a kind of de-facto standard I'll update my implementation to support this unofficial extension and also prepare a pull request for SharpCompress to not treat this as a broken archive.

(Still need to look at the 2nd exception when decoding a stream.)

weltkante commented 8 years ago

I've fixed the issues with this particular archive in my library, now taking a look at SharpCompress.

First thing I noted those big red blobs at the top of your screenshot are not actual errors, they are debug output from my old prototype 7z parser which someone must have copied over to SharpCompress and its still outputting a dump of the 7z structure every time it reads an archive in a debug build :-)

Anyways, even though thats not an error, since the linux file attributes make no sense in the context of .NET FileAttribute enum I'm still suggesting to strip this unofficial extension so consumers of the 7z archive are not trying to do something stupid. This should be safe because even with the posix extension present the creator of the archive should have set the low 15 bits to a valid Windows/.NET FileAttribute enum combination (which is the case in your example archive). I've done a pull request to make this change.

I'm still looking into the exception triggered during stream decoding, which is not the same as I had in my library. It's coming from the middle of the decoder so there is probably a bug in it. Unfortunately, as far as I know, the lzma decoder used in SharpCompress is based on very old code derived from a one-time Java port and nobody really supports it. I'll give it a try and run it side-by-side with my working lzma decoder, but I can make no promises that I can find the bug. In that case the only realistic short term fix I can suggest is to replace the lzma decoder with a working one.

weltkante commented 8 years ago

You are lucky, the decoder is not the problem.

7z archives may require multiple "views" into the underlying stream, as you can see in the CreateDecoderStream method which loops over a list of address ranges and creates multiple substreams from them.

In the current code you are creating ReadOnlySubStream instances which seek to the desired position and then are done with it. That's not going to work because all of your streams are operating on the same underlying stream and the last seek wins. All other ReadOnlySubStream instances are going to return garbage when reading data from them.

The obvious fix is to maintain your own position within the underlying stream and seek before every read, however since the RangeCoder classes used by BCJ2 and LZMA decoders both like to do 1-byte-reads this totally destroys performance, having to do a seek for every single byte requested from the underlying stream ;-)

Therefore I added a class BufferedSubStream, being a modified copy of the ReadOnlySubStream, having a builtin 32kb buffer simply to avoid having to seek on every request.

With the pull request applied I'm able to read the linked archive.