Extraction errors for ISOs that are handled by 7zip

jhhcs commented 2 years ago

This is related to #3, #81, and #76. I am using this (fantastic, might I say) library in malware research to unpack ISO files that contain malicious samples. It works extremely well. Every now and then, however, I run into an ISO file that can be unpacked without errors using 7zip, but pycdlib throws an exception. I have collected a few samples that illustrate each of the different exceptions I have run into so far:

errors.zip (password: infected) Note that this archive contains ISO files that very likely contain malware. Please handle accordingly.

The following is a test script that does everything I also do in production; it expects the above archive to be extracted to a folder called "errors" in its the current working directory. It is supposed to list the names of all files within each ISO together with some timestamp information.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import datetime
import pycdlib
import pathlib
import io

for path in pathlib.Path.cwd().glob('errors/*'):
    listing = []
    try:
        iso = pycdlib.PyCdlib()
        with open(path, 'rb') as stream:
            data = stream.read()
        with io.BytesIO(data) as stream:
            iso.open_fp(stream)
        if iso.has_udf():
            facade = iso.get_udf_facade()
        elif iso.has_joliet():
            facade = iso.get_joliet_facade()
        elif iso.has_rock_ridge():
            facade = iso.get_rock_ridge_facade()
        else:
            facade = iso.get_iso9660_facade()
        for root, _, files in facade.walk('/'):
            root = root.rstrip('/')
            for name in files:
                name = name.lstrip('/')
                name = F'{root}/{name}'
                try:
                    info = facade.get_record(name)
                except Exception:
                    info = None
                    date = None
                else:
                    date = datetime.datetime(
                        info.date.years_since_1900 + 1900,
                        info.date.month,
                        info.date.day_of_month,
                        info.date.hour,
                        info.date.minute,
                        info.date.second,
                        tzinfo=datetime.timezone(datetime.timedelta(minutes=15 * info.date.gmtoffset))
                    )
                listing.append(F'{path.name}: {date.isoformat("T","seconds")} {name}')
    except Exception as error:
        print(F'{path.name}: ERROR; {error!s}')
    else:
        print(*listing, sep='\n')

With the samples from the above archive and version 1.12.0 of pycdlib, the output is the following:

00cd2850489cf197ac4ba99786d192dd0135fdbd6922130a0ad17ecf905de2d1: ERROR; Invalid SUSP SP record
06cd1e75f05d55ac1ea77ef7bee38bb3b748110b79128dab4c300f1796a2b941: ERROR; File structure version expected to be 1
13a28a1c9dd619fb9ac9e6db7e7ab291d4944e3367b847be13807821c8d0cf3e: ERROR; UDF File Set Terminator Tag identifier not 8
1ee7a84afebc657e0252a828b95e19eaeb95d9eb59f025de4ce5a85d8637f7d6: ERROR; 'NoneType' object has no attribute 'name'
276e7a55df16522ee3e187815ff76fa3562b8a609632acbb9fea4ec385941694: ERROR; Unknown SUSP record
27fdfc58cadf156a2385fd72923511acc1b309c96ab0f7b3b0edc99edad1960a: ERROR; Little-endian and big-endian set size disagree
2d484db5b69f000267f0f176e32fd3adf1e4b6e99876d2f77c76eb5018813010: ERROR; 'NoneType' object has no attribute 'name'
2fbda2bbcefb56cabcbc0b2bd1bf0a83300b00198810396c7acbe5743ec52a5c: ERROR; list index out of range
3547ce8751f05737ef3c8b7fda4929e0d74b710e033e2b8ae20d7873688930cc: ERROR; Inconsistent Rock Ridge versions on the ISO!
6e3df3aa5f6dedd765a865ca7799982154a9c19f230d80a469d6cffd5e7cbf72: ERROR; unpack_from requires a buffer of at least 8 bytes for unpacking 8 bytes at offset 0 (actual buffer size is 6)
7236d0770b1aaa5ef162660713fdfc4237050b63ad7f3d3e929ccff6712ab34e: ERROR; unpack_from requires a buffer of at least 512 bytes for unpacking 512 bytes at offset 0 (actual buffer size is 0)
822c04f4af4d112035cccb9723492d0e1cc33d39c64a1fc8770359443343353b: ERROR; Little-endian and big-endian seqnum disagree
b229372a2b31f94799b84ae96c4d87327faabcc1cccd2cf3e4f6afb2aa2e7a23: ERROR; Expected at least 2 UDF Anchors
d0b714860765ab1c5d751a7237b1c768248754f2b593b70deccac266366f2289: ERROR; 'UDFFileEntry' object has no attribute 'date'
d17b4c19412a5ad927b6249a29a3b7901ce8815dcfa048531e67a4439298dcc5: ERROR; Expected at least 2 UDF Anchors
e1567a6661b475460efe1b89a9ead760cc887b0951c74e6b70631bc08e8e4a5b: ERROR; list index out of range
e81af9201d76081c695f616a118b0c7e16087d8a8bef5e44daa63e7396bd8c4f: ERROR; Little-endian and big-endian path table size disagree
ea6cd95fbceed7e3048f79d1c596abc770ef8a4bc3661785570188aa65e01a36: ERROR; File structure version expected to be 1
fb30d6dd48127f24ac503578fc42832847133f1a93b01b7c9f7f6c7ffa3c2a63: ERROR; unpack_from requires a buffer of at least 512 bytes for unpacking 512 bytes at offset 0 (actual buffer size is 0)
fd7f46fa918330922a485fca84bab48ea9d069e0bdb21d2230cc2b874ccc8c48: ERROR; 'UDFFileEntry' object has no attribute 'date'
ff42e1a03e7d70b48305d83edf6ed5fce0a858ef001acafab86e9e03c561f30c: ERROR; UDF File Set Terminator Tag identifier not 8

I understand that you were previously (see #81) outspoken against an option for "relaxed" parsing. Have you considered, instead, to use the logging module to display warnings that come up during parsing, and to only throw an exception when it is absolutely impossible to recover? I think this would be a good solution as warnings do not remain invisible this way, but if the library can recover, it does not prevent extraction of the files either.

PS: I would be happy to contribute with a pull request if you can generally agree to the above strategy.

clalancette commented 2 years ago

Hi there,

First, thanks for using the library and bringing this up.

I understand that you were previously (see #81) outspoken against an option for "relaxed" parsing. Have you considered, instead, to use the logging module to display warnings that come up during parsing, and to only throw an exception when it is absolutely impossible to recover? I think this would be a good solution as warnings do not remain invisible this way, but if the library can recover, it does not prevent extraction of the files either.

Yes, I am very agreeable to this. In point of fact, my strategy up until this point has been to relax the parsing requirements as people have run into them. I'd say that there are currently 3 different "levels" of parsing errors currently:

Ones in which it is impossible to go on. This would be things like the root directory record being corrupted, or other such things that would prevent pycdlib from enumerating the metadata on the ISO. (Note that even in these cases, I can imagine some "recovery" tools that might be able to reconstruct some of this data, but I don't think in the normal case we want pycdlib to do this).
Ones in which it might be possible to go on. This includes things like the big-endian sizes disagreeing with the little-endian sizes. This usually indicates that the ISO is corrupt in some way, but we can just use the little-endian size and hope for the best.
Ones in which it is definitely possible to go on. This includes checking some of the "dead" areas for compliance for all zeros, for instance.

For errors of type 1, there's really nothing we can do; if essential metadata is wrong or missing, we have to give up. For errors of type 2, we currently always throw an exception, but we could log an error and attempt to go on. For errors of type 3, we could either ignore the problem completely (though this might indicate a corrupt ISO), or we could log an error and go on.

I'd be happy to relax things for errors of type 2 and type 3, though the devil is going to be in the details of which exceptions are of which of the three types.

PS: I would be happy to contribute with a pull request if you can generally agree to the above strategy.

That would be great! That would provide us a place to discuss whether each one is of type 1, 2, or 3 (and it might even make sense to extend the PycdlibException classes to formally encode that distinction). Thanks for bringing this up.

jhhcs commented 2 years ago

Fantastic, thanks for the response. I'll dive into the code and I hope to have some pull requests for you soon. I'll probably open a separate PR for each individual "relaxation" and use this issue to track our progress from more of a bids-eye view.

huettenhain commented 2 years ago

I finally sat down to do some of this. I have a question: Would you agree to having the test samples I have procured added to the repository using git-lfs? I would like to do this test-based, i.e. add one test for each outlier and then provide a suggested adjustment, and then do one pull request for each of those.

huettenhain commented 2 years ago

PS: Since they are potential malware samples, I would store them in a quarantine format (XOR-encrypted with a fixed key and compressed).

clalancette commented 2 years ago

PS: Since they are potential malware samples, I would store them in a quarantine format (XOR-encrypted with a fixed key and compressed).

I really don't want to be storing and then providing malware-laden ISOs anywhere. What I'd rather do here is to look at the changes you are proposing, and then specially hand-craft ISO files that trigger those problems. Usually I do this kind of thing in the tests by generating a valid ISO, then going in and modifying a few bytes as necessary (something like https://github.com/clalancette/pycdlib/blob/5b341b35a5846a24349227c04cd6e81561dee3fc/tests/integration/test_parse.py#L1255-L1274).

huettenhain commented 2 years ago

I understand. Unfortunately, it is beyond my understanding of the ISO format to do this, and on my fork I still have 8 samples where I don't even have any idea how to achieve successful extraction 😓.

clalancette commented 1 year ago

Given that we have merged at least one fix for this issue, and the age of it, I'm going to close this out. Do feel free to open new issues/pull requests for anything else that you'd like to attempt to relax, and I'll be happy to look at it.

huettenhain commented 1 year ago

Absolutely, that makes sense. I haven't found enough time to look into it further, and if I do, I will create separate issues.

clalancette / pycdlib

Extraction errors for ISOs that are handled by 7zip #82