dtrx-py / dtrx

Do The Right Extraction
GNU General Public License v3.0

DTRX and docker saved images #13

Closed erhan- closed 2 years ago

erhan- commented 3 years ago

I have a wrapper for dtrx and use it in a batch process.

When a docker image is saved as described in the documentation and dtrx is run on the exported tar.gz, it extracts everything and at some point runs cpio on one of the extracted files. That cpio command gets stuck and prints errors about malformed content. I have to kill the cpio process at that moment so that my batch process can continue. Has anyone extracted a docker image with dtrx before?

noahp commented 3 years ago

I'll take a look!

erhan- commented 2 years ago

After much digging I found a file where this occurs. When I run dtrx -rn on it, it hangs silently until you kill the cpio subprocess it started. Then you will see errors like:

cpio: Malformed number |#x;
cpio: Malformed number |#x;`
cpio: Malformed number x|#x;`
cpio: Malformed number |#x;`
cpio: Malformed number #x;`
cpio: Malformed number #x;`
cpio: Malformed number x;`
!$ file cpio
cpio: ELF 32-bit MSB executable, PowerPC or cisco 4500, version 1 (SYSV), dynamically linked, interpreter /lib/ld.so.1, for GNU/Linux 2.6.4, stripped

I am not sure how to upload this file.

So I renamed the file and ran dtrx -rn again, and look:

dtrx -rn testoo
dtrx: ERROR: could not handle testoo
dtrx: ERROR: not a known archive type

Okay, let's rename it back to "cpio": aaaand stuck again :)

So let's look at the function try_by_extension():

https://github.com/dtrx-py/dtrx/blob/5aee09c12de0d57c2f77ee6b04a19ca368792b12/scripts/dtrx#L1347

So the problem occurs when the filename contains no dots and is identical to a known extension. In that case the whole name is treated as the extension:

>>> filename = "cpio"
>>> parts = filename.split(".")[-2:]
>>> parts
['cpio']
>>> filename = "blabla.cpio"
>>> parts = filename.split(".")[-2:]
>>> parts
['blabla', 'cpio']
>>> filename = "blabla.jdksj.dsjkj.tar.gz"
>>> parts = filename.split(".")[-2:]
>>> parts
['tar', 'gz']

This means the function should only add to the results when len(parts) > 1, i.e. when the filename actually contains a dot.

There are many ways to achieve this and the implementation can vary:

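    def try_by_extension(cls, filename):
        # Keep at most the last two dot-separated parts, e.g. ['tar', 'gz'].
        parts = filename.split('.')[-2:]
        results = []
        # A name without any dot has no extension, so do not match it.
        if len(parts) == 1:
            return results
        # Try the longest suffix first ('tar.gz'), then the shorter one ('gz').
        while parts:
            results.extend(cls.extension_map.get('.'.join(parts), []))
            del parts[0]
        return results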

I just wrote it like this, but you can do it any other way.

And let's test it:
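
To check the behaviour outside dtrx, here is a minimal self-contained sketch of the same check. EXTENSION_MAP and its values are hypothetical stand-ins for dtrx's real extension_map, and try_by_extension is written as a plain function instead of a method, so this is only an illustration and not the actual patch:

```python
# Hypothetical stand-in for dtrx's extension map; the real one is larger
# and its values are not plain strings.
EXTENSION_MAP = {
    "cpio": ["cpio"],
    "tar.gz": ["tar"],
    "gz": ["gzip"],
}


def try_by_extension(filename, extension_map=EXTENSION_MAP):
    """Return candidate extractors based on the filename's extension(s)."""
    parts = filename.split(".")[-2:]
    results = []
    # A name without any dot has no extension, so never match it.
    if len(parts) == 1:
        return results
    # Try the longest suffix first ("tar.gz"), then the shorter one ("gz").
    while parts:
        results.extend(extension_map.get(".".join(parts), []))
        del parts[0]
    return results


print(try_by_extension("cpio"))                       # no dot, so no match
print(try_by_extension("blabla.cpio"))                # matches "cpio"
print(try_by_extension("blabla.jdksj.dsjkj.tar.gz"))  # matches "tar.gz", then "gz"
```

With this check a file that is merely named "cpio" is no longer handed to the cpio extractor, while "blabla.cpio" still is.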

!$ dtrx -rn cpio
dtrx: ERROR: could not handle cpio
dtrx: ERROR: not a known archive type

I will create an MR when I am at home.

noahp commented 2 years ago

AH! the problem is due to cpio not doing magic number verification before attempting extraction, and it crashes/hangs (depending on the particular cpio binary it's attempting to extract).
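
For illustration only, a magic number check of that kind could be sketched in Python as below. This is not what dtrx or cpio does; looks_like_cpio and file_looks_like_cpio are hypothetical helper names, and the sketch only covers the common cpio header formats (the ASCII magics 070701, 070702 and 070707, plus the old binary format in either byte order):

```python
import struct

# Magic values for the common cpio formats.
ASCII_MAGICS = (b"070701", b"070702", b"070707")  # newc, crc, odc
BINARY_MAGIC = 0o070707  # old binary format, stored as a 16-bit integer


def looks_like_cpio(header):
    """Return True if the first bytes look like a cpio archive header."""
    if header[:6] in ASCII_MAGICS:
        return True
    if len(header) >= 2:
        # The binary format may be written in either byte order.
        (little,) = struct.unpack("<H", header[:2])
        (big,) = struct.unpack(">H", header[:2])
        return BINARY_MAGIC in (little, big)
    return False


def file_looks_like_cpio(path):
    """Hypothetical helper: read the first six bytes of a file and check them."""
    with open(path, "rb") as handle:
        return looks_like_cpio(handle.read(6))
```

An ELF executable such as the cpio binary in the image starts with the bytes 0x7f 'E' 'L' 'F', so a check like this would reject it before any extraction is attempted.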

Reproducing is quite easy:

# this image contains the cpio binary that causes the extraction to hang
❯ docker pull alpine:3.13.6
❯ docker image save alpine:3.13.6 -o alpine.tar.gz
❯ dtrx -rn alpine.tar.gz

The fundamental command that hangs is:

# note: this is the extracted image from above
❯ cpio -i --make-directories --quiet --no-absolute-filenames --file alpine/f7055e235a8665ac2ae79f29bd773c7a40b409e9c5d71905fb6bcb6458d9b66a/layer/usr/bin/cpio

Your fix looks good! I've put a PR up at #15.