aboutcode-org / extractcode

A mostly universal file extraction library and CLI tool to extract almost any archive in a reasonably safe way on Linux, macOS and Windows.
https://www.aboutcode.org/

extractcode errors out on #61

Open mloeser21 opened 5 months ago

mloeser21 commented 5 months ago

Hi, I'm running into a problem with certain .lz4 and also .jar files. Example (lz4):

$:~/SCAN_IMAGES/release-1.13.zip-extract$ ~/scancode-toolkit/extractcode ./release/deploy_artifacts/router.tar.gz-extract/router.tar-extract/0f6a1467d0c8a8fce8ea65eedd0d2ee6e23f979498d128a0101318c7549f90a6/layer.tar-extract/var/lib/apt/lists/deb.debian.org_debian_dists_bullseye_main_binary-amd64_Packages.lz4
Extracting archives...
[####################] 4
ERROR extracting: /home/joe/SCAN_IMAGES/release-1.13.zip-extract/release/deploy_artifacts/router.tar.gz-extract/router.tar-extract/0f6a1467d0c8a8fce8ea65eedd0d2ee6e23f979498d128a0101318c7549f90a6/layer.tar-extract/var/lib/apt/lists/deb.debian.org_debian_dists_bullseye_main_binary-amd64_Packages.lz4: Unrecognized archive format
Extracting done.

But the file is not empty and can be decompressed with the lz4 utility:

$:~/SCAN_IMAGES/release-1.13.zip-extract$ ls -al ./release/deploy_artifacts/router.tar.gz-extract/router.tar-extract/0f6a1467d0c8a8fce8ea65eedd0d2ee6e23f979498d128a0101318c7549f90a6/layer.tar-extract/var/lib/apt/lists/deb.debian.org_debian_dists_bullseye_main_binary-amd64_Packages.lz4
-rw-r--r-- 1 joe users 17315708 Apr 29  2023 ./release/deploy_artifacts/router.tar.gz-extract/router.tar-extract/0f6a1467d0c8a8fce8ea65eedd0d2ee6e23f979498d128a0101318c7549f90a6/layer.tar-extract/var/lib/apt/lists/deb.debian.org_debian_dists_bullseye_main_binary-amd64_Packages.lz4

$:~/SCAN_IMAGES/release-1.13.zip-extract$ lz4 -t ./release/deploy_artifacts/router.tar.gz-extract/router.tar-extract/0f6a1467d0c8a8fce8ea65eedd0d2ee6e23f979498d128a0101318c7549f90a6/layer.tar-extract/var/lib/apt/lists/deb.debian.org_debian_dists_bullseye_main_binary-amd64_Packages.lz4
./release/deploy_art : decoded 45545571 bytes
$:~/SCAN_IMAGES/release-1.13.zip-extract$ lz4 --list ./release/deploy_artifacts/router.tar.gz-extract/router.tar-extract/0f6a1467d0c8a8fce8ea65eedd0d2ee6e23f979498d128a0101318c7549f90a6/layer.tar-extract/var/lib/apt/lists/deb.debian.org_debian_dists_bullseye_main_binary-amd64_Packages.lz4
    Frames           Type Block  Compressed  Uncompressed     Ratio   Filename
         1       LZ4Frame   B4D      16.51M             -         -   deb.debian.org_debian_dists_bullseye_main_binary-amd64_Packages.lz4
$:~/SCAN_IMAGES/release-1.13.zip-extract$ lz4 -dv ./release/deploy_artifacts/router.tar.gz-extract/router.tar-extract/0f6a1467d0c8a8fce8ea65eedd0d2ee6e23f979498d128a0101318c7549f90a6/layer.tar-extract/var/lib/apt/lists/deb.debian.org_debian_dists_bullseye_main_binary-amd64_Packages.lz4
*** LZ4 command line interface 64-bits v1.9.3, by Yann Collet ***
Decoding file ./release/deploy_artifacts/router.tar.gz-extract/router.tar-extract/0f6a1467d0c8a8fce8ea65eedd0d2ee6e23f979498d128a0101318c7549f90a6/layer.tar-extract/var/lib/apt/lists/deb.debian.org_debian_dists_bullseye_main_binary-amd64_Packages
./release/deploy_art : decoded 45545571 bytes

Here is what the file header looks like:

$:~/SCAN_IMAGES/release-1.13.zip-extract$ hexdump ./release/deploy_artifacts/router.tar.gz-extract/router.tar-extract/0f6a1467d0c8a8fce8ea65eedd0d2ee6e23f979498d128a0101318c7549f90a6/layer.tar-extract/var/lib/apt/lists/deb.debian.org_debian_dists_bullseye_main_binary-amd64_Packages.lz4 | head
0000000 2204 184d 4040 cdc0 0078 f200 5003 6361
0000010 616b 6567 203a 6130 0a64 6f53 7275 0c63
0000020 f600 2008 3028 302e 322e 2e33 2d31 2935
0000030 560a 7265 6973 6e6f 203a 0015 7cf5 622b
0000040 0a31 6e49 7473 6c61 656c 2d64 6953 657a
0000050 203a 3032 3632 0a38 614d 6e69 6174 6e69
0000060 7265 203a 6544 6962 6e61 4720 6d61 7365
0000070 5420 6165 206d 703c 676b 672d 6d61 7365
0000080 642d 7665 6c65 6c40 7369 7374 612e 696c
0000090 746f 2e68 6564 6962 6e61 6f2e 6772 0a3e

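The magic bytes are correct: hexdump prints 16-bit little-endian words, so `2204 184d` corresponds to the bytes 04 22 4d 18, i.e. the LZ4 frame magic number 0x184D2204. Please refer to https://github.com/lz4/lz4/blob/dev/doc/lz4_Frame_format.md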
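
For anyone who wants to double-check the header without the lz4 tool, here is a minimal Python sketch that turns the first hexdump line above back into bytes and decodes the frame descriptor fields as laid out in the LZ4 frame format document. It assumes the dump was taken on a little-endian host (default `hexdump` prints 16-bit words in host order). The function names are made up for illustration, and this is not how extractcode itself detects formats.

```python
import struct

# Magic number of an LZ4 frame, per the LZ4 frame format document.
LZ4_FRAME_MAGIC = 0x184D2204


def hexdump_words_to_bytes(line: str) -> bytes:
    """Convert one line of default `hexdump` output back to raw bytes.

    Assumes the first column is the offset and the rest are 16-bit words
    printed on a little-endian host.
    """
    words = line.split()[1:]  # drop the offset column
    return b"".join(struct.pack("<H", int(word, 16)) for word in words)


def describe_lz4_frame_header(data: bytes) -> dict:
    """Decode the magic number plus the FLG and BD descriptor bytes."""
    magic, flg, bd = struct.unpack_from("<IBB", data, 0)
    return {
        "magic_ok": magic == LZ4_FRAME_MAGIC,
        "version": flg >> 6,                       # bits 7-6 of FLG
        "block_independent": bool(flg & 0x20),     # bit 5 of FLG
        "block_checksum": bool(flg & 0x10),        # bit 4 of FLG
        "content_size_present": bool(flg & 0x08),  # bit 3 of FLG
        "content_checksum": bool(flg & 0x04),      # bit 2 of FLG
        "block_max_size_code": (bd >> 4) & 0x07,   # bits 6-4 of BD
    }


first_line = "0000000 2204 184d 4040 cdc0 0078 f200 5003 6361"
header = hexdump_words_to_bytes(first_line)
info = describe_lz4_frame_header(header)
print(header[:4].hex(), info)
```

On this header the sketch reports a valid magic number, frame version 1, dependent blocks and block size code 4, which looks consistent with the `B4D` column in the `lz4 --list` output above.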

Why can lz4 decode it properly but extractcode cannot?

Regards, Matthias

pombredanne commented 5 months ago

@mloeser21 Thanks for the report.

http://deb.debian.org/debian/dists/bullseye/main/binary-amd64/ does not seem to use lz4, so I'm curious how this file was made. This looks like a container image. Which base image do you use?

Also, could you attach the file in question (wrapped in a zip to make GH happy)?

mloeser21 commented 5 months ago

Hi @pombredanne Thanks for the quick response. The container image was created as follows:

docker save ghcr.io/apollographql/router:v1.29.1 -o /tmp/
tar zcvf release/deploy_artifacts/router.tar.gz /tmp/router.tar

As part of router.tar you get various layer.tar archives:

$:~/SCAN_IMAGES/tmp$ tar xvf router.tar
0b1d3fcc8ae40e41993edcb2760b68907f0b94e2525bc1fb58537d5ef5c28018.json
0f6a1467d0c8a8fce8ea65eedd0d2ee6e23f979498d128a0101318c7549f90a6/
0f6a1467d0c8a8fce8ea65eedd0d2ee6e23f979498d128a0101318c7549f90a6/VERSION
0f6a1467d0c8a8fce8ea65eedd0d2ee6e23f979498d128a0101318c7549f90a6/json
0f6a1467d0c8a8fce8ea65eedd0d2ee6e23f979498d128a0101318c7549f90a6/layer.tar
594bbcbc832bef41ae6140bdd25a807811947156b22df5d8ccb141ff1758e02a/
594bbcbc832bef41ae6140bdd25a807811947156b22df5d8ccb141ff1758e02a/VERSION
594bbcbc832bef41ae6140bdd25a807811947156b22df5d8ccb141ff1758e02a/json
594bbcbc832bef41ae6140bdd25a807811947156b22df5d8ccb141ff1758e02a/layer.tar
5ab70ccc1fc958e7cf78cadd641c642e8f4f5bc267797887bc1a636fb69b8a87/
[...]

And those layer.tar archives contain various .lz4 compressed files:

[...]
var/lib/apt/lists/deb.debian.org_debian-security_dists_bullseye-security_InRelease
var/lib/apt/lists/deb.debian.org_debian-security_dists_bullseye-security_main_binary-amd64_Packages.lz4
var/lib/apt/lists/deb.debian.org_debian_dists_bullseye-updates_InRelease
var/lib/apt/lists/deb.debian.org_debian_dists_bullseye-updates_main_binary-amd64_Packages.lz4
var/lib/apt/lists/deb.debian.org_debian_dists_bullseye_InRelease
var/lib/apt/lists/deb.debian.org_debian_dists_bullseye_main_binary-amd64_Packages.lz4
[...]
$ file var/lib/apt/lists/deb.debian.org_debian-security_dists_bullseye-security_main_binary-amd64_Packages.lz4
var/lib/apt/lists/deb.debian.org_debian-security_dists_bullseye-security_main_binary-amd64_Packages.lz4: LZ4 compressed data (v1.4+)
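
To find every affected file in a layer without unpacking it first, something like the following Python sketch could be used. It walks a tar archive with the standard `tarfile` module and reports the regular files that start with the LZ4 frame magic bytes, which is roughly what `file` checks above. The helper names and the tiny in-memory demo archive are hypothetical; for a real layer you would pass `open("layer.tar", "rb")` instead.

```python
import io
import tarfile

# First four bytes of an LZ4 frame as they appear on disk.
LZ4_FRAME_MAGIC = bytes.fromhex("04224d18")


def find_lz4_frames(tar_fileobj) -> list:
    """Return names of regular-file members that start with the LZ4 frame magic."""
    found = []
    with tarfile.open(fileobj=tar_fileobj, mode="r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            handle = tar.extractfile(member)
            if handle is not None and handle.read(4) == LZ4_FRAME_MAGIC:
                found.append(member.name)
    return found


def build_demo_tar() -> io.BytesIO:
    """Build a small in-memory tar with made-up members, for illustration only."""
    members = [
        ("var/lib/apt/lists/example_Packages.lz4", LZ4_FRAME_MAGIC + b"\x40\x40\xc0"),
        ("var/lib/apt/lists/example_InRelease", b"Origin: Debian\n"),
    ]
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in members:
            entry = tarfile.TarInfo(name)
            entry.size = len(payload)
            tar.addfile(entry, io.BytesIO(payload))
    buf.seek(0)
    return buf


print(find_lz4_frames(build_demo_tar()))
```

Checking the magic bytes rather than the `.lz4` extension also catches files that were renamed, at the cost of reading the first bytes of every member.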

Does this help?

pombredanne commented 5 months ago

@mloeser21 re: "Does this help?"

Yes, this is exactly what's needed to track this down!