
A mostly universal file extraction library and CLI tool to extract almost any archive in a reasonably safe way on Linux, macOS and Windows.

============
ExtractCode
============

Supports Windows, Linux and macOS on 64-bit processors and Python 3.6 to 3.9.

ExtractCode is a (mostly) universal archive extractor.

Install with::

    pip install extractcode[full]

Why another extractor?
----------------------

it will extract!

ExtractCode will extract things where other archive and compressed file extractors may fail.

ExtractCode supports one of the largest numbers of archive formats, listed in the long `List of supported archive formats`_ found at the bottom of this document.

In all these cases, ExtractCode will extract and try hard to do the right thing to obtain the actual archived content when other tools may fail.

It can also extract recursively any type of (nested) archives-in-archives.
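As a rough illustration of recursive extraction, here is a minimal sketch using only the Python standard library. It is not ExtractCode's implementation: it handles zip files only, and the function name ``extract_recursively``, the ``-extract`` directory suffix and the ``max_depth`` limit are assumptions made for this example.

```python
import zipfile
from pathlib import Path


def extract_recursively(archive, target, max_depth=5):
    """Extract a zip archive into target, then extract any zip found inside it."""
    archive, target = Path(archive), Path(target)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)
    if max_depth <= 0:
        # Stop descending to avoid runaway recursion on hostile archives
        return
    # Snapshot the extracted files before recursing so that each nested
    # archive is handled exactly once, by its own recursive call.
    for path in sorted(target.rglob("*")):
        if path.is_file() and zipfile.is_zipfile(path):
            # Hypothetical convention: extract next to the nested archive
            nested_target = path.with_name(path.name + "-extract")
            extract_recursively(path, nested_target, max_depth - 1)
```

Calling ``extract_recursively("outer.zip", "outer.zip-extract")`` would leave the content of a nested ``inner.zip`` under ``outer.zip-extract/inner.zip-extract``.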

As a downside, the extracted content may not be exactly what a typical use of the contained files would expect: for instance, some files may be renamed, special files and symlinks are skipped, and permissions and owners are changed. This is fine for the primary use case, which is analysis of file content for software composition or forensic analysis.

Behind the scenes, ExtractCode uses multiple tools such as 7zip and libarchive, and libguestfs for VM images on Linux.

With these, it is possible to extract a large number of common and less common archive and compressed file types. ExtractCode tries to extract things in the same way on all supported OSes. This includes auto-renaming files that would have invalid, non-extractable names on certain filesystems, or that appear as multiple copies of the same path in a given archive (which is possible in a tar).
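The auto-renaming idea can be sketched as follows. This is an illustration only, not ExtractCode's actual code: the function name ``unique_safe_path``, the set of replaced characters and the ``_1``, ``_2`` suffix scheme are all assumptions.

```python
import re

# Characters that are invalid in file names on at least one common filesystem
# (assumption: a conservative set chosen for illustration).
INVALID_CHARS = re.compile(r'[<>:"|?*\\\x00-\x1f]')


def unique_safe_path(path, seen):
    """Return a portable, unique variant of an archive member path.

    ``seen`` is a set of lowercased paths already handed out; it is updated
    in place.
    """
    # Drop empty, "." and ".." segments, then replace invalid characters
    segments = [
        INVALID_CHARS.sub("_", segment)
        for segment in path.split("/")
        if segment not in ("", ".", "..")
    ]
    safe = "/".join(segments) or "_"
    candidate = safe
    counter = 0
    # Compare case-insensitively because some filesystems fold case
    while candidate.lower() in seen:
        counter += 1
        candidate = f"{safe}_{counter}"
    seen.add(candidate.lower())
    return candidate
```

With a shared ``seen`` set, a second member named ``docs/README`` would come out as ``docs/README_1`` instead of overwriting the first one.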

The extraction is driven by a "voting" system that considers the file extension(s) and name, the filetype and the mimetype (detected using a ctypes binding to libmagic) to select the most appropriate extractor or decompressor function. It can handle multi-level archives such as tar.gz and can recursively extract any nested archives.
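A simplified sketch of such a voting scheme is shown here. It is not the actual ExtractCode code: the ``HANDLERS`` table, the handler names and ``pick_handler`` are hypothetical, each matching signal counts for one vote, and the filetype and mimetype are passed in as plain strings instead of being detected with libmagic. The extension, filetype and mimetype values are taken from the format list in this document.

```python
from collections import Counter

# Hypothetical handler table: each entry lists the evidence that votes for it
HANDLERS = {
    "tar_gzip": {
        "extensions": (".tgz", ".tar.gz"),
        "filetypes": ("gzip compressed",),
        "mimetypes": ("application/gzip",),
    },
    "tar_xz": {
        "extensions": (".tar.xz", ".txz"),
        "filetypes": ("xz compressed",),
        "mimetypes": ("application/x-xz",),
    },
    "zstd": {
        "extensions": (".zst", ".zstd"),
        "filetypes": ("zstandard compressed",),
        "mimetypes": ("application/x-zstd",),
    },
}


def pick_handler(filename, filetype, mimetype):
    """Return the handler name with the most matching evidence, or None."""
    votes = Counter()
    name = filename.lower()
    filetype = filetype.lower()
    mimetype = mimetype.lower()
    for handler, spec in HANDLERS.items():
        # One vote per matching signal: extension, filetype, mimetype
        if name.endswith(spec["extensions"]):
            votes[handler] += 1
        if any(ft in filetype for ft in spec["filetypes"]):
            votes[handler] += 1
        if any(mt in mimetype for mt in spec["mimetypes"]):
            votes[handler] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]
```

Because the vote combines several signals, a file with a misleading or missing extension can still be routed to a handler when its detected filetype and mimetype agree.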

Visit https://aboutcode.org and https://github.com/nexB/ for support and download.

We run CI tests on:

Installation
------------

To install this package with its full capability (where the binaries for 7zip and libarchive are installed), use the ``full`` extra option::

    pip install extractcode[full]

If you want to use the version of binaries (possibly) provided by your operating system, use the minimal option::

    pip install extractcode

In this case, you need a working and compatible libarchive and 7zip, installed and configured in one of these ways so that ExtractCode can find them:

The supported versions of the binary tools are:

Development
-----------

To set up the development environment::

    ./configure --dev
    source venv/bin/activate

To run unit tests::

    pytest -vvs -n 2

To clean up the development environment::

    ./configure --clean

To run the command line tool in the activated environment::

    ./extractcode -h

Configuration with environment variables
----------------------------------------

ExtractCode will use these environment variables if set:
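As a general illustration of how an environment variable can point a program at a specific binary, here is a minimal sketch. The variable name ``EXAMPLE_7Z_PATH`` and the helper ``find_tool`` are hypothetical; they are not ExtractCode's actual variable names or API.

```python
import os
import shutil


def find_tool(env_var, command):
    """Return the path set in env_var if it points to a file, else search PATH."""
    configured = os.environ.get(env_var)
    if configured and os.path.isfile(configured):
        return configured
    # Fall back to the first matching executable on PATH, or None if absent
    return shutil.which(command)
```

For example, ``find_tool("EXAMPLE_7Z_PATH", "7z")`` would prefer an explicitly configured location and otherwise use whatever ``7z`` is found on the ``PATH``.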

Adding support for VM images extraction
---------------------------------------

Adding support for VM images requires the manual installation of the libguestfs-tools system package. This is supported only on Linux. On Debian and Ubuntu you can use this command::

    sudo apt-get install libguestfs-tools

On Ubuntu only, an additional manual step is required because the kernel executable file cannot be read by regular users, as libguestfs requires.

Run this command as a temporary and immediate fix::

    sudo chmod 0644 /boot/vmlinuz-*
    # each $k is already the full path of one kernel image
    for k in /boot/vmlinuz-*; do
        sudo dpkg-statoverride --add --update root root 0644 "$k"
    done

You likely want both this temporary fix and a more permanent fix; otherwise each kernel update will revert to the default permissions and ExtractCode will stop working for VM images extraction.

Therefore follow these instructions:

  1. As sudo, create the file /etc/kernel/postinst.d/statoverride with this content, devised by Kees Cook (@kees) in https://bugs.launchpad.net/ubuntu/+source/linux/+bug/759725/comments/3 ::

    #!/bin/sh
    version="$1"
    # passing the kernel version is required
    [ -z "${version}" ] && exit 0
    dpkg-statoverride --update --add root root 0644 /boot/vmlinuz-${version}

  2. Set executable permissions::

    sudo chmod +x /etc/kernel/postinst.d/statoverride

See also these links for a complete discussion:

- https://bugs.launchpad.net/ubuntu/+source/linux/+bug/759725
- https://bugzilla.redhat.com/show_bug.cgi?id=1670790
- https://bugs.launchpad.net/ubuntu/+source/libguestfs/+bug/1813662/comments/24

Alternative
-----------

These other tools are related and were considered before creating ExtractCode:

These tools provide built-in, original extraction capabilities:

These tools are command line tools wrapping other extraction tools and are similar to ExtractCode but with different goals:

List of supported archive formats
---------------------------------

ExtractCode can extract the following archive formats:

Archive format kind: docs


  name: Office doc
     - extensions: .docx, .dotx, .docm, .xlsx, .xltx, .xlsm, .xltm, .pptx, .ppsx, .potx, .pptm, .potm, .ppsm, .odt, .odf, .sxw, .stw, .ods, .ots, .sxc, .stc, .odp, .otp, .odg, .otg, .sxi, .sti, .sxd, .sxg, .std, .sdc, .sda, .sdd, .smf, .sdw, .sxm, .stw, .oxt, .sldx, .epub
     - filetypes : zip archive, microsoft word 2007+, microsoft excel 2007+, microsoft powerpoint 2007+
     - mimetypes : application/zip, application/vnd.openxmlformats

  name: Dia diagram doc
     - extensions: .dia
     - filetypes : gzip compressed
     - mimetypes : application/gzip

  name: Graffle diagram doc
     - extensions: .graffle
     - filetypes : gzip compressed
     - mimetypes : application/gzip

  name: SVG Compressed doc
     - extensions: .svgz
     - filetypes : gzip compressed
     - mimetypes : application/gzip

Archive format kind: regular

name: Tar

Archive format kind: regular_nested


  name: Tar xz
     - extensions: .tar.xz, .txz, .tarxz
     - filetypes : xz compressed
     - mimetypes : application/x-xz

  name: Tar lzma
     - extensions: .tar.lzma, .tlz, .tarlz, .tarlzma
     - filetypes : lzma compressed
     - mimetypes : application/x-lzma

  name: Tar gzip
     - extensions: .tgz, .tar.gz, .tar.gzip, .targz, .targzip, .tgzip
     - filetypes : gzip compressed
     - mimetypes : application/gzip

  name: Tar lzip
     - extensions: .tar.lz, .tar.lzip
     - filetypes : lzip compressed
     - mimetypes : application/x-lzip

  name: Tar lz4
     - extensions: .tar.lz4
     - filetypes : lz4 compressed
     - mimetypes : application/x-lz4

  name: Tar zstd
     - extensions: .tar.zst, .tar.zstd
     - filetypes : zstandard compressed
     - mimetypes : application/x-zstd

  name: Tar bzip2
     - extensions: .tar.bz2, .tar.bz, .tar.bzip, .tar.bzip2, .tbz, .tbz2, .tb2, .tarbz2
     - filetypes : bzip2 compressed
     - mimetypes : application/x-bzip2

  name: lz4
     - extensions: .lz4
     - filetypes : lz4 compressed
     - mimetypes : application/x-lz4

  name: zstd
     - extensions: .zst, .zstd
     - filetypes : zstandard compressed
     - mimetypes : application/x-zstd

  name: Tar 7zip
     - extensions: .tar.7z, .tar.7zip, .t7z
     - filetypes : 7-zip archive
     - mimetypes : application/x-7z-compressed

  name: Tar Z
     - extensions: .tz, .tar.z, .tarz
     - filetypes : compress'd data
     - mimetypes : application/x-compress

Archive format kind: package

name: Ruby Gem package

Archive format kind: file_system


  name: ISO CD image
     - extensions: .iso, .udf, .img
     - filetypes : iso 9660 cd-rom, high sierra cd-rom
     - mimetypes : application/x-iso9660-image

  name: SquashFS disk image
     - extensions:
     - filetypes : squashfs
     - mimetypes :

  name: QEMU QCOW2 disk image
     - extensions: .qcow2, .qcow, .qcow2c, .img
     - filetypes : qemu qcow2 image, qemu qcow image
     - mimetypes : application/octet-stream

  name: VMDK disk image
     - extensions: .vmdk
     - filetypes : vmware4 disk image
     - mimetypes : application/octet-stream

  name: VirtualBox disk image
     - extensions: .vdi
     - filetypes : virtualbox disk image
     - mimetypes : application/octet-stream

Archive format kind: patches

name: Patch

Archive format kind: special_package



  name: InstallShield Installer
     - extensions: .exe
     - filetypes : installshield
     - mimetypes : application/x-dosexec

  name: Nullsoft Installer
     - extensions: .exe
     - filetypes : nullsoft installer
     - mimetypes : application/x-dosexec