galaxyproject / galaxy

Data intensive science for everyone.
https://galaxyproject.org

Use python-isal for compression/decompression #12092

Open rhpvorderman opened 3 years ago

rhpvorderman commented 3 years ago

Galaxy supports fastq.gz files. For anyone interested in very fast gzip compression, I recommend checking out ISA-L, which comes with an igzip application that compresses and decompresses much faster than standard gzip. Much faster in this case means 3x faster decompression and 6x faster compression. It is available on conda-forge and can be installed with conda install -c conda-forge isa-l.

The good news is that Python bindings are also available. I wrote them, and an extensive test suite is used to ensure that they work properly. The Python bindings are now used by xopen and, by extension, cutadapt.

Using python-isal will make decompression a lot faster. For compression there is a slight tradeoff: the file size will be slightly bigger, as ISA-L does not support very high compression levels (but the result is still better than gzip level 1).

EDIT: I am willing to implement this myself if there is interest. Also, I forgot to mention that python-isal has no dependencies (the C library is statically linked), so there is no dependency hell.

mvdbeek commented 3 years ago

If you can make these optional imports (it sounds like the interface is mostly compatible with gzip?), I think that would be a nice extension.
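A minimal sketch of what such an optional import could look like, assuming the isal.igzip module is used as a stand-in for the standard gzip module. The helper names write_gz and read_gz are hypothetical and are not part of Galaxy or python-isal; this is an illustration, not the code that ended up in Galaxy.

```python
import gzip as _stdlib_gzip

try:
    # isal.igzip is designed to mirror the gzip module's interface
    # (open, compress, decompress), so it can be bound to the same name.
    from isal import igzip as gz
except ImportError:
    # python-isal is not installed: fall back to the standard library.
    gz = _stdlib_gzip


def write_gz(path, data: bytes) -> None:
    """Hypothetical helper: write bytes to a gzip file with the available backend."""
    # No compresslevel is passed on purpose: ISA-L only supports low levels,
    # so each backend is left on its own default.
    with gz.open(path, "wb") as handle:
        handle.write(data)


def read_gz(path) -> bytes:
    """Hypothetical helper: read a whole gzip file with the available backend."""
    with gz.open(path, "rb") as handle:
        return handle.read()
```

Both backends produce standard gzip files, so a file written through one can be read through the other.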

rhpvorderman commented 1 month ago

I see isal is now a hard dependency due to your work on https://github.com/galaxyproject/galaxy/pull/17342

I see the current gz to uncompressed converter uses gzip -dcf. However, since python-isal is required, python -m isal.igzip should also be available.

To illustrate the difference I decompress a 1.6GB fastq file here:

Benchmark 1: python -m isal.igzip -cd ~/test/5millionreads_R1.fastq.gz > /dev/null
  Time (mean ± σ):      2.008 s ±  0.011 s    [User: 1.956 s, System: 0.051 s]
  Range (min … max):    1.997 s …  2.028 s    10 runs
Benchmark 1: gzip -cd ~/test/5millionreads_R1.fastq.gz > /dev/null
  Time (mean ± σ):      8.162 s ±  0.080 s    [User: 8.103 s, System: 0.058 s]
  Range (min … max):    8.093 s …  8.375 s    10 runs

4 times faster! By the way, this is mostly due to gzip's code, not to zlib. If I use the pigz implementation on one thread the decompression is also faster than gzip:

Benchmark 1: pigz -p 1 -cd ~/test/5millionreads_R1.fastq.gz > /dev/null
  Time (mean ± σ):      4.123 s ±  0.025 s    [User: 4.076 s, System: 0.047 s]
  Range (min … max):    4.089 s …  4.173 s    10 runs

Still, that makes the python -m isal.igzip command two times faster than any zlib alternative for decompression. Is there a way this could be leveraged in https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/datatypes/converters/gz_to_uncompressed.xml?
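One way this might be approached is sketched below, assuming the converter could call a small helper that picks the decompression command. The function name decompress_command is hypothetical. The -cd flags are the ones used in the benchmark above; the -f behaviour of gzip -dcf is not replicated for the isal.igzip branch, and whether that matters for the converter is an open question.

```python
import importlib.util
import sys


def decompress_command(path: str) -> list:
    """Hypothetical helper: build an argv list that decompresses path to stdout."""
    if importlib.util.find_spec("isal") is not None:
        # Same invocation as in the benchmark: python -m isal.igzip -cd FILE
        return [sys.executable, "-m", "isal.igzip", "-cd", path]
    # Fall back to the command the converter currently uses.
    return ["gzip", "-dcf", path]
```

The resulting list can be passed to subprocess.run with stdout pointed at the output file, for example subprocess.run(decompress_command(src), stdout=out_handle, check=True).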