@erniebrau @pnvaidya Would you be able to take this evaluation on in George's absence?
@droazen It occurs to me that one other axis that might be worth looking at is quality score quantization. I think the BAMs I was looking at were ones that had [unrecalibrated] full-range quality scores, not ones that had been quantized to ~8 or fewer values. If the test files used above were all HiSeq-X, I wonder whether that could have made a difference.
@tfenne Yeah, that's a good point -- definitely something to look at this time around.
I will coordinate with @pnvaidya to take a look at this. Let's discuss a bit more on Monday.
Compression Level | MarkDuplicates avg runtime | MarkDuplicates output size | SortBam avg runtime | SortBam output size |
---|---|---|---|---|
1 | 4.658 hours | 155.15 GB | 3.47 hours | 96.39 GB |
2 | 4.94 hours | 88.14 GB | 3.633 hours | 67.44 GB |
5 | 6.658 hours | 83.2 GB | 4.996 hours | 59.53 GB |
These figures are averages over 6 runs of the same workflow at each compression level on 30x WGS data.
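If you want to eyeball the same level-vs-speed/size trade-off without running the full workflow, a toy sketch along these lines shows the shape quickly. This is plain `java.util.zip.Deflater` on an arbitrary input file, not the benchmark harness used above:

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.Deflater;

public class DeflateLevels {
    public static void main(String[] args) throws Exception {
        // Any reasonably large, somewhat compressible file works as input.
        byte[] input = Files.readAllBytes(Paths.get(args[0]));
        byte[] out = new byte[input.length + 1024];
        for (int level : new int[] {1, 2, 5}) {
            Deflater deflater = new Deflater(level);
            deflater.setInput(input);
            deflater.finish();
            long start = System.nanoTime();
            int compressedSize = 0;
            while (!deflater.finished()) {
                compressedSize += deflater.deflate(out); // reuse the buffer, just count bytes
            }
            long ms = (System.nanoTime() - start) / 1_000_000;
            deflater.end();
            System.out.printf("level %d: %,d -> %,d bytes in %d ms%n",
                    level, input.length, compressedSize, ms);
        }
    }
}
```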
@tfenne Did you do any testing with compression level 2 + the Intel Deflater? @jsotobroad 's latest results suggest that it may offer the best speed/size tradeoff, with the current version of the GKL at least.
@droazen I didn't really look at runtime scientifically, so I can't comment on that. I pasted this table and explanation into the picard issue, but I think it had already been closed at that point, so I'm not sure how many people saw it:
I have some small test BAMs that are constructed by extracting reads overlapping a few hundred kb of genome from WGS samples. I made the table below using such a BAM made from the 1KG PCR-free WGS data from NA19625. Not ideal, but I would think it would perform fairly similarly to a full WGS BAM for compression purposes. What I see is that at compression level 1 the Intel deflater produces a significantly larger BAM than the JDK deflater at level=1.
Compression Level | Intel Deflater File Size | Intel Deflater % of JDK l=5 | JDK Deflater File Size | JDK Deflater % of JDK l=5 |
---|---|---|---|---|
1 | 54,840,445 | 175.23% | 38,543,684 | 123.16% |
2 | 35,782,642 | 114.33% | 36,745,494 | 117.41% |
3 | 34,989,899 | 111.80% | 35,262,326 | 112.67% |
4 | 31,815,698 | 101.66% | 32,549,560 | 104.00% |
5 | 31,240,892 | 99.82% | 31,296,433 | 100.00% |
6 | 30,675,174 | 98.01% | 30,577,906 | 97.70% |
7 | 30,379,699 | 97.07% | 30,380,325 | 97.07% |
8 | 30,124,200 | 96.25% | 30,124,375 | 96.25% |
9 | 30,064,322 | 96.06% | 30,064,325 | 96.06% |
That does seem to suggest that the Intel deflater at level=2 produces a BAM that is smaller than the JDK deflater at either level 2 or level 1, and if it is also faster in your testing, that sounds pretty good for intermediate files. It's still ~15% bigger than a level=5 BAM though, so unless the vast majority of users are switching over to CRAM for storage, I'd hesitate to change the default compression level in any of the toolkits.
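In case it's useful, here is roughly how the two deflaters can be swapped programmatically when writing a BAM. This is only a sketch, assuming htsjdk's static `BlockCompressedOutputStream` hooks and the GKL's `com.intel.gkl.compression.IntelDeflaterFactory`; it is not the exact code used for the table above:

```java
import java.io.File;
import com.intel.gkl.compression.IntelDeflaterFactory; // from the GKL jar
import htsjdk.samtools.SAMFileWriter;
import htsjdk.samtools.SAMFileWriterFactory;
import htsjdk.samtools.SAMRecord;
import htsjdk.samtools.SamReader;
import htsjdk.samtools.SamReaderFactory;
import htsjdk.samtools.util.BlockCompressedOutputStream;

public class RewriteBam {
    public static void main(String[] args) throws Exception {
        // Select the deflater globally before any writer is created; pass a plain
        // htsjdk.samtools.util.zip.DeflaterFactory instead for the JDK deflater.
        BlockCompressedOutputStream.setDefaultDeflaterFactory(new IntelDeflaterFactory());
        BlockCompressedOutputStream.setDefaultCompressionLevel(2);

        try (SamReader reader = SamReaderFactory.makeDefault().open(new File(args[0]));
             SAMFileWriter writer = new SAMFileWriterFactory()
                     .makeBAMWriter(reader.getFileHeader(), true, new File(args[1]))) {
            for (SAMRecord rec : reader) {
                writer.addAlignment(rec); // re-compress every record at the chosen level
            }
        }
    }
}
```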
We have updated to a newer version of the compression engine in the GKL. Could you test with the latest GATK/GKL and report whether the problem still exists?
Look at https://my.pgp-hms.org/public_genetic_data for BAM files, especially if you want data from different sequencing companies.
How to convince people: I agree. I think it is most effective to make people "feel" the difference, e.g. output like "you have been waiting 1324s, or 60% additional processing time, on this step due to compression", or "Processing still hasn't started due to compression/decompression."
GATK4, especially on Spark, hides that pretty well. For example, turning off Spark's lz4 and relying on ZFS's lz4 for the writing of temporary data was instructive about how much CPU was actually spent on it (not that much).
Multi-core compression mostly cuts files into pieces, which can greatly decrease compression if the data is highly repetitive, because each core starts anew on data that the previous one might have reduced to almost nothing (zstd allows some sharing of the dictionary between cores, but most compressors do not, I think).
Example of the dictionary difference: for files with long-distance repetition, compression with xz --lzma2=preset=1,dict=1500M can bring a huge gain in compression while still being much faster than level 9 (which normally has only a 64 MB dictionary).
Compression levels are correlated with dictionary size in most compressors to ensure monotonically increasing memory usage, but that doesn't have to be so. zstd, for example, exposes many parameters to change this, even more than xz.
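As a concrete illustration of the big-dictionary idea, here is a sketch of the xz CLI example above, assuming the XZ for Java library (org.tukaani.xz). The 192 MiB dictionary is an arbitrary example value, smaller than the CLI's dict=1500M:

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.tukaani.xz.LZMA2Options;
import org.tukaani.xz.XZOutputStream;

public class BigDictXz {
    public static void main(String[] args) throws Exception {
        LZMA2Options options = new LZMA2Options();
        options.setPreset(1);           // cheap, fast match finding
        options.setDictSize(192 << 20); // 192 MiB dictionary, far above preset 1's default
        try (FileInputStream in = new FileInputStream(args[0]);
             XZOutputStream out = new XZOutputStream(new FileOutputStream(args[1]), options)) {
            in.transferTo(out); // stream the file through the big-dictionary compressor
        }
    }
}
```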
Based on my experiments, I suspect that quality values gain more from an increased dictionary size, because they are more repetitive than the DNA data.
Shorter BAMs behave differently because they are less repetitive (usually lower coverage), so their compression relies more on CPU-expensive crunching of the "2-bit nature" of the DNA; they might logically suffer more from a lower compression level. It might be instructive to compare the compressed sizes of the raw sequence data of two BAM files with the output of faToTwoBit (.2bit files). 2bit files compress sequence by a factor of about four, which gzip often does not reach (because it doesn't know ahead of time that DNA has only four letters). Use a reference genome FASTA as a proxy for nearly no repetition at all; it doesn't compress much beyond 2bit.
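To make the "2-bit nature" concrete, here is a toy packer in the spirit of faToTwoBit. This is not the real .2bit container format (which also stores N-blocks and masking), just the four-bases-per-byte idea:

```java
public class TwoBitPacker {
    // Pack ACGT into 2 bits per base: the packed size is ~1/4 of the raw sequence,
    // which is the factor-of-four baseline mentioned above.
    public static byte[] pack(String dna) {
        byte[] out = new byte[(dna.length() + 3) / 4];
        for (int i = 0; i < dna.length(); i++) {
            int code;
            switch (Character.toUpperCase(dna.charAt(i))) {
                case 'A': code = 0; break;
                case 'C': code = 1; break;
                case 'G': code = 2; break;
                case 'T': code = 3; break;
                default: throw new IllegalArgumentException("non-ACGT base at " + i);
            }
            out[i / 4] |= code << (2 * (i % 4)); // four 2-bit codes per byte
        }
        return out;
    }

    public static void main(String[] args) {
        String seq = "ACGTACGTGGCCAATT";
        System.out.println(seq.length() + " bases -> " + pack(seq).length + " bytes");
    }
}
```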
Tweaking of the Huffman coding etc. might have influenced compression considerably in this case, by "giving the compressor a subtle hint about the four letters". Paradoxically, Intel might have optimized for average data and thus incurred a disadvantage for the four-letter nature of DNA (and for the few letters used in quality-score encoding, compared to text).
BQSR: when I did interleaving compression experiments, I noticed that the BQSR step decreases compressibility considerably. In that example I had the same BAM file in different versions aligned to hs38DH, hs38, and hs37d5, and could compress them to nearly the size of one by putting similar pieces of the files after one another. Adding the same BAM after BQSR increased the final file size by more than several pre-BQSR versions together. Note: this piecemeal packing is probably useful mostly when many BAMs are packed together, so that similar regions accumulate.
Even faster: in my experience, level 0 (no compression, samtools view -u) increases speed even more if files are on an lz4-compressed filesystem (such as ZFS). The speed-up of lz4 over even level 1 of any gzip-like compression is substantial: another factor of ~6 on compression and ~9 on decompression. With data on SSDs or similarly fast storage, that can make a huge difference.
The resulting decompression speed of ~3 GB/s makes it possible to, e.g., load a 180 GB BAM into a RAM disk in 60 seconds on a sufficiently fast SSD array (e.g. on an AWS EC2 i3.8xlarge instance), and still ~600 MB/s if the RAM disk is also lz4-compressed.
See the benchmark chart at https://github.com/lz4/lz4.
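If you want to check those lz4 speeds on your own data, here is a minimal round-trip sketch, assuming the lz4-java library (net.jpountz.lz4):

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import net.jpountz.lz4.LZ4Compressor;
import net.jpountz.lz4.LZ4Factory;
import net.jpountz.lz4.LZ4FastDecompressor;

public class Lz4RoundTrip {
    public static void main(String[] args) throws Exception {
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        LZ4Factory factory = LZ4Factory.fastestInstance();
        LZ4Compressor compressor = factory.fastCompressor();

        long t0 = System.nanoTime();
        byte[] compressed = compressor.compress(data);
        long t1 = System.nanoTime();

        LZ4FastDecompressor decompressor = factory.fastDecompressor();
        byte[] restored = decompressor.decompress(compressed, data.length);
        long t2 = System.nanoTime();

        // bytes per nanosecond == GB/s
        System.out.printf("compress: %.2f GB/s, decompress: %.2f GB/s, ratio: %.2f%n",
                data.length / (double) (t1 - t0),
                restored.length / (double) (t2 - t1),
                data.length / (double) compressed.length);
    }
}
```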
The original reason for this issue was to investigate why BAM files compressed with GKL level 1 DEFLATE are larger than BAM files compressed with JDK level 5 DEFLATE. Stating the obvious, level 1 compressed files are expected to be larger than level 5 compressed files. However, GKL levels 1 and 2 trade compression ratio for compression speed, so we ran some experiments on data from #4249 to investigate.
The chart below shows compression speed vs. compression ratio for 11 files compressed using GKL (use-jdk-deflater = False) and JDK (use-jdk-deflater = True) at compression levels 1 through 9. The chart shows that BAM files with binned quality scores can be compressed more given more CPU effort (i.e. a higher compression level). The effect is the same for GKL and JDK compression, but the compression-ratio difference is larger for GKL due to the ratio-vs-speed trade-off.
My conclusion is that the GATK default compression settings work well for some use models (e.g. minimizing cloud compute cost). The default settings may not be optimal for all use models, so users are free to tune the compression options to minimize their own cost function.
@gspowley Thank you for the analysis. Could you upload the raw data table here too? It's a bit hard to make direct comparisons from the plot.
Sure, here's the raw data: compression-study.csv.txt
Closing -- we addressed this by migrating to compression level 2 as the default.
We've been showing people the plots below on deflater/inflater performance to argue that using compression level 1 always makes sense, since in our tests it results in only a slightly larger file but much better performance:
These tests were performed by @gspowley on whole-genome bams.
@tfenne reports, however, that in his tests with smaller bams (on the order of 1-2 GB), he is seeing a significantly larger final file with the Intel deflater on compression level 1 than with the JDK deflater on compression level 5. See https://github.com/broadinstitute/picard/issues/879 for the full discussion.
We should investigate to better understand how the gap between final file size on compression level 1 vs. level 5 changes with different kinds of input. We should test with variously-sized files (2 GB bam snippets, 30 GB exomes, and ~200 GB genomes), and record final compressed size, time to write the compressed file, and time to read the compressed file on both levels 1 and 5. We should also be sure to test with some whole-genome bams other than the ones used in the initial evaluation, to make sure that the deflater has not been hyper-optimized for just certain specific samples/data.
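Something along these lines could serve as the write-side harness. This is only a sketch, assuming htsjdk's SAMFileWriterFactory.setCompressionLevel, with placeholder output naming; the read-side timing would be analogous:

```java
import java.io.File;
import htsjdk.samtools.SAMFileWriter;
import htsjdk.samtools.SAMFileWriterFactory;
import htsjdk.samtools.SAMRecord;
import htsjdk.samtools.SamReader;
import htsjdk.samtools.SamReaderFactory;

public class CompressionLevelTimer {
    public static void main(String[] args) throws Exception {
        File in = new File(args[0]);
        for (int level : new int[] {1, 5}) {
            File out = new File(args[0] + ".level" + level + ".bam"); // placeholder name
            long start = System.nanoTime();
            try (SamReader reader = SamReaderFactory.makeDefault().open(in);
                 SAMFileWriter writer = new SAMFileWriterFactory()
                         .setCompressionLevel(level)
                         .makeBAMWriter(reader.getFileHeader(), true, out)) {
                for (SAMRecord rec : reader) writer.addAlignment(rec);
            }
            long seconds = (System.nanoTime() - start) / 1_000_000_000L;
            System.out.printf("level %d: %,d bytes written in %d s%n",
                    level, out.length(), seconds);
        }
    }
}
```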