NationalGenomicsInfrastructure / piper

A genomics pipeline built on top of the GATK Queue framework

Lower compression levels for output files #11

Closed johandahlberg closed 10 years ago

johandahlberg commented 10 years ago

I ran a small experiment to test how the compression level of Picard MarkDuplicates affects run time. It turns out that raising the compression level above 6 has very little effect on file size, while the run time doubles between compression levels 6 and 9.

This test was run on a single node, one compression level after another, which means the results might differ if all cores on the node were active and saturating the write bandwidth (which I'm not sure was the case during the test).

[Image: plot of run time and file size by compression level]

Of course this is not a very scientific experiment, but I think that it makes a pretty good case for not setting the compression level above 6.
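For anyone who wants to reproduce the general shape of this trade-off without a full pipeline run, here is a minimal sketch. It is not the benchmark used above: it times Python's `zlib` (DEFLATE, the same algorithm family that BAM's BGZF blocks use) on synthetic read-like text, so the absolute numbers will not match Picard's. The helper names `make_sample` and `benchmark` are made up for illustration.

```python
import random
import time
import zlib


def make_sample(n_records=20000, seed=0):
    """Build synthetic read-like text as a rough stand-in for real data."""
    rng = random.Random(seed)
    lines = []
    for i in range(n_records):
        seq = "".join(rng.choice("ACGT") for _ in range(50))
        lines.append(f"read{i}\t{seq}\n")
    return "".join(lines).encode("ascii")


def benchmark(data, levels=range(1, 10)):
    """Return a list of (level, seconds, compressed_size) tuples."""
    results = []
    for level in levels:
        start = time.perf_counter()
        compressed = zlib.compress(data, level)
        elapsed = time.perf_counter() - start
        # Sanity check: the round trip must reproduce the input.
        assert zlib.decompress(compressed) == data
        results.append((level, elapsed, len(compressed)))
    return results


if __name__ == "__main__":
    data = make_sample()
    for level, seconds, size in benchmark(data):
        ratio = size / len(data)
        print(f"level={level}  time={seconds:.3f}s  size={ratio:.1%} of input")
```

Real BAM files compress differently from this synthetic input, so treat the output only as a rough indication of where the extra time stops paying off in size.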

@vezzi this is also the reason why you've observed much longer run times from piper in the latest versions. Looking at the plot above, would you agree that 6 is a good compromise between speed and file size?

vezzi commented 10 years ago

I would argue that 4 and 5 are also really good compromises: they do not affect the disk space much, but they strongly affect the run time.

F

(In reply to Johan Dahlberg's comment of 15 Jul 2014, 13:39.)

johandahlberg commented 10 years ago

I'll lower the compression levels to 5 and then we can benchmark again - I think it should bring down the full run time of the pipeline to the previous 3 days.
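As a rough illustration of what the change amounts to at the Picard level, the sketch below builds a MarkDuplicates command line with the standard Picard `COMPRESSION_LEVEL` option set to 5. This is not how Piper itself passes the setting (Piper goes through GATK Queue); the function name, the jar path and the file names are hypothetical placeholders, and the legacy `KEY=value` argument syntax is assumed.

```python
def mark_duplicates_command(input_bam, output_bam, metrics_file,
                            compression_level=5,
                            jar="MarkDuplicates.jar"):
    """Build (but do not run) a Picard MarkDuplicates command line.

    The jar path and file names are placeholders; adjust to your install.
    """
    if not 0 <= compression_level <= 9:
        raise ValueError("compression_level must be between 0 and 9")
    return [
        "java", "-jar", jar,
        f"INPUT={input_bam}",
        f"OUTPUT={output_bam}",
        f"METRICS_FILE={metrics_file}",
        # Standard Picard option controlling BAM output compression.
        f"COMPRESSION_LEVEL={compression_level}",
    ]


if __name__ == "__main__":
    cmd = mark_duplicates_command("sample.bam", "sample.dedup.bam",
                                  "sample.metrics.txt")
    print(" ".join(cmd))
```

Returning the argument list rather than running it makes it easy to check the setting before handing it to `subprocess.run` or a scheduler.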