broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

GATK Reduced Docker Layers for ACR #8808

Closed kevinpalis closed 3 weeks ago

kevinpalis commented 1 month ago

As part of my work on the Pipeline Dev team, I created two GATK images to address the issue discussed here (i.e., the image has so many Docker layers that we hit ACR limits very quickly). The images are in terrapublic, a premium-tier ACR that is publicly accessible. One image is squashed to a single layer; the other is reduced to 12 layers (from the original 45). With these changes, and because terrapublic is on the premium tier, the maximum number of Docker pulls per minute becomes 833 (i.e., 10k readOps / 12 layers) for the reduced-layers image and 10,000 for the squashed one. We have yet to test these in our pipelines, but I expect the squashed version to be slower since it won’t be able to take advantage of parallel pulls or layer caching. Hence the two versions, so pipeline devs can decide which one is better for their use case.
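The pulls-per-minute figures above come from dividing the registry's per-minute read budget by the number of layers fetched per pull. A minimal sketch of that arithmetic, assuming (as the comment does) that each layer costs exactly one readOp and that manifest requests are not counted; the function name is hypothetical:

```python
def max_pulls_per_minute(read_ops_per_minute: int, layers: int) -> int:
    """Upper bound on full image pulls per minute, assuming one readOp per layer."""
    if layers < 1:
        raise ValueError("an image has at least one layer")
    return read_ops_per_minute // layers


PREMIUM_READ_OPS = 10_000  # per-minute readOps limit quoted in this thread

print(max_pulls_per_minute(PREMIUM_READ_OPS, 12))  # reduced-layers image: 833
print(max_pulls_per_minute(PREMIUM_READ_OPS, 1))   # squashed image: 10000
```

This is only an upper bound: it says nothing about how long each pull takes, which is where the squashed image may lose out.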

kevinpalis commented 3 weeks ago

> The new changes to the base image Dockerfile look good to me, @kevinpalis! Can you tell us how many layers we have total after these changes? Is there any value in pursuing a full squash, or do you think that with this patch most users' issues will be resolved?

@droazen, the total number of layers is now down to 16 (from 44). I honestly don't see the value in doing a full squash, mainly because if we host this in a premium ACR, the limit is 10,000 readOps per minute, so with 16 layers you get around 625 pulls per minute. This image can also still take advantage of parallel pulls (the default is 3, but at most 16 threads in this case, I believe), whereas one big layer will not download in parallel. A single layer could be a lot slower, so subsequent jobs would fall into the same "minute" because earlier pulls have not finished, making it easier to hit that 10k readOps limit. Lastly, with a full squash, people using GATK outside data pipelines would not be able to take advantage of layer caching either.
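For anyone who wants to check the layer count on their own machine, here is a rough sketch. It assumes the image has already been pulled locally and that the JSON printed by `docker image inspect` is passed in as a string. The helper name and the sample data are hypothetical, and the count of filesystem layers reported locally may not match exactly what the registry meters as readOps.

```python
import json


def count_layers(inspect_json: str) -> int:
    """Count filesystem layers in the JSON printed by `docker image inspect`."""
    images = json.loads(inspect_json)  # the command prints a JSON array
    return len(images[0]["RootFS"]["Layers"])


# Hypothetical, heavily truncated inspect output, for illustration only.
sample = (
    '[{"RootFS": {"Type": "layers", '
    '"Layers": ["sha256:aaa", "sha256:bbb", "sha256:ccc"]}}]'
)
print(count_layers(sample))  # 3
```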