After taking a look through the sagemaker-spark-processing:3.1-cpu-py37-v1.1 image using the dive tool, I noticed that the cache files for installations were not getting cleaned up, leading to an unnecessary increase in image size.
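In particular, the cleanup on line 13 has no effect, because the layer that installs the yum packages on line 6 is already immutable at that point: deleting files in a later layer only hides them and does not shrink the layer that wrote them. As a result, around 30-40% of the image size is taken up by caches.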
When the cleanup is added at the end of the same layer definition that performs the installation, it actually takes effect and significantly reduces the size of the image.
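To illustrate the difference, here is a minimal sketch of the two patterns as separate build stages. The base image, stage names, and package list are assumptions chosen for illustration; they are not copied from the actual Dockerfile in this repository.

```dockerfile
# Illustrative only: base image, stage names, and packages are assumptions,
# not taken from the real sagemaker-spark-processing Dockerfile.

# Variant 1: cleanup in a separate, later layer (ineffective).
FROM amazonlinux:2 AS cleanup_too_late
RUN yum install -y python3 tar gzip
# This RUN creates a new layer that only marks the cache files as deleted;
# the bytes are still stored in the install layer above.
RUN yum clean all && rm -rf /var/cache/yum

# Variant 2: cleanup inside the layer that installs the packages (effective).
FROM amazonlinux:2 AS cleanup_in_layer
# The cache is removed before the layer is committed, so it never ends up
# in the image.
RUN yum install -y python3 tar gzip \
    && yum clean all \
    && rm -rf /var/cache/yum
```

Building each stage with `docker build --target <stage-name>` and inspecting the results with dive or `docker history` should show whether the cache ended up in the install layer.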
By merging a couple of layers and doing the cleanup, we are able to shrink the image size from 4.4GB to 2.5GB, which should lead to faster container spin-up. The changes are available in my fork. If the maintainers agree, I can open a PR with these changes for the Dockerfiles that could benefit.