Context
#68 introduced a new image for the PyTorch job. As @motjuste raised in https://github.com/canonical/charmed-kubeflow-uats/pull/68#discussion_r1640316804, the new image is 10x larger than the previous one (~10G vs ~1G), which could cause problems in our testing CI. As @motjuste pointed out, we have already hit resource exhaustion with the current limits set in our testing of Kubeflow on AWS. Downloading images this large from DockerHub can also get rate-limited, and caching them in our registry cache comes with a cost. Although Jupyter notebooks with CUDA are expected to be of a similar size in real-world scenarios, using such a large image for testing is overkill. We therefore need to spend some time investigating the option of rebuilding the image.
Technical
The size increase is probably because this image contains CUDA, which is not needed for the testing that we do. The upstream Dockerfile uses the NVIDIA image as its base image, which was introduced in this PR. The previously used image may have been built from that Dockerfile.
What needs to get done
Investigate whether we can rebuild the image with a smaller base image.
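To make the investigation measurable, a candidate rebuild can be compared against the current image by size. The following is only a minimal sketch: the helper names `image_size_bytes` and `size_ratio` are hypothetical, and it assumes both images have already been pulled locally and that the `docker` CLI is available. Note that `docker image inspect` reports the size on disk, which is larger than the compressed size actually downloaded from the registry.

```python
import json
import subprocess


def image_size_bytes(image: str) -> int:
    """Return the on-disk size in bytes of a locally available image.

    Hypothetical helper: reads the `Size` field from the JSON printed by
    `docker image inspect`.
    """
    result = subprocess.run(
        ["docker", "image", "inspect", image],
        check=True,
        capture_output=True,
        text=True,
    )
    # `docker image inspect` prints a JSON array with one entry per image.
    return int(json.loads(result.stdout)[0]["Size"])


def size_ratio(candidate_bytes: int, baseline_bytes: int) -> float:
    """Return how many times larger the candidate is than the baseline."""
    if baseline_bytes <= 0:
        raise ValueError("baseline_bytes must be positive")
    return candidate_bytes / baseline_bytes
```

For example, `size_ratio(image_size_bytes(new_image), image_size_bytes(old_image))` would give the growth factor between the two images, where `new_image` and `old_image` are placeholders for the actual image references.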
Definition of Done
There is a proposed solution for producing a smaller image, or the smaller image itself is available.