Context

68#discussion_r1640316804 by @motjuste, the new image is about 10x larger than the previous one (~10G vs ~1G), which could create issues in our testing CI. As @motjuste pointed out, we have already hit resource exhaustion with the current limits set in our testing of Kubeflow on AWS. Downloading such large images from DockerHub also risks getting rate-limited, and caching them in our registry cache comes at a cost. Although Jupyter notebooks with CUDA are expected to be of similar size in real-world scenarios, using such a large image for testing is overkill. We therefore need to spend some time investigating the option of rebuilding the image.

Technical

The size increase is probably because this image contains CUDA, which is not needed for the testing that we do. The upstream Dockerfile uses the NVIDIA image as its base image, which was introduced in this PR. The previously used image may have been built from that Dockerfile.

What needs to get done

Investigate whether we can rebuild the image with a smaller base image.

Definition of Done

There is either a proposed solution for building a smaller image, or the smaller image itself.
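
To make "smaller" checkable, a candidate image could be compared against the current one with a small script. The following is only a sketch: it assumes the `docker` CLI is installed and both images are already pulled locally, and the helper names and the size budget are hypothetical, not something agreed on in this issue.

```python
import subprocess


def image_size_bytes(image: str) -> int:
    """Return the local size of a Docker image in bytes.

    Assumes the docker CLI is available and the image is present locally.
    """
    result = subprocess.run(
        ["docker", "image", "inspect", "--format", "{{.Size}}", image],
        check=True,
        capture_output=True,
        text=True,
    )
    return int(result.stdout.strip())


def size_ratio(candidate_bytes: int, baseline_bytes: int) -> float:
    """Return how many times larger the candidate is than the baseline."""
    if baseline_bytes <= 0:
        raise ValueError("baseline size must be positive")
    return candidate_bytes / baseline_bytes


def fits_budget(size_bytes: int, budget_gib: float) -> bool:
    """Check a size against a budget in GiB (the budget is a hypothetical CI limit)."""
    return size_bytes <= budget_gib * 1024**3
```

For example, `size_ratio(image_size_bytes(new_image), image_size_bytes(old_image))` would give the growth factor between the two images, where `new_image` and `old_image` are placeholders for the actual image references.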

canonical / charmed-kubeflow-uats

training-operator: Investigate rebuilding the pytorch example image #69
