canonical / charmed-kubeflow-uats

Automated UATs for Charmed Kubeflow
Apache License 2.0

training-operator: Investigate rebuilding the pytorch example image #69

Closed orfeas-k closed 3 months ago

orfeas-k commented 3 months ago

Context

#68 introduced a new image for the PyTorch job. As @motjuste raised in https://github.com/canonical/charmed-kubeflow-uats/pull/68#discussion_r1640316804, the new image is about 10x larger than the previous one (~10G vs ~1G), which could cause problems in our testing CI. As @motjuste pointed out to me, we have already hit resource exhaustion with the current limits in our testing of Kubeflow on AWS, downloading images this large from DockerHub can get us rate-limited, and caching them in our registry cache comes with a cost. Although Jupyter notebooks with CUDA are expected to be of similar size in real-world scenarios, using such a large image for testing is overkill. We therefore need to spend some time investigating the option of rebuilding the image.

Technical

The size increase is probably caused by the fact that this image contains CUDA, which is not needed for the testing that we do. The upstream Dockerfile uses the NVIDIA image as its base image, which was introduced in this PR. The previous image may have been built from that Dockerfile.

What needs to get done

Investigate if we can rebuild the image with a smaller base image.
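
As a starting point for that investigation, here is a minimal sketch of how the base image in a Dockerfile could be swapped programmatically before rebuilding. Everything in it is an assumption for illustration: the helper name `swap_base_image`, the stand-in Dockerfile text, and the choice of `python:3.11-slim` as a smaller base are hypothetical and not taken from the upstream repository. A CPU-only base would also need PyTorch installed explicitly, which this sketch does not cover.

```python
import re

# Matches a Dockerfile FROM instruction, capturing optional flags
# (e.g. --platform=...), the image reference, and an optional "AS <stage>".
FROM_RE = re.compile(
    r"^(?P<head>\s*FROM\s+(?:--\S+\s+)*)(?P<image>\S+)(?P<tail>(?:\s+AS\s+\S+)?\s*)$",
    re.IGNORECASE,
)


def swap_base_image(dockerfile_text: str, new_base: str) -> str:
    """Return the Dockerfile text with the image of the first FROM replaced.

    Hypothetical helper: only the first FROM instruction is rewritten,
    so later stages of a multi-stage build are left untouched.
    """
    lines = dockerfile_text.splitlines()
    for i, line in enumerate(lines):
        match = FROM_RE.match(line)
        if match:
            lines[i] = f"{match['head']}{new_base}{match['tail']}"
            return "\n".join(lines) + "\n"
    raise ValueError("no FROM instruction found")


if __name__ == "__main__":
    # Stand-in for the upstream Dockerfile; the real contents are not
    # reproduced here, and the image name below is a placeholder.
    upstream = "FROM example.invalid/cuda-base:tag\nCOPY mnist.py /opt/mnist.py\n"
    print(swap_base_image(upstream, "python:3.11-slim"))
```

The rewritten Dockerfile could then be built locally and its size compared against the current image to see whether the reduction is worth maintaining our own build.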

Definition of Done

There is a proposed solution for an image of a smaller size or the image itself.

syncronize-issues-to-jira[bot] commented 3 months ago

Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5882.

This message was autogenerated