Status: Closed (HarikrishnanBalagopal closed this issue 1 month ago)
PR that addresses issue merged in
Overview
https://github.com/foundation-model-stack/fms-hf-tuning/blob/09496999edbd02d656ae2fef778b30c137afc433/build/accelerate_launch.py#L93
When the image runs in a K8s/OpenShift Pod, the Pod runs out of disk space because the intermediate checkpoints are written to a temporary folder.
Steps
Run the training image in a Pod configured with:
ephemeralStorage: "20Gi"
Expected Behaviour
Training should run to completion without the Pod running out of disk space.
Proposed Fix
Let the user configure the temporary folder. It can then point to a folder inside a PVC, so the intermediate checkpoints no longer fill up the Pod's own disk.
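A minimal sketch of what a configurable temporary folder could look like, assuming the launch script creates its scratch directory with Python's `tempfile` module. The environment variable name `TUNING_TEMP_DIR` and the helper `make_checkpoint_tempdir` are hypothetical, chosen only for illustration; the merged fix may use a different mechanism.

```python
import os
import tempfile

# Hypothetical environment variable name, used here for illustration only.
TEMP_DIR_ENV = "TUNING_TEMP_DIR"


def make_checkpoint_tempdir(env=os.environ):
    """Create a temporary directory under a user-configured parent, if set.

    When TUNING_TEMP_DIR is unset or empty, tempfile falls back to its
    default location (TMPDIR, TEMP, TMP, then platform defaults).
    """
    parent = env.get(TEMP_DIR_ENV) or None  # e.g. a path on a mounted PVC
    if parent is not None:
        # Make sure the configured parent exists before using it.
        os.makedirs(parent, exist_ok=True)
    return tempfile.TemporaryDirectory(dir=parent)


if __name__ == "__main__":
    # The directory and its contents are removed when the block exits.
    with make_checkpoint_tempdir() as tmp:
        print("intermediate checkpoints would be written to", tmp)
```

With the variable set to a path on a mounted PVC, the intermediate checkpoints would land on the volume instead of the Pod's own disk. Python's `tempfile` also honors the standard `TMPDIR` environment variable, so setting it on the Pod may help wherever the code relies on the default temporary location.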