foundation-model-stack / fms-hf-tuning

🚀 Collection of tuning recipes with HuggingFace SFTTrainer and PyTorch FSDP.
Apache License 2.0

bug: Pod runs out of ephemeral storage (disk) space because of the temporary directory. #217

Closed: HarikrishnanBalagopal closed this issue 1 month ago

HarikrishnanBalagopal commented 3 months ago

Overview

https://github.com/foundation-model-stack/fms-hf-tuning/blob/09496999edbd02d656ae2fef778b30c137afc433/build/accelerate_launch.py#L93

When the image runs in a K8s/OpenShift Pod, the intermediate checkpoints are written to a temporary folder on the Pod's ephemeral storage, so the Pod eventually runs out of disk space.
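
To confirm that the temporary folder is what fills up, it can help to log which directory Python treats as the temp location and how much space is left on that filesystem. The snippet below is a minimal diagnostic sketch, not code from the repository; the helper name `temp_dir_free_gib` is hypothetical.

```python
import shutil
import tempfile


def temp_dir_free_gib(path=None):
    """Return (directory, free space in GiB) for a temp directory.

    Hypothetical helper for illustration. If no path is given, it uses
    the directory that Python's tempfile module would pick by default.
    """
    directory = path or tempfile.gettempdir()
    usage = shutil.disk_usage(directory)  # named tuple: total, used, free (bytes)
    return directory, usage.free / 2**30


if __name__ == "__main__":
    directory, free_gib = temp_dir_free_gib()
    print(f"temp dir: {directory}, free: {free_gib:.1f} GiB")
```

Running this inside the Pod before and during training should show the free space on the temp filesystem shrinking as checkpoints accumulate.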

Steps

  1. Start a Pod with the https://github.com/foundation-model-stack/fms-hf-tuning/tree/main/build image and a reasonable amount of disk space. Example: ephemeralStorage: "20Gi"
  2. Let training run for 10 epochs with the https://huggingface.co/ibm-granite/granite-7b-base model and the Twitter complaints dataset https://www.kaggle.com/datasets/thoughtvector/customer-support-on-twitter/data
  3. At around epoch 7, training crashes with a "cannot write to file/disk space" error.

Expected Behaviour

We should be able to run the training without running out of space.

Proposed Fix

Let the user configure the temporary folder. That way we can point it at a folder inside a PVC, so the intermediate checkpoints no longer consume the Pod's ephemeral storage.
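
One way the proposed fix could look is sketched below. This is only an illustration and not the change that was merged: the environment variable `CHECKPOINT_TMP_DIR` and the helper `checkpoint_scratch_dir` are hypothetical names, and the sketch assumes the launch script creates its scratch space through Python's `tempfile` module.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def checkpoint_scratch_dir(env_var="CHECKPOINT_TMP_DIR"):
    """Yield a scratch directory, created under $CHECKPOINT_TMP_DIR if set.

    CHECKPOINT_TMP_DIR is a hypothetical variable name for this sketch.
    When it is unset or empty, tempfile falls back to its default
    location (TMPDIR, TEMP, TMP, then platform defaults such as /tmp).
    """
    base = os.environ.get(env_var) or None
    if base is not None:
        # e.g. a path on a mounted PVC; create it if it does not exist yet
        os.makedirs(base, exist_ok=True)
    # The directory and its contents are removed when the block exits.
    with tempfile.TemporaryDirectory(dir=base) as scratch:
        yield scratch


if __name__ == "__main__":
    with checkpoint_scratch_dir() as scratch:
        print(f"intermediate checkpoints would be written to: {scratch}")
```

Because Python's `tempfile` module already honors the `TMPDIR` environment variable, setting `TMPDIR` in the Pod spec to a path on the PVC mount may achieve the same effect without a code change, provided the script relies on `tempfile` defaults.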

anhuong commented 1 month ago

PR that addresses issue merged in