foundation-model-stack / fms-hf-tuning

🚀 Collection of tuning recipes with HuggingFace SFTTrainer and PyTorch FSDP.
Apache License 2.0

Performance regression in fms-hf-tuning image #201

Open · kpouget opened this issue 1 week ago

kpouget commented 1 week ago

Describe the bug

As part of our performance evaluation of the fms-hf-tuning fine-tuning stack, we observed a performance regression between the images we tested.

Platform

OpenShift AI

Steps to reproduce

  1. Create these 3 objects in a namespace,
  2. See the Pod being executed with the quay.io/modh/fms-hf-tuning:release image, which currently hits the regression
  3. When the Pod finishes, create an interactive Pod with root privileges (to be able to install the new package)
    oc debug fine-tuning-master-0 --as-root
  4. Prepare the environment by running the script a first time:
    $ bash /mnt/entrypoint/entrypoint.sh
    + grep transformers
    transformers==4.41.2
    ...
    {'train_runtime': 146.0027, 'train_samples_per_second': 35.616, 'train_steps_per_second': 2.226, 'train_tokens_per_second': 6719.494, 'train_loss': 2.377412109375, 'epoch': 1.0}
  5. Install the 4.40.2 version of transformers and tell Python to use it:
    export HOME=/tmp
    pip install transformers==4.40.2 --user
    export PYTHONPATH=/tmp/.local/lib/python3.11/site-packages/:$PYTHONPATH
  6. Run the script again, now with the 4.40.2 version of transformers:
    $ bash /mnt/entrypoint/entrypoint.sh
    + grep transformers
    transformers==4.40.2
    ...
    {'train_runtime': 88.4812, 'train_samples_per_second': 58.77, 'train_steps_per_second': 3.673, 'train_tokens_per_second': 11087.825, 'train_loss': 2.377412109375, 'epoch': 1.0}
  7. Compare train_runtime, train_samples_per_second, etc., and notice the difference.
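For reference, a minimal sketch of how the two runs could be compared numerically, assuming each run's console output was saved to a log file (the file names below are illustrative, not part of the entrypoint):

    # Extract train_runtime from each saved log (hypothetical file names) and compute the ratio.
    rt_new=$(grep -o "'train_runtime': [0-9.]*" run-transformers-4.41.2.log | awk '{print $2}')
    rt_old=$(grep -o "'train_runtime': [0-9.]*" run-transformers-4.40.2.log | awk '{print $2}')
    awk -v new="$rt_new" -v old="$rt_old" 'BEGIN { printf "train_runtime ratio (4.41.2 / 4.40.2): %.2fx\n", new/old }'

With the numbers reported above, that ratio comes out to roughly 1.65x (146.0 s vs 88.5 s).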

Note that:

Expected behavior

[image attached]

Observed behavior

[image attached]

See also

RHOAIENG-8551

alex-jw-brooks commented 3 days ago

Hey @kpouget, thanks for raising this issue and for providing the kube configs, it was super helpful! I did some investigation in the quay.io/modh/fms-hf-tuning:release image and had some findings that were similar, and some that were a bit surprising (using the same model, bigcode/gpt_bigcode-santacoder).

The short story is that I was able to reproduce your findings for 4.41.2 being slower using the same dataset & 0.1 for dataset replication. However, it also seems like the speed of the tuning is inconsistent. For my tunings, I was generally seeing:

4.40.2 - most trainings took 500-600 seconds, and generally sat around 550.

4.41.2 - some trainings took ~700+ seconds. However, some ran way faster (100-150 seconds), even though all trainings here used the same config and produced the same loss. E.g.,

{'train_runtime': 692.3355, 'train_samples_per_second': 7.511, 'train_steps_per_second': 0.469, 'train_tokens_per_second': 1417.036, 'train_loss': 2.377412109375, 'epoch': 1.0}

{'train_runtime': 109.6998, 'train_samples_per_second': 47.402, 'train_steps_per_second': 2.963, 'train_tokens_per_second': 8943.169, 'train_loss': 2.377412109375, 'epoch': 1.0}

In all experiments that I ran, the tuning was running in the same pod and doing the data formatting etc. with the same mounts that the pods spawned by the custom resource use. I'm curious whether there is some caching at work here, but it is strange. I should also mention that the time per step seems pretty consistent in the progress bar outputs once things have actually started.

I did some investigation into the concurrency issues as well and was not able to produce any issues there with either version of transformers. I did see the same odd behavior from 4.41.2 where some runs went faster than others, but it didn't seem to depend on jobs running in the cluster concurrently, as I saw both slow and fast runs with 4.41.2 when just one job was running.

After discussing with @anhuong / @Ssukriti, we've decided to pin transformers to 4.40 until the reasons for these inconsistent timings have been more thoroughly investigated.
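For illustration, a pin like the one described could be expressed at install time as follows (a sketch only; the actual pin would land in the project's dependency specification rather than an ad-hoc pip call):

    # Constrain transformers below 4.41 while the timing inconsistency is investigated.
    pip install "transformers>=4.40.2,<4.41"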

Ssukriti commented 2 days ago

Thank you, Alex! Michael's team will be investigating further to understand the impact over a larger set of tests with Granite models.

There was a request to get the full set of dependencies in the latest RH image, to verify other dependencies besides transformers.

The RH image mentioned above is quay.io/modh/fms-hf-tuning:84b0337b7baee119e909d4e901b6dadfe34c1f9a -> @anhuong, could you pull out the dependency list from this image and attach it here? Alternatively, you could use any of our internal images that correspond to the 0.3 release and get the list as well. Thank you!
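One way such a list can be pulled out of the image (a sketch, assuming podman is available; docker works the same way):

    # Override the entrypoint to run pip freeze inside the image and save the output.
    podman run --rm --entrypoint pip \
        quay.io/modh/fms-hf-tuning:84b0337b7baee119e909d4e901b6dadfe34c1f9a freeze > deps.txt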

anhuong commented 2 days ago

quay.io/modh/fms-hf-tuning:84b0337b7baee119e909d4e901b6dadfe34c1f9a was built a month ago, on May 31st. pip freeze dependencies:

accelerate==0.30.1
aiohttp==3.9.5
aiosignal==1.3.1
attrs==23.2.0
certifi==2024.2.2
charset-normalizer==3.3.2
datasets==2.19.1
dill==0.3.8
docstring_parser==0.16
einops==0.8.0
filelock==3.14.0
fire==0.6.0
flash-attn==2.5.9.post1
fms-hf-tuning @ file:///tmp/fms_hf_tuning-0.1.dev1%2Bg84b0337.d20240531-py3-none-any.whl#sha256=fbb236a382849683a7fd49fd9e341440652b08eaf0353edee052952de7dab696
frozenlist==1.4.1
fsspec==2024.3.1
huggingface-hub==0.23.2
idna==3.7
Jinja2==3.1.4
markdown-it-py==3.0.0
MarkupSafe==2.1.5
mdurl==0.1.2
mpmath==1.3.0
multidict==6.0.5
multiprocess==0.70.16
networkx==3.3
numpy==1.26.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.5.40
nvidia-nvtx-cu12==12.1.105
packaging==24.0
pandas==2.2.2
peft==0.11.1
psutil==5.9.8
pyarrow==16.1.0
pyarrow-hotfix==0.6
Pygments==2.18.0
python-dateutil==2.9.0.post0
pytz==2024.1
PyYAML==6.0.1
regex==2024.5.15
requests==2.32.3
rich==13.7.1
safetensors==0.4.3
sentencepiece==0.2.0
shtab==1.7.1
simpleeval==0.9.13
six==1.16.0
sympy==1.12.1
termcolor==2.4.0
tokenizers==0.19.1
torch==2.3.0
tqdm==4.66.4
transformers==4.41.2
triton==2.3.0
trl==0.8.6
typing_extensions==4.12.0
tyro==0.8.4
tzdata==2024.1
urllib3==2.2.1
xxhash==3.4.1
yarl==1.9.4