foundation-model-stack / fms-hf-tuning

🚀 Collection of tuning recipes with HuggingFace SFTTrainer and PyTorch FSDP.
Apache License 2.0

Performance regression in fms-hf-tuning image #201

Open · kpouget opened this issue 1 week ago

kpouget commented 1 week ago

Describe the bug

As part of our performance evaluation of the fms-hf-tuning fine-tuning stack, we observed a performance regression between the images we tested.

Platform

OpenShift AI

Steps to reproduce

  1. Create these 3 objects in a namespace,
  2. See the Pod being executed with the quay.io/modh/fms-hf-tuning:release image, which currently hits the regression
  3. When the Pod finishes, create an interactive Pod with root privileges (to be able to install the new package)
    oc debug fine-tuning-master-0 --as-root
  4. Prepare the environment by running the script a first time:
    $ bash /mnt/entrypoint/entrypoint.sh
    + grep transformers
    transformers==4.41.2
    ...
    {'train_runtime': 146.0027, 'train_samples_per_second': 35.616, 'train_steps_per_second': 2.226, 'train_tokens_per_second': 6719.494, 'train_loss': 2.377412109375, 'epoch': 1.0}
  5. Install the 4.40.2 version of transformers and tell Python to use it:
    export HOME=/tmp
    pip install transformers==4.40.2 --user
    export PYTHONPATH=/tmp/.local/lib/python3.11/site-packages/:$PYTHONPATH
  6. Run the script again, now with the 4.40.2 version of transformers:
    $ bash /mnt/entrypoint/entrypoint.sh
    + grep transformers
    transformers==4.40.2
    ...
    {'train_runtime': 88.4812, 'train_samples_per_second': 58.77, 'train_steps_per_second': 3.673, 'train_tokens_per_second': 11087.825, 'train_loss': 2.377412109375, 'epoch': 1.0}
  7. Compare train_runtime, train_samples_per_second, etc., and notice the difference.
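For reference, a minimal sketch of how the two runs could be compared numerically, assuming each run's console output was saved to a log file (the file names below are illustrative, not part of the entrypoint):

    # Extract train_runtime from each saved log (hypothetical file names) and compute the ratio.
    rt_new=$(grep -o "'train_runtime': [0-9.]*" run-transformers-4.41.2.log | awk '{print $2}')
    rt_old=$(grep -o "'train_runtime': [0-9.]*" run-transformers-4.40.2.log | awk '{print $2}')
    awk -v new="$rt_new" -v old="$rt_old" 'BEGIN { printf "train_runtime ratio (4.41.2 / 4.40.2): %.2fx\n", new/old }'

With the numbers reported above, that ratio comes out to roughly 1.65x (146.0 s vs 88.5 s).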

Note that:

Expected behavior

[image attached]

Observed behavior

[image attached]

See also

RHOAIENG-8551

alex-jw-brooks commented 3 days ago

Hey @kpouget, thanks for raising this issue and for providing the kube configs, it was super helpful! I did some investigation in the quay.io/modh/fms-hf-tuning:release image and had some findings that were similar, and some that were a bit surprising (using the same model, bigcode/gpt_bigcode-santacoder).

The short story is that I was able to reproduce your findings for 4.41.2 being slower using the same dataset & 0.1 for dataset replication. However, it also seems like the speed of the tuning is inconsistent. For my tunings, I was generally seeing:

4.40.2 - most trainings took 500-600 seconds, and generally sat around 550.

4.41.2 - some trainings took ~700+ seconds. However, some ran way faster (100-150 seconds), even though all trainings here used the same config and produced the same loss. E.g.,

{'train_runtime': 692.3355, 'train_samples_per_second': 7.511, 'train_steps_per_second': 0.469, 'train_tokens_per_second': 1417.036, 'train_loss': 2.377412109375, 'epoch': 1.0}

{'train_runtime': 109.6998, 'train_samples_per_second': 47.402, 'train_steps_per_second': 2.963, 'train_tokens_per_second': 8943.169, 'train_loss': 2.377412109375, 'epoch': 1.0}

In all experiments that I ran, the tuning was running in the same pod and doing the data formatting etc. with the same mounts that the pods spawned by the custom resource use. I'm curious whether there is some caching at work here, but it is strange. I should also mention that the time per step seems pretty consistent in the progress bar outputs once things have actually started.

I did some investigation into the concurrency issues as well and was not able to produce any issues there with either version of transformers. I did see the same odd behavior from 4.41.2 where some runs went faster than others, but it didn't seem to depend on jobs running in the cluster concurrently, as I saw both slow and fast runs with 4.41.2 when just one job was running.

After discussing with @anhuong / @Ssukriti, we've decided to pin transformers to 4.40 until the reasons for these inconsistent timings have been more thoroughly investigated.
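For illustration, a pin like the one described could be expressed at install time as follows (a sketch only; the actual pin would land in the project's dependency specification rather than an ad-hoc pip call):

    # Constrain transformers below 4.41 while the timing inconsistency is investigated.
    pip install "transformers>=4.40.2,<4.41"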

Ssukriti commented 2 days ago

Thank you, Alex! Michael's team will be investigating further to understand the impact over a larger set of tests with Granite models.

There was a request to get the full set of dependencies in the latest RH image, to verify other dependencies besides transformers.

The RH image mentioned above is quay.io/modh/fms-hf-tuning:84b0337b7baee119e909d4e901b6dadfe34c1f9a -> @anhuong, could you pull out the dependency list from this image and attach it here? Alternatively, you could use any of our internal images that correspond to the 0.3 release and get the list as well. Thank you!
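One way such a list can be pulled out of the image (a sketch, assuming podman is available; docker works the same way):

    # Override the entrypoint to run pip freeze inside the image and save the output.
    podman run --rm --entrypoint pip \
        quay.io/modh/fms-hf-tuning:84b0337b7baee119e909d4e901b6dadfe34c1f9a freeze > deps.txt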

anhuong commented 2 days ago

quay.io/modh/fms-hf-tuning:84b0337b7baee119e909d4e901b6dadfe34c1f9a was built a month ago, on May 31st. pip freeze dependencies:

accelerate==0.30.1
aiohttp==3.9.5
aiosignal==1.3.1
attrs==23.2.0
certifi==2024.2.2
charset-normalizer==3.3.2
datasets==2.19.1
dill==0.3.8
docstring_parser==0.16
einops==0.8.0
filelock==3.14.0
fire==0.6.0
flash-attn==2.5.9.post1
fms-hf-tuning @ file:///tmp/fms_hf_tuning-0.1.dev1%2Bg84b0337.d20240531-py3-none-any.whl#sha256=fbb236a382849683a7fd49fd9e341440652b08eaf0353edee052952de7dab696
frozenlist==1.4.1
fsspec==2024.3.1
huggingface-hub==0.23.2
idna==3.7
Jinja2==3.1.4
markdown-it-py==3.0.0
MarkupSafe==2.1.5
mdurl==0.1.2
mpmath==1.3.0
multidict==6.0.5
multiprocess==0.70.16
networkx==3.3
numpy==1.26.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.5.40
nvidia-nvtx-cu12==12.1.105
packaging==24.0
pandas==2.2.2
peft==0.11.1
psutil==5.9.8
pyarrow==16.1.0
pyarrow-hotfix==0.6
Pygments==2.18.0
python-dateutil==2.9.0.post0
pytz==2024.1
PyYAML==6.0.1
regex==2024.5.15
requests==2.32.3
rich==13.7.1
safetensors==0.4.3
sentencepiece==0.2.0
shtab==1.7.1
simpleeval==0.9.13
six==1.16.0
sympy==1.12.1
termcolor==2.4.0
tokenizers==0.19.1
torch==2.3.0
tqdm==4.66.4
transformers==4.41.2
triton==2.3.0
trl==0.8.6
typing_extensions==4.12.0
tyro==0.8.4
tzdata==2024.1
urllib3==2.2.1
xxhash==3.4.1
yarl==1.9.4