Open kpouget opened 1 week ago
Hey @kpouget, thanks for raising this issue and for providing the kube configs, it was super helpful! I did some investigation in the quay.io/modh/fms-hf-tuning:release image and had some findings that were similar, and some that were a bit surprising (using the same model, bigcode/gpt_bigcode-santacoder).
The short story is that I was able to reproduce your findings of 4.41.2 being slower, using the same dataset and 0.1 for dataset replication. However, the speed of the tuning also seems to be inconsistent. For my tunings, I was generally seeing:
- 4.40.2: most trainings took 500-600 seconds, and generally sat around 550.
- 4.41.2: some trainings took ~700+ seconds, but some ran much faster (100-150 seconds), even though all trainings here leveraged the same config and produced the same loss. E.g.,
{'train_runtime': 692.3355, 'train_samples_per_second': 7.511, 'train_steps_per_second': 0.469, 'train_tokens_per_second': 1417.036, 'train_loss': 2.377412109375, 'epoch': 1.0}
{'train_runtime': 109.6998, 'train_samples_per_second': 47.402, 'train_steps_per_second': 2.963, 'train_tokens_per_second': 8943.169, 'train_loss': 2.377412109375, 'epoch': 1.0}
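One way to see that the two runs above did the same amount of work is to cross-check the reported metrics: runtime times the per-second rates should give roughly the same sample and token counts for both runs, so the difference is purely wall-clock time. A minimal sketch, using the numbers from the two log lines above:

```python
# Cross-check the two 4.41.2 runs: same work, very different wall-clock time.
runs = [
    {"train_runtime": 692.3355, "train_samples_per_second": 7.511,
     "train_tokens_per_second": 1417.036},
    {"train_runtime": 109.6998, "train_samples_per_second": 47.402,
     "train_tokens_per_second": 8943.169},
]

for r in runs:
    # Total samples/tokens processed, implied by the reported rates.
    r["samples"] = r["train_runtime"] * r["train_samples_per_second"]
    r["tokens"] = r["train_runtime"] * r["train_tokens_per_second"]

slow, fast = runs
print(round(slow["samples"]), round(fast["samples"]))  # both ~5200 samples
print(round(slow["tokens"]), round(fast["tokens"]))    # both ~981k tokens
print(f"speedup: {slow['train_runtime'] / fast['train_runtime']:.1f}x")
```

Both runs come out at the same sample and token counts, which is consistent with the identical loss and points at per-run overhead or throughput, not differing workloads.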
In all experiments that I ran, the tuning was running in the same pod and doing the data formatting etc. with the same mounts that pods spawned with the custom resource leverage. I'm curious if there is some caching at work here, but it is strange - I should also mention that the time per step looks pretty consistent in the progress bar output once things have actually started.
I did some investigation into the concurrency issues as well, and was not able to reproduce any problems there with either version of transformers. I did see the same odd behavior from 4.41.2 where some runs were going faster than others, but it didn't seem to depend on jobs running concurrently in the cluster, as I also saw slow/fast runs with 4.41.2 when just one job was running.
After discussing with @anhuong / @Ssukriti, we've decided to pin to 4.40 until the reasons for these inconsistent timings have been more thoroughly investigated.
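Such a pin is typically expressed as a version range in the project's requirements. A sketch of what it could look like (the exact constraint file and bounds used in fms-hf-tuning may differ):

```
# requirements.txt (illustrative): allow any 4.40.x but exclude the 4.41 series
transformers>=4.40,<4.41
```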
Thank you Alex! Michael's team will be investigating further to understand impact over larger set of tests with granite models.
There was a request to get the full set of dependencies in the latest RH image, to verify other dependencies besides transformers.
The RH image mentioned above is quay.io/modh/fms-hf-tuning:84b0337b7baee119e909d4e901b6dadfe34c1f9a
-> @anhuong, could you pull the dependency list from this image and attach it here? Alternatively, you could use any of our internal images that correspond to the 0.3 release and get the list as well. Thank you!
quay.io/modh/fms-hf-tuning:84b0337b7baee119e909d4e901b6dadfe34c1f9a was built a month ago, on May 31st.
pip freeze deps:
accelerate==0.30.1
aiohttp==3.9.5
aiosignal==1.3.1
attrs==23.2.0
certifi==2024.2.2
charset-normalizer==3.3.2
datasets==2.19.1
dill==0.3.8
docstring_parser==0.16
einops==0.8.0
filelock==3.14.0
fire==0.6.0
flash-attn==2.5.9.post1
fms-hf-tuning @ file:///tmp/fms_hf_tuning-0.1.dev1%2Bg84b0337.d20240531-py3-none-any.whl#sha256=fbb236a382849683a7fd49fd9e341440652b08eaf0353edee052952de7dab696
frozenlist==1.4.1
fsspec==2024.3.1
huggingface-hub==0.23.2
idna==3.7
Jinja2==3.1.4
markdown-it-py==3.0.0
MarkupSafe==2.1.5
mdurl==0.1.2
mpmath==1.3.0
multidict==6.0.5
multiprocess==0.70.16
networkx==3.3
numpy==1.26.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.5.40
nvidia-nvtx-cu12==12.1.105
packaging==24.0
pandas==2.2.2
peft==0.11.1
psutil==5.9.8
pyarrow==16.1.0
pyarrow-hotfix==0.6
Pygments==2.18.0
python-dateutil==2.9.0.post0
pytz==2024.1
PyYAML==6.0.1
regex==2024.5.15
requests==2.32.3
rich==13.7.1
safetensors==0.4.3
sentencepiece==0.2.0
shtab==1.7.1
simpleeval==0.9.13
six==1.16.0
sympy==1.12.1
termcolor==2.4.0
tokenizers==0.19.1
torch==2.3.0
tqdm==4.66.4
transformers==4.41.2
triton==2.3.0
trl==0.8.6
typing_extensions==4.12.0
tyro==0.8.4
tzdata==2024.1
urllib3==2.2.1
xxhash==3.4.1
yarl==1.9.4
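To follow up on the request above (verifying dependencies besides transformers), two such pip freeze listings can be diffed programmatically to surface every package whose version changed between the fast and slow images. A minimal sketch; the two short listings below are illustrative stand-ins, not the real outputs of the two images:

```python
def parse_freeze(text):
    """Map package name -> version from `pip freeze` output (== pins only)."""
    deps = {}
    for line in text.strip().splitlines():
        if "==" in line:
            name, version = line.split("==", 1)
            deps[name.strip()] = version.strip()
    return deps

def diff_freeze(old, new):
    """Return {package: (old_version, new_version)} for changed/added/removed packages."""
    a, b = parse_freeze(old), parse_freeze(new)
    return {pkg: (a.get(pkg), b.get(pkg))
            for pkg in sorted(set(a) | set(b))
            if a.get(pkg) != b.get(pkg)}

# Illustrative stand-ins for the fast (April) and slow (May) images.
fast_image = "transformers==4.40.2\naccelerate==0.29.3\ntorch==2.3.0"
slow_image = "transformers==4.41.2\naccelerate==0.30.1\ntorch==2.3.0"
print(diff_freeze(fast_image, slow_image))
# -> {'accelerate': ('0.29.3', '0.30.1'), 'transformers': ('4.40.2', '4.41.2')}
```

Running this over the real freeze outputs of the two images would show whether transformers is the only dependency that moved between them.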
Describe the bug
As part of our performance evaluation of the fms-hf-tuning fine-tuning stack, we observed a regression between the images we tested:
quay.io/modh/fms-hf-tuning:01b3824c9aba22d9d0695399681e6f0507840e7f --> fast (April 17)
quay.io/modh/fms-hf-tuning:bd8bf628cd739c7a201a976bc3c1096785353f1a --> slow (May 26)
quay.io/modh/fms-hf-tuning:84b0337b7baee119e909d4e901b6dadfe34c1f9a --> slow (May 21, planned for delivery)
Additionally, we observed that with the bd8bf628cd739c7a201a976bc3c1096785353f1a image, the multi-model tuning performance (multiple independent fine-tuning jobs, each running on a dedicated GPU) is degraded compared to the reference run (one job running alone in the cluster). This wasn't the case in our first tests with 84b0337b7baee119e909d4e901b6dadfe34c1f9a.
Fine-tuning of gpt_bigcode-santacoder against the Alpaca dataset.
Platform
OpenShift AI
Steps to reproduce
Run the fine-tuning with the quay.io/modh/fms-hf-tuning:release image, which currently hits the transformers regression, then compare train_runtime, train_samples_per_second, etc. and notice the difference.
Note that:
The dataset replication factor is controlled by DATASET_REPLICATION. Set it to 0.1 to get results faster.
Expected behavior
Observed behavior
See also
RHOAIENG-8551