Errors occur while fine-tuning the full parameters of Whisper-small

JiamingZhou777 commented 1 month ago

I encountered errors while trying to fine-tune the full parameters of Whisper-small. I have installed transformers==4.32.1. My environment details are listed below. Do you have any suggestions? Thanks!

[rank0]: Traceback (most recent call last):
[rank0]:   File "/mnt/storage1/jiaming_space/project/SPAPL_KidsASR//src/bin/train_asr.py", line 324, in <module>
[rank0]:     main()
[rank0]:   File "/mnt/storage1/jiaming_space/project/SPAPL_KidsASR//src/bin/train_asr.py", line 315, in main
[rank0]:     train_result = trainer.train(resume_from_checkpoint=checkpoint)
[rank0]:   File "/home/zhoujiaming/anaconda3/envs/child/lib/python3.10/site-packages/transformers/trainer.py", line 1559, in train
[rank0]:     return inner_training_loop(
[rank0]:   File "/home/zhoujiaming/anaconda3/envs/child/lib/python3.10/site-packages/transformers/trainer.py", line 1664, in _inner_training_loop
[rank0]:     self.model.gradient_checkpointing_enable()
[rank0]:   File "/home/zhoujiaming/anaconda3/envs/child/lib/python3.10/site-packages/transformers/modeling_utils.py", line 1715, in gradient_checkpointing_enable
[rank0]:     self.apply(partial(self._set_gradient_checkpointing, value=True))
[rank0]:   File "/home/zhoujiaming/anaconda3/envs/child/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1729, in __getattr__
[rank0]:     raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
[rank0]: AttributeError: 'WhisperForConditionalGeneration' object has no attribute '_set_gradient_checkpointing'. Did you mean: 'is_gradient_checkpointing'?
[rank1]: Traceback (most recent call last):
[rank1]:   File "/mnt/storage1/jiaming_space/project/SPAPL_KidsASR//src/bin/train_asr.py", line 324, in <module>
[rank1]:     main()
[rank1]:   File "/mnt/storage1/jiaming_space/project/SPAPL_KidsASR//src/bin/train_asr.py", line 315, in main
[rank1]:     train_result = trainer.train(resume_from_checkpoint=checkpoint)
[rank1]:   File "/home/zhoujiaming/anaconda3/envs/child/lib/python3.10/site-packages/transformers/trainer.py", line 1559, in train
[rank1]:     return inner_training_loop(
[rank1]:   File "/home/zhoujiaming/anaconda3/envs/child/lib/python3.10/site-packages/transformers/trainer.py", line 1664, in _inner_training_loop
[rank1]:     self.model.gradient_checkpointing_enable()
[rank1]:   File "/home/zhoujiaming/anaconda3/envs/child/lib/python3.10/site-packages/transformers/modeling_utils.py", line 1715, in gradient_checkpointing_enable
[rank1]:     self.apply(partial(self._set_gradient_checkpointing, value=True))
[rank1]:   File "/home/zhoujiaming/anaconda3/envs/child/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1729, in __getattr__
[rank1]:     raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
[rank1]: AttributeError: 'WhisperForConditionalGeneration' object has no attribute '_set_gradient_checkpointing'. Did you mean: 'is_gradient_checkpointing'?
W0912 11:14:32.462501 140640371372416 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 1464768 closing signal SIGTERM
E0912 11:14:32.728592 140640371372416 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 1464767) of binary: /home/zhoujiaming/anaconda3/envs/child/bin/python
Traceback (most recent call last):
  File "/home/zhoujiaming/anaconda3/envs/child/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/zhoujiaming/anaconda3/envs/child/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/home/zhoujiaming/anaconda3/envs/child/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/home/zhoujiaming/anaconda3/envs/child/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/zhoujiaming/anaconda3/envs/child/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/zhoujiaming/anaconda3/envs/child/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/mnt/storage1/jiaming_space/project/SPAPL_KidsASR//src/bin/train_asr.py FAILED

# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
_openmp_mutex             5.1                       1_gnu  
accelerate                0.34.2                   pypi_0    pypi
aiohappyeyeballs          2.4.0                    pypi_0    pypi
aiohttp                   3.10.5                   pypi_0    pypi
aiosignal                 1.3.1                    pypi_0    pypi
async-timeout             4.0.3                    pypi_0    pypi
attrs                     24.2.0                   pypi_0    pypi
audioread                 3.0.1                    pypi_0    pypi
bzip2                     1.0.8                h5eee18b_6  
ca-certificates           2024.7.2             h06a4308_0  
certifi                   2024.8.30                pypi_0    pypi
cffi                      1.17.1                   pypi_0    pypi
charset-normalizer        3.3.2                    pypi_0    pypi
click                     8.1.7                    pypi_0    pypi
dataclasses-json          0.6.7                    pypi_0    pypi
datasets                  3.0.0                    pypi_0    pypi
decorator                 5.1.1                    pypi_0    pypi
dill                      0.3.8                    pypi_0    pypi
evaluate                  0.4.3                    pypi_0    pypi
filelock                  3.15.4                   pypi_0    pypi
frozenlist                1.4.1                    pypi_0    pypi
fsspec                    2024.6.1                 pypi_0    pypi
huggingface-hub           0.24.6                   pypi_0    pypi
idna                      3.8                      pypi_0    pypi
jinja2                    3.1.4                    pypi_0    pypi
jiwer                     3.0.4                    pypi_0    pypi
joblib                    1.4.2                    pypi_0    pypi
lazy-loader               0.4                      pypi_0    pypi
ld_impl_linux-64          2.38                 h1181459_1  
libffi                    3.3                  he6710b0_2  
libgcc-ng                 11.2.0               h1234567_1  
libgomp                   11.2.0               h1234567_1  
librosa                   0.10.2.post1             pypi_0    pypi
libstdcxx-ng              11.2.0               h1234567_1  
libuuid                   1.41.5               h5eee18b_0  
llvmlite                  0.43.0                   pypi_0    pypi
markupsafe                2.1.5                    pypi_0    pypi
marshmallow               3.22.0                   pypi_0    pypi
mpmath                    1.3.0                    pypi_0    pypi
msgpack                   1.1.0                    pypi_0    pypi
multidict                 6.1.0                    pypi_0    pypi
multiprocess              0.70.16                  pypi_0    pypi
mypy-extensions           1.0.0                    pypi_0    pypi
ncurses                   6.4                  h6a678d5_0  
networkx                  3.3                      pypi_0    pypi
numba                     0.60.0                   pypi_0    pypi
numpy                     2.0.2                    pypi_0    pypi
nvidia-cublas-cu12        12.1.3.1                 pypi_0    pypi
nvidia-cuda-cupti-cu12    12.1.105                 pypi_0    pypi
nvidia-cuda-nvrtc-cu12    12.1.105                 pypi_0    pypi
nvidia-cuda-runtime-cu12  12.1.105                 pypi_0    pypi
nvidia-cudnn-cu12         9.1.0.70                 pypi_0    pypi
nvidia-cufft-cu12         11.0.2.54                pypi_0    pypi
nvidia-curand-cu12        10.3.2.106               pypi_0    pypi
nvidia-cusolver-cu12      11.4.5.107               pypi_0    pypi
nvidia-cusparse-cu12      12.1.0.106               pypi_0    pypi
nvidia-nccl-cu12          2.20.5                   pypi_0    pypi
nvidia-nvjitlink-cu12     12.6.68                  pypi_0    pypi
nvidia-nvtx-cu12          12.1.105                 pypi_0    pypi
openssl                   1.1.1w               h7f8727e_0  
packaging                 24.1                     pypi_0    pypi
pandas                    2.2.2                    pypi_0    pypi
pip                       24.2            py310h06a4308_0  
platformdirs              4.3.2                    pypi_0    pypi
pooch                     1.8.2                    pypi_0    pypi
psutil                    6.0.0                    pypi_0    pypi
pyarrow                   17.0.0                   pypi_0    pypi
pycparser                 2.22                     pypi_0    pypi
python                    3.10.0               h12debd9_5  
python-dateutil           2.9.0.post0              pypi_0    pypi
pytz                      2024.2                   pypi_0    pypi
pyyaml                    6.0.2                    pypi_0    pypi
rapidfuzz                 3.9.7                    pypi_0    pypi
readline                  8.2                  h5eee18b_0  
regex                     2024.7.24                pypi_0    pypi
requests                  2.32.3                   pypi_0    pypi
safetensors               0.4.4                    pypi_0    pypi
scikit-learn              1.5.2                    pypi_0    pypi
scipy                     1.14.1                   pypi_0    pypi
setuptools                72.1.0          py310h06a4308_0  
six                       1.16.0                   pypi_0    pypi
soundfile                 0.12.1                   pypi_0    pypi
soxr                      0.5.0.post1              pypi_0    pypi
sqlite                    3.45.3               h5eee18b_0  
sympy                     1.13.2                   pypi_0    pypi
threadpoolctl             3.5.0                    pypi_0    pypi
tk                        8.6.14               h39e8969_0  
tokenizers                0.13.3                   pypi_0    pypi
torch                     2.4.1                    pypi_0    pypi
tqdm                      4.66.5                   pypi_0    pypi
transformers              4.32.1                   pypi_0    pypi
triton                    3.0.0                    pypi_0    pypi
typing-extensions         4.12.2                   pypi_0    pypi
typing-inspect            0.9.0                    pypi_0    pypi
tzdata                    2024.1                   pypi_0    pypi
urllib3                   2.2.2                    pypi_0    pypi
wheel                     0.43.0          py310h06a4308_0  
xxhash                    3.5.0                    pypi_0    pypi
xz                        5.4.6                h5eee18b_1  
yarl                      1.11.1                   pypi_0    pypi
zlib                      1.2.13               h5eee18b_1

balaji1312 commented 1 month ago

Hi @JiamingZhou777 ,

Thanks for raising this issue.

1. Could you clarify whether you were attempting to load an OpenAI model directly from the HF Hub or resuming the finetuning process of one of our models? There's a check in modeling_utils.py that might be relevant. It deletes the gradient_checkpointing attribute during finetuning, potentially requiring you to manually set it again. On newer versions of transformers there is also an alternate check that determines if the model was originally created with transformers >4.35.
2. It's worth double-checking that the Whisper class loaded during finetuning is indeed the one from src/models/modeling_whisper.py. While this might not be the primary cause, it could help us narrow down the issue.
3. I've also tested the code without IterableDataset loading on transformers==4.36, and it appears to be working correctly. If your data size allows, you could try testing with this version, as transformers==4.35 introduced some changes to gradient checkpointing functionality.
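
The checks above can be gathered in one place with a small helper. This is only a sketch: the function name `describe_checkpointing_setup` and the dictionary keys are made up for illustration, and the 4.35 threshold is taken from the note above rather than verified against the transformers changelog.

```python
def describe_checkpointing_setup(model, transformers_version):
    """Report which class was loaded and which checkpointing API applies.

    `model` can be any object; `transformers_version` is a string
    such as "4.32.1". Both names are illustrative assumptions.
    """
    # Only the major and minor components matter for this comparison.
    major, minor = (int(part) for part in transformers_version.split(".")[:2])
    cls = type(model)
    return {
        # Shows whether the class comes from the repo's own model file
        # or from the copy shipped inside transformers.
        "model_class": f"{cls.__module__}.{cls.__qualname__}",
        # The method that the traceback reports as missing.
        "has_old_hook": hasattr(model, "_set_gradient_checkpointing"),
        # Assumed boundary for the gradient checkpointing changes.
        "transformers_at_least_4_35": (major, minor) >= (4, 35),
    }
```

In the training script this could be called as `describe_checkpointing_setup(model, transformers.__version__)` right before `trainer.train(...)`. A result with `has_old_hook` false and `transformers_at_least_4_35` false would match the traceback reported in this issue.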

Let me know your findings on these points, and we'll continue troubleshooting from there.

JiamingZhou777 commented 1 month ago

Thank you for your response. I have tried the suggested methods, but unfortunately, they didn't work. Could you provide a requirements.txt file? Also, while fine-tuning HuBERT, I frequently encounter a "CUDA out of memory" error while the dev loss is being computed. Do you have any suggestions for resolving this? Although I haven't fully run your code, I have cited your paper in mine and submitted it to a conference. I appreciate your assistance!

JiamingZhou777 commented 1 month ago

PS: The out-of-memory error occurs even with the batch size set to 1 for both the training and development stages.

Diamondfan commented 1 month ago

Hi @JiamingZhou777, thanks for using our code for ASR system development.

For the gradient checkpointing issue, look at the error: "AttributeError: 'WhisperForConditionalGeneration' object has no attribute '_set_gradient_checkpointing'. Did you mean: 'is_gradient_checkpointing'?"

It is a mismatch between modeling_utils.py and the model definition in how the gradient checkpointing feature is enabled, most likely caused by a transformers version mismatch. You can change either file to make them consistent. Follow the call stack in the traceback to see where to make the change.
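
A minimal sketch of that mismatch, using stand-in classes instead of torch and transformers so it runs anywhere. All class names here are hypothetical. `gradient_checkpointing_enable` copies the call shown in the traceback, and the hook signature `(module, value=False)` is an assumption inferred from that `partial(..., value=True)` call.

```python
from functools import partial


class TinyModule:
    """Stand-in for torch.nn.Module that only implements apply()."""

    def __init__(self, *children):
        self.children = list(children)

    def apply(self, fn):
        # Visit the children first, then the module itself.
        for child in self.children:
            child.apply(fn)
        fn(self)
        return self


class OldStyleBase(TinyModule):
    """Mimics the modeling_utils.py call from the traceback."""

    def gradient_checkpointing_enable(self):
        self.apply(partial(self._set_gradient_checkpointing, value=True))


class Encoder(TinyModule):
    gradient_checkpointing = False


class ModelWithoutHook(OldStyleBase):
    """A model definition that lacks the hook, as in this issue."""


class ModelWithHook(OldStyleBase):
    """The same model, made consistent with the old base class."""

    def _set_gradient_checkpointing(self, module, value=False):
        # Flip the flag only on submodules that support checkpointing.
        if isinstance(module, Encoder):
            module.gradient_checkpointing = value
```

Calling `ModelWithoutHook(Encoder()).gradient_checkpointing_enable()` raises an `AttributeError` for the missing method, while the `ModelWithHook` variant sets the flag on its encoder. The other direction of the fix, upgrading transformers so that the library matches the model file, needs no code change.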

As for the OOM issue, please make sure you have enough GPU memory to run the HuBERT large model. Otherwise, try the HuBERT-base model first.
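
As a rough way to compare the two model sizes, the memory needed for weights, gradients, and optimizer state can be estimated from the parameter count. This is a back-of-the-envelope sketch, not a measurement: it assumes fp32 training with an Adam-style optimizer (16 bytes per parameter), the parameter counts are approximate figures used as assumptions, and activation memory, which grows with audio length, is not included.

```python
def training_state_gib(num_params, bytes_per_param=16):
    """Rough size of weights + gradients + optimizer state, in GiB.

    bytes_per_param=16 assumes fp32 training with Adam:
    4 (weights) + 4 (gradients) + 8 (two optimizer moments).
    Activations are not included.
    """
    return num_params * bytes_per_param / 2**30


# Approximate parameter counts, treated here as assumptions.
APPROX_PARAMS = {"HuBERT-base": 95_000_000, "HuBERT-large": 317_000_000}

for name, count in APPROX_PARAMS.items():
    print(f"{name}: ~{training_state_gib(count):.1f} GiB before activations")
```

If the estimate alone is well below the GPU's capacity and the error still appears at batch size 1, the activations for long utterances or the evaluation loop are more likely culprits than the model size.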

Thanks, Ruchao
