Amrit-Bhaskar-abhask10 opened this issue 8 months ago
Thanks for your interest! It looks like the error only happens on my side when only one GPU is visible. You may want to try checking echo ${CUDA_VISIBLE_DEVICES}, setting CUDA_LAUNCH_BLOCKING=1, and removing accelerate; please see an example here.
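A minimal sketch of those checks, assuming the same run_cpo_llmmt.py entry point used later in this thread (note that CUDA_LAUNCH_BLOCKING=1 only makes CUDA calls synchronous so the failing operation is reported precisely; it does not reduce memory usage):

# Confirm which GPUs the process can actually see
echo ${CUDA_VISIBLE_DEVICES}
nvidia-smi

# Re-run with synchronous CUDA calls to get an accurate stack trace
CUDA_LAUNCH_BLOCKING=1 python run_cpo_llmmt.py <same arguments as in the script below>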
Thanks, @fe1ixxu for the suggestion. I am trying it now without accelerate.
For the CPO fine-tuning, the README suggests running bash runs/cpo_ft.sh ${your_output_dir}. I replaced accelerate with a plain python call in that file, and the file is now:
OUTPUT_DIR=${1:-"./amr_cpo_ft"}
pairs=${2:-"cs-en,en-cs"}
export HF_DATASETS_CACHE=".cache/huggingface_cache/datasets"
export TRANSFORMERS_CACHE=".cache/models/"
# random port between 30000 and 50000
port=$(( RANDOM % (50000 - 30000 + 1 ) + 30000 ))
python run_cpo_llmmt.py \
--model_name_or_path haoranxu/ALMA-13B-Pretrain \
--tokenizer_name haoranxu/ALMA-13B-Pretrain \
--peft_model_id haoranxu/ALMA-13B-Pretrain-LoRA \
--cpo_scorer kiwi_xcomet \
--beta 0.1 \
--use_peft \
--use_fast_tokenizer False \
--cpo_data_path haoranxu/ALMA-R-Preference \
--do_train \
--language_pairs ${pairs} \
--low_cpu_mem_usage \
--bf16 \
--learning_rate 1e-4 \
--weight_decay 0.01 \
--gradient_accumulation_steps 1 \
--lr_scheduler_type inverse_sqrt \
--warmup_ratio 0.01 \
--ignore_pad_token_for_loss \
--ignore_prompt_token_for_loss \
--per_device_train_batch_size 2 \
--evaluation_strategy no \
--save_strategy steps \
--save_total_limit 1 \
--logging_strategy steps \
--logging_steps 0.05 \
--output_dir ${OUTPUT_DIR} \
--num_train_epochs 1 \
--prediction_loss_only \
--max_new_tokens 256 \
--max_source_length 256 \
--max_prompt_length 256 \
--max_length 512 \
--seed 42 \
--overwrite_output_dir \
--report_to none \
--overwrite_cache
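For comparison, a multi-GPU launch through accelerate (as the original runs/cpo_ft.sh presumably did before the edit) would look roughly like the sketch below; the DeepSpeed config file name here is an assumption modeled on the eval config used elsewhere in this thread, and the port variable computed above is only needed for a distributed launch like this one:

# Hypothetical multi-GPU launch; the config path is assumed, not copied from the repo
accelerate launch --main_process_port ${port} \
    --config_file configs/deepspeed_train_config_bf16.yaml \
    run_cpo_llmmt.py \
    --model_name_or_path haoranxu/ALMA-13B-Pretrain
# ...plus the same remaining arguments shown in the python call above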
After running this modified script, I get the following error:
04/02/2024 03:40:46 - WARNING - accelerate.utils.other - Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
Traceback (most recent call last):
File "/home/amrbhask/ALMA/run_cpo_llmmt.py", line 149, in <module>
main()
File "/home/amrbhask/ALMA/run_cpo_llmmt.py", line 120, in main
trainer = CPOTrainer(
^^^^^^^^^^^
File "/home/amrbhask/miniconda3/envs/alma-r/lib/python3.11/site-packages/trl/trainer/cpo_trainer.py", line 281, in __init__
super().__init__(
File "/home/amrbhask/miniconda3/envs/alma-r/lib/python3.11/site-packages/transformers/trainer.py", line 495, in __init__
self._move_model_to_device(model, args.device)
File "/home/amrbhask/miniconda3/envs/alma-r/lib/python3.11/site-packages/transformers/trainer.py", line 736, in _move_model_to_device
model = model.to(device)
^^^^^^^^^^^^^^^^
File "/home/amrbhask/miniconda3/envs/alma-r/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1152, in to
return self._apply(convert)
^^^^^^^^^^^^^^^^^^^^
File "/home/amrbhask/miniconda3/envs/alma-r/lib/python3.11/site-packages/torch/nn/modules/module.py", line 802, in _apply
module._apply(fn)
File "/home/amrbhask/miniconda3/envs/alma-r/lib/python3.11/site-packages/torch/nn/modules/module.py", line 802, in _apply
module._apply(fn)
File "/home/amrbhask/miniconda3/envs/alma-r/lib/python3.11/site-packages/torch/nn/modules/module.py", line 802, in _apply
module._apply(fn)
[Previous line repeated 4 more times]
File "/home/amrbhask/miniconda3/envs/alma-r/lib/python3.11/site-packages/torch/nn/modules/module.py", line 825, in _apply
param_applied = fn(param)
^^^^^^^^^
File "/home/amrbhask/miniconda3/envs/alma-r/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1150, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 270.00 MiB. GPU 0 has a total capacity of 39.56 GiB of which 160.81 MiB is free. Including non-PyTorch memory, this process has 39.40 GiB memory in use. Of the allocated memory 38.98 GiB is allocated by PyTorch, and 13.73 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Please let me know how to resolve this error.
The output of echo ${CUDA_VISIBLE_DEVICES} is 0,1,2,3, and the nvidia-smi output is:
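For reference, the per-GPU memory shown by nvidia-smi can also be queried non-interactively, which is handy when checking which device still has headroom:

# Print used and total memory for each GPU in CSV form
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv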
Environment: I followed all the setup steps given in the README.md. I have 4 GPUs in the setup.
The quick-start example for Chinese-to-English translation works fine.
However, when I run the command below:
accelerate launch --config_file configs/deepspeed_eval_config_bf16.yaml run_llmmt.py --model_name_or_path haoranxu/ALMA-13B-R --do_predict --low_cpu_mem_usage --language_pairs en-cs,cs-en --mmt_data_path ./human_written_data/ --per_device_eval_batch_size 1 --output_dir ./your_output_dir/ --predict_with_generate --max_new_tokens 256 --max_source_length 256 --bf16 --seed 42 --num_beams 5 --overwrite_cache --overwrite_output_dir
I am getting an error: