fe1ixxu / ALMA

State-of-the-art LLM-based translation models.
MIT License

Error on running the evaluation command #39

Open Amrit-Bhaskar-abhask10 opened 3 months ago

Amrit-Bhaskar-abhask10 commented 3 months ago

Environment: I followed all the setup steps given in the README.md. I have 4 GPUs in my setup.

[image: environment setup screenshot]

The quick start example given for Chinese to English translation works fine.

However, when I run the command below:

```
accelerate launch --config_file configs/deepspeed_eval_config_bf16.yaml run_llmmt.py \
    --model_name_or_path haoranxu/ALMA-13B-R --do_predict --low_cpu_mem_usage \
    --language_pairs en-cs,cs-en --mmt_data_path ./human_written_data/ \
    --per_device_eval_batch_size 1 --output_dir ./your_output_dir/ --predict_with_generate \
    --max_new_tokens 256 --max_source_length 256 --bf16 --seed 42 --num_beams 5 \
    --overwrite_cache --overwrite_output_dir
```

I am getting an error:


```
  File "/home/amrbhask/miniconda3/envs/alma-r/lib/python3.11/site-packages/accelerate/state.py", line 236, in __init__
    torch.cuda.set_device(self.device)
  File "/home/amrbhask/miniconda3/envs/alma-r/lib/python3.11/site-packages/torch/cuda/__init__.py", line 408, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```

(The same traceback is printed, interleaved, by each of the launched processes.)
fe1ixxu commented 3 months ago

Thanks for your interest! On my side, this error only happens when just one GPU is visible. You may want to try:

  1. Check the visible GPU ids: `echo ${CUDA_VISIBLE_DEVICES}`
  2. Re-run the code with `CUDA_LAUNCH_BLOCKING=1` set
  3. Run the evaluation without accelerate; please see an example here (a sketch follows below)
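
For reference, a minimal sketch of those three checks, reusing the flags from the evaluation command quoted above (the plain `python` invocation in step 3 is an assumed stand-in for the linked example, not necessarily the exact command from the README):

```bash
# 1. Check which GPU ids are visible to the launched processes
echo ${CUDA_VISIBLE_DEVICES}

# 2. Re-run with synchronous kernel launches so the stack trace points at the failing call
CUDA_LAUNCH_BLOCKING=1 accelerate launch --config_file configs/deepspeed_eval_config_bf16.yaml \
    run_llmmt.py \
    --model_name_or_path haoranxu/ALMA-13B-R --do_predict --low_cpu_mem_usage \
    --language_pairs en-cs,cs-en --mmt_data_path ./human_written_data/ \
    --per_device_eval_batch_size 1 --output_dir ./your_output_dir/ --predict_with_generate \
    --max_new_tokens 256 --max_source_length 256 --bf16 --seed 42 --num_beams 5 \
    --overwrite_cache --overwrite_output_dir

# 3. Run the same evaluation as a single process without accelerate
#    (assumed invocation; see the linked example for the exact command)
python run_llmmt.py \
    --model_name_or_path haoranxu/ALMA-13B-R --do_predict --low_cpu_mem_usage \
    --language_pairs en-cs,cs-en --mmt_data_path ./human_written_data/ \
    --per_device_eval_batch_size 1 --output_dir ./your_output_dir/ --predict_with_generate \
    --max_new_tokens 256 --max_source_length 256 --bf16 --seed 42 --num_beams 5 \
    --overwrite_cache --overwrite_output_dir
```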
Amrit-Bhaskar-abhask10 commented 3 months ago

Thanks, @fe1ixxu, for the suggestion. I am trying it now without accelerate.

For CPO fine-tuning, the README suggests running `bash runs/cpo_ft.sh ${your_output_dir}`.

I replaced the `accelerate launch` invocation with plain `python` in that script, which now looks like this:

```bash
OUTPUT_DIR=${1:-"./amr_cpo_ft"}
pairs=${2:-"cs-en,en-cs"}
export HF_DATASETS_CACHE=".cache/huggingface_cache/datasets"
export TRANSFORMERS_CACHE=".cache/models/"
# random port between 30000 and 50000
port=$(( RANDOM % (50000 - 30000 + 1 ) + 30000 ))

python run_cpo_llmmt.py \
    --model_name_or_path haoranxu/ALMA-13B-Pretrain \
    --tokenizer_name haoranxu/ALMA-13B-Pretrain \
    --peft_model_id haoranxu/ALMA-13B-Pretrain-LoRA \
    --cpo_scorer kiwi_xcomet \
    --beta 0.1 \
    --use_peft \
    --use_fast_tokenizer False \
    --cpo_data_path haoranxu/ALMA-R-Preference \
    --do_train \
    --language_pairs ${pairs} \
    --low_cpu_mem_usage \
    --bf16 \
    --learning_rate 1e-4 \
    --weight_decay 0.01 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type inverse_sqrt \
    --warmup_ratio 0.01 \
    --ignore_pad_token_for_loss \
    --ignore_prompt_token_for_loss \
    --per_device_train_batch_size 2 \
    --evaluation_strategy no \
    --save_strategy steps \
    --save_total_limit 1 \
    --logging_strategy steps \
    --logging_steps 0.05 \
    --output_dir ${OUTPUT_DIR} \
    --num_train_epochs 1 \
    --prediction_loss_only \
    --max_new_tokens 256 \
    --max_source_length 256 \
    --max_prompt_length 256 \
    --max_length 512 \
    --seed 42 \
    --overwrite_output_dir \
    --report_to none \
    --overwrite_cache
```

After running this script, I get the following error:

```
04/02/2024 03:40:46 - WARNING - accelerate.utils.other - Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
Traceback (most recent call last):
  File "/home/amrbhask/ALMA/run_cpo_llmmt.py", line 149, in <module>
    main()
  File "/home/amrbhask/ALMA/run_cpo_llmmt.py", line 120, in main
    trainer = CPOTrainer(
              ^^^^^^^^^^^
  File "/home/amrbhask/miniconda3/envs/alma-r/lib/python3.11/site-packages/trl/trainer/cpo_trainer.py", line 281, in __init__
    super().__init__(
  File "/home/amrbhask/miniconda3/envs/alma-r/lib/python3.11/site-packages/transformers/trainer.py", line 495, in __init__
    self._move_model_to_device(model, args.device)
  File "/home/amrbhask/miniconda3/envs/alma-r/lib/python3.11/site-packages/transformers/trainer.py", line 736, in _move_model_to_device
    model = model.to(device)
            ^^^^^^^^^^^^^^^^
  File "/home/amrbhask/miniconda3/envs/alma-r/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1152, in to
    return self._apply(convert)
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/amrbhask/miniconda3/envs/alma-r/lib/python3.11/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/home/amrbhask/miniconda3/envs/alma-r/lib/python3.11/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/home/amrbhask/miniconda3/envs/alma-r/lib/python3.11/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  [Previous line repeated 4 more times]
  File "/home/amrbhask/miniconda3/envs/alma-r/lib/python3.11/site-packages/torch/nn/modules/module.py", line 825, in _apply
    param_applied = fn(param)
                    ^^^^^^^^^
  File "/home/amrbhask/miniconda3/envs/alma-r/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1150, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 270.00 MiB. GPU 0 has a total capacity of 39.56 GiB of which 160.81 MiB is free. Including non-PyTorch memory, this process has 39.40 GiB memory in use. Of the allocated memory 38.98 GiB is allocated by PyTorch, and 13.73 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
```

Please let me know how to resolve this error.

The output of `echo ${CUDA_VISIBLE_DEVICES}` is `0,1,2,3`.

And the `nvidia-smi` output is:

[image: nvidia-smi output]