Luodian / Otter

🦦 Otter, a multi-modal model based on OpenFlamingo (an open-source version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning ability.
https://otter-ntu.github.io/
MIT License

RuntimeError: CUDA error: device-side assert triggered #216

Closed · Darren-greenhand closed this issue 1 year ago

Darren-greenhand commented 1 year ago

Hello Otter team!

When I try to train the luodian/OTTER-9B-INIT model on the SN and SD datasets using instruction_following.py, I run into some problems.

I can't use the SN dataset because the image_ids in SN_instructions.json don't match those in SN.json.
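
A quick way to confirm the mismatch (a rough sketch; I'm assuming the instructions file has a top-level "data" dict whose entries carry an "image_ids" list, and that SN.json is a dict keyed by image id, so adjust if the actual schema differs):

import json

# load both files and collect image ids referenced by the instructions
with open("/tf/data/SN/SN_instructions.json") as f:
    instructions = json.load(f).get("data", {})
with open("/tf/data/SN/SN.json") as f:
    images = json.load(f)

missing = {img_id
           for inst in instructions.values()
           for img_id in inst.get("image_ids", [])
           if img_id not in images}
print(f"{len(missing)} image_ids are referenced but absent from SN.json")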

With the SD dataset I didn't hit that problem, but RuntimeError: CUDA error: device-side assert triggered occurs.

I turned to Stack Overflow, but most of the solutions there are for classification problems QAQ. I checked the vocab_size and it is correct.
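
To rule out out-of-range token ids (the usual cause of this assert), here is roughly the check I ran. This is a minimal sketch; the model/tokenizer/batch names in the usage comment are illustrative, not Otter's actual API:

import torch

def check_labels(labels: torch.Tensor, vocab_size: int, ignore_index: int = -100) -> None:
    # every non-ignored label must be a valid class index for cross_entropy
    valid = labels[labels != ignore_index]
    if valid.numel() == 0:
        return
    lo, hi = int(valid.min()), int(valid.max())
    assert 0 <= lo and hi < vocab_size, (
        f"label out of range: [{lo}, {hi}] vs vocab_size {vocab_size}"
    )

# usage (names illustrative): take vocab_size from the embedding table rather
# than from config, since len(tokenizer) can drift after special tokens are added:
#   vocab_size = model.get_input_embeddings().num_embeddings
#   check_labels(batch["labels"], vocab_size)

Setting CUDA_LAUNCH_BLOCKING=1 in the environment before launching also makes the assert surface at the real call site instead of a later kernel.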

How I start training:

export PYTHONPATH=.

accelerate launch --config_file  ./pipeline/accelerate_configs/accelerate_config_fsdp.yaml \
instruction_following.py \
--pretrained_model_name_or_path  /tf/ckpt/OTTER-LLaMA7B-Init  \
--external_save_dir /tf/finetuned \
--mimicit_path  /tf/data/SD/SD_instructions.json \
--images_path  /tf/data/SD/SD.json \
--train_config_ic_path  /tf/data/SD/SD_train.json \
--batch_size  4 \
--num_epochs  9 \
--report_to_wandb \
--wandb_entity  ntu-slab \
--run_name  OTTER-LLaMA7B-densecaption \
--wandb_project  OTTER-LLaMA7B \
--workers  1 \
--lr_scheduler  cosine \
--learning_rate  1e-5 \
--warmup_steps_ratio  0.01

The detailed stack trace:

../aten/src/ATen/native/cuda/Loss.cu:240: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [16,0,0] Assertion `t >= 0 && t < n_classes` failed.
[the same assertion repeats for threads [17,0,0] through [31,0,0]]
  0%|                                                                                                                                                                                          | 0/35982 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "/tf/Otter/instruction_following.py", line 634, in <module>
    main()
  File "/tf/Otter/instruction_following.py", line 579, in main
    train_one_epoch(
  File "/tf/Otter/instruction_following.py", line 115, in train_one_epoch
    loss_mimicit = model(
  File "/tf/anaconda3/envs/otter/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/tf/anaconda3/envs/otter/lib/python3.9/site-packages/accelerate/utils/operations.py", line 581, in forward
    return model_forward(*args, **kwargs)
  File "/tf/anaconda3/envs/otter/lib/python3.9/site-packages/accelerate/utils/operations.py", line 569, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/tf/anaconda3/envs/otter/lib/python3.9/site-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
    return func(*args, **kwargs)
  File "/tf/anaconda3/envs/otter/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/tf/Otter/otter/modeling_otter.py", line 921, in forward
    output = self.lang_encoder(
  File "/tf/anaconda3/envs/otter/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/tf/Otter/otter/modeling_otter.py", line 511, in forward
    return super().forward(*input, **kwargs)  # Call the other parent's forward method
  File "/tf/Otter/xformers_model/llama.py", line 722, in forward
    loss = loss_fct(shift_logits, shift_labels)
  File "/tf/anaconda3/envs/otter/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/tf/anaconda3/envs/otter/lib/python3.9/site-packages/torch/nn/modules/loss.py", line 1174, in forward
    return F.cross_entropy(input, target, weight=self.weight,
  File "/tf/anaconda3/envs/otter/lib/python3.9/site-packages/torch/nn/functional.py", line 3029, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

wandb: Waiting for W&B process to finish... (failed 1).
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /tf/Otter/wandb/offline-run-20230720_144301-a6hakqug
wandb: Find logs at: ./wandb/offline-run-20230720_144301-a6hakqug/logs
Traceback (most recent call last):
  File "/tf/anaconda3/envs/otter/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/tf/anaconda3/envs/otter/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/tf/anaconda3/envs/otter/lib/python3.9/site-packages/accelerate/commands/launch.py", line 979, in launch_command
    simple_launcher(args)
  File "/tf/anaconda3/envs/otter/lib/python3.9/site-packages/accelerate/commands/launch.py", line 628, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/tf/anaconda3/envs/otter/bin/python', 'instruction_following.py', '--pretrained_model_name_or_path', '/tf/ckpt/OTTER-LLaMA7B-Init', '--external_save_dir', '/tf/finetuned', '--mimicit_path', '/tf/data/SD/SD_instructions.json', '--images_path', '/tf/data/SD/SD.json', '--train_config_ic_path', '/tf/data/SD/SD_train.json', '--batch_size', '4', '--num_epochs', '9', '--report_to_wandb', '--wandb_entity', 'ntu-slab', '--run_name', 'OTTER-LLaMA7B-densecaption', '--wandb_project', 'OTTER-LLaMA7B', '--workers', '1', '--lr_scheduler', 'cosine', '--learning_rate', '1e-5', '--warmup_steps_ratio', '0.01']' returned non-zero exit status 1.
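
For reference, the assertion itself just means that some target id handed to cross_entropy falls outside [0, n_classes); it can be reproduced in isolation on any CUDA machine, independent of Otter (a minimal sketch):

import torch
import torch.nn.functional as F

logits = torch.randn(4, 10, device="cuda")             # 10 classes
targets = torch.tensor([1, 3, 42, 5], device="cuda")   # 42 >= n_classes
F.cross_entropy(logits, targets)  # -> RuntimeError: CUDA error: device-side assert triggered
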
Luodian commented 1 year ago

You may refer to PR #217 and check out the new fixed branch~

Luodian commented 1 year ago

--train_config_ic_path /tf/data/SD/SD_train.json \ should be --train_config_path /tf/data/SD/SD_train.json \

It's our bug: we are aiming to divide the training datasets into different groups, but we missed adding train_config_path.

Darren-greenhand commented 1 year ago

Thanks a lot for your help!