GCYZSL / MoLA


Training on ScienceQA data: error during training #18

Open today-still-sleep-early opened 4 days ago

today-still-sleep-early commented 4 days ago

While running Training on ScienceQA data, I hit the error below. What could be causing it?

```
Loading checkpoint shards: 100%|██████████| 2/2 [00:33<00:00, 16.56s/it]
NEW PARAMETERS
obalance False
/root/miniconda3/envs/mola/lib/python3.10/site-packages/peft/utils/other.py:102: FutureWarning: prepare_model_for_int8_training is deprecated and will be removed in a future version. Use prepare_model_for_kbit_training instead.
  warnings.warn(
TRAINING MOLA
Checkpoint ./step2_biology256r_8mbs_no8bit_scale10/adapter_model.bin not found
Map: 100%|██████████| 199/199 [00:00<00:00, 602.33 examples/s]
Map: 100%|██████████| 1/1 [00:00<00:00, 76.06 examples/s]
Using the WANDB_DISABLED environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
Did not load optimizer and scheduler **
  0%|          | 0/1 [00:00<?, ?it/s]
../aten/src/ATen/native/cuda/Loss.cu:242: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [0,0,0] Assertion `t >= 0 && t < n_classes` failed.
(the same assertion is repeated for threads [1,0,0] through [31,0,0])
Traceback (most recent call last):
  File "/home/loraMoE/liuwp/MoLA/mola_training.py", line 302, in <module>
    fire.Fire(train)
  File "/root/miniconda3/envs/mola/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/root/miniconda3/envs/mola/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/root/miniconda3/envs/mola/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/loraMoE/liuwp/MoLA/mola_training.py", line 292, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint, resume_from_checkpoint_optim=False)
  File "/home/loraMoE/liuwp/MoLA/src/mola_trainer_hacked.py", line 1637, in train
    return inner_training_loop(
  File "/home/loraMoE/liuwp/MoLA/src/mola_trainer_hacked.py", line 1911, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/loraMoE/liuwp/MoLA/src/mola_trainer_hacked.py", line 2659, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/loraMoE/liuwp/MoLA/src/mola_trainer_hacked.py", line 2691, in compute_loss
    outputs = model(**inputs)
  File "/root/miniconda3/envs/mola/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/loraMoE/liuwp/MoLA/src/mola_peft_model_hacked.py", line 980, in forward
    return self.base_model(
  File "/root/miniconda3/envs/mola/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/miniconda3/envs/mola/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/loraMoE/liuwp/MoLA/src/mola_modeling_llama_hacked.py", line 1315, in forward
    aux_loss = load_balancing_loss_func(
  File "/home/loraMoE/liuwp/MoLA/src/mola_modeling_llama_hacked.py", line 58, in load_balancing_loss_func
    expert_mask = torch.nn.functional.one_hot(selected_experts, num_experts[layer_i])
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
  0%|          | 0/1 [00:06<?, ?it/s]
```
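For anyone debugging this kind of device-side assert, here is a minimal sketch, assuming the failure comes from an out-of-range index reaching `nll_loss` or `one_hot`. The helper `check_index_range` and all example values below are hypothetical; only the variable names mirror the traceback.

```python
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # set before CUDA initializes so the failing kernel is reported synchronously

import torch

def check_index_range(indices: torch.Tensor, num_classes: int, name: str = "indices") -> None:
    """Raise if any index falls outside [0, num_classes); the -100 padding label is ignored."""
    idx = indices[indices != -100]
    if idx.numel() > 0 and (idx.min() < 0 or idx.max() >= num_classes):
        raise ValueError(
            f"{name}: values span [{int(idx.min())}, {int(idx.max())}], "
            f"but num_classes is {num_classes}"
        )

# Hypothetical check mirroring the traceback: every expert index chosen by the router
# must be smaller than that layer's expert count before one_hot is applied.
selected_experts = torch.tensor([0, 1, 7, 2])
num_experts_layer_i = 8
check_index_range(selected_experts, num_experts_layer_i, "selected_experts")

# Same idea for the labels feeding the cross-entropy / nll loss:
# every label id must be below the tokenizer/model vocabulary size (or equal to -100).
labels = torch.tensor([1, 5, 31999, -100])
vocab_size = 32000
check_index_range(labels, vocab_size, "labels")
```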

GCYZSL commented 2 days ago

Based on your log, this looks like a CUDA problem. It may be that the Transformers package is incompatible with your CUDA version or GPU.
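A quick, purely diagnostic way to see which versions the compatibility question hinges on (this just prints whatever is installed locally):

```python
import torch
import transformers

# Report the library builds and the GPU they are running against.
print("transformers:", transformers.__version__)
print("torch:", torch.__version__, "| compiled for CUDA", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("compute capability:", torch.cuda.get_device_capability(0))
```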

today-still-sleep-early commented 2 days ago

Hello, which CUDA versions are required?

GCYZSL commented 2 days ago

That probably depends on your GPU and system version, so I am not entirely sure. The simplest thing is to try a few different Transformers versions. If that still does not work, you may need to consult online resources on matching the CUDA version with the Transformers version.

today-still-sleep-early commented 2 days ago

Do you have a more detailed recommended experiment environment configuration?

GCYZSL commented 2 days ago

Our requirements.txt is already quite detailed. If you are using an A6000 or A100, it should definitely work. For newer or older cards, we unfortunately do not have the resources to test, sorry!
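As a sketch of how to verify a local environment against that file, assuming the repo's requirements.txt uses `==` pins (adjust the matching if it uses looser specifiers):

```python
# Compare installed package versions against the '==' pins in requirements.txt.
import re
from importlib.metadata import version, PackageNotFoundError

with open("requirements.txt") as f:
    for line in f:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        m = re.match(r"([A-Za-z0-9_.\-]+)==(\S+)", line)
        if not m:
            continue  # skip unpinned or otherwise-specified requirements
        name, pinned = m.groups()
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = "not installed"
        marker = "" if installed == pinned else "  <-- mismatch"
        print(f"{name}: pinned {pinned}, installed {installed}{marker}")
```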