[Open] today-still-sleep-early opened this issue 4 days ago
Judging from your log, this looks like a CUDA problem; the Transformers package may be incompatible with your CUDA version or GPU.
Hi, which CUDA versions are required?
That likely depends on your GPU and system version, so I am not entirely sure. The simplest approach is to try several versions of Transformers. If that still doesn't work, you may need to consult online resources on matching the CUDA version to the Transformers version.
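If it helps while comparing against requirements.txt, here is a minimal sketch (assuming the standard PyPI package names) that reports the installed versions of the relevant packages without importing them, so it works even when the packages themselves fail to load:

```python
from importlib import metadata

def installed_versions(packages):
    """Return {package: version string or None if not installed}."""
    out = {}
    for pkg in packages:
        try:
            out[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            out[pkg] = None
    return out

if __name__ == "__main__":
    for pkg, ver in installed_versions(
        ["torch", "transformers", "peft", "accelerate"]
    ).items():
        print(pkg, ver or "not installed")
```

Posting that output alongside `nvidia-smi` usually makes version mismatches easy to spot.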
Do you have a recommended, more detailed environment configuration for the experiments?
Our requirements.txt is already quite detailed. If you are on an A6000 or A100 it should definitely work. For newer or older cards we unfortunately do not have the resources to test, sorry!
When running "Training on ScienceQA data", I get the error below. What is going on?

```
Loading checkpoint shards: 100%|██████████████████████████████████████████████| 2/2 [00:33<00:00, 16.56s/it]
NEW PARAMETERS
obalance False
/root/miniconda3/envs/mola/lib/python3.10/site-packages/peft/utils/other.py:102: FutureWarning: prepare_model_for_int8_training is deprecated and will be removed in a future version. Use prepare_model_for_kbit_training instead.
  warnings.warn(
TRAINING MOLA
Checkpoint ./step2_biology256r_8mbs_no8bit_scale10/adapter_model.bin not found
Map: 100%|██████████████████████████████████████████████████████████████| 199/199 [00:00<00:00, 602.33 examples/s]
Map: 100%|███████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 76.06 examples/s]
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
Did not load optimizer and scheduler
  0%|                                                              | 0/1 [00:00<?, ?it/s]
../aten/src/ATen/native/cuda/Loss.cu:242: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [0,0,0] Assertion `t >= 0 && t < n_classes` failed.
[... the same assertion repeats for threads [1,0,0] through [31,0,0] ...]
Traceback (most recent call last):
  File "/home/loraMoE/liuwp/MoLA/mola_training.py", line 302, in <module>
    fire.Fire(train)
  File "/root/miniconda3/envs/mola/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/root/miniconda3/envs/mola/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/root/miniconda3/envs/mola/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/loraMoE/liuwp/MoLA/mola_training.py", line 292, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint, resume_from_checkpoint_optim=False)
  File "/home/loraMoE/liuwp/MoLA/src/mola_trainer_hacked.py", line 1637, in train
    return inner_training_loop(
  File "/home/loraMoE/liuwp/MoLA/src/mola_trainer_hacked.py", line 1911, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/loraMoE/liuwp/MoLA/src/mola_trainer_hacked.py", line 2659, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/loraMoE/liuwp/MoLA/src/mola_trainer_hacked.py", line 2691, in compute_loss
    outputs = model(**inputs)
  File "/root/miniconda3/envs/mola/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/loraMoE/liuwp/MoLA/src/mola_peft_model_hacked.py", line 980, in forward
    return self.base_model(
  File "/root/miniconda3/envs/mola/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/miniconda3/envs/mola/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/loraMoE/liuwp/MoLA/src/mola_modeling_llama_hacked.py", line 1315, in forward
    aux_loss = load_balancing_loss_func(
  File "/home/loraMoE/liuwp/MoLA/src/mola_modeling_llama_hacked.py", line 58, in load_balancing_loss_func
    expert_mask = torch.nn.functional.one_hot(selected_experts, num_experts[layer_i])
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
  0%|                                                              | 0/1 [00:06<?, ?it/s]
```
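For what it is worth, the assertion `t >= 0 && t < n_classes` in nll_loss usually means some label id fell outside `[0, vocab_size)`, for example from a tokenizer/vocab-size mismatch; the `one_hot(selected_experts, num_experts[layer_i])` call in the stack trace can trip the same class of device-side assert if a routed expert index is greater than or equal to the per-layer expert count. A minimal, dependency-free sketch (function and variable names are hypothetical) for checking label ids on the CPU, applied to `labels.tolist()`:

```python
def check_label_range(labels, vocab_size, ignore_index=-100):
    """Return the sorted set of label ids outside [0, vocab_size).

    `labels` is an int or arbitrarily nested lists of ints, e.g. the
    result of `batch["labels"].tolist()`. Ids equal to `ignore_index`
    are skipped, since the loss masks them out.
    """
    def flatten(x):
        if isinstance(x, int):
            yield x
        else:
            for item in x:
                yield from flatten(item)

    bad = {t for t in flatten(labels)
           if t != ignore_index and not 0 <= t < vocab_size}
    return sorted(bad)

# Hypothetical usage before launching training:
#     bad = check_label_range(batch["labels"].tolist(), model.config.vocab_size)
#     assert not bad, f"label ids outside [0, vocab_size): {bad}"
```

Running once with `CUDA_LAUNCH_BLOCKING=1`, as the log itself suggests, also makes the failing call appear at the correct frame in the stack trace instead of at a later, unrelated API call.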