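First of all, many thanks to the author for sharing the WakeNet code. I set up the environment following your tutorial, but ran into the following problem when running train.py: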
Fail to speed up training via apex.
Namespace(augment=True, backbone='fca34', dataset='SWIM', eval_path='/home/xu/WakeNet/SWIM_Dataset_1.0.0/test.txt', freeze_bn=False, hyp='hyp.py', load=False, multi_scale=False, resume=False, target_size=[768], train_path='/home/xu/WakeNet/SWIM_Dataset_1.0.0/train.txt', training_size=768, weight='/home/xu/WakeNet/models/pretrained/fca34.pth')
{'lr0': 0.0001, 'warmup_lr': 1e-05, 'warm_epoch': 1.0, 'num_classes': 1.0, 'epochs': 100.0, 'batch_size': 12.0, 'save_interval': 5.0, 'test_interval': 5.0, 'lambda1': 1.0, 'lambda2': 0.2}
Weight loaded.
Model Summary: 280 layers, 4.83088e+07 parameters, 4.83088e+07 gradients
Epoch gpu_mem cls reg_box reg_ldm total targets img_size
0%| | 0/580 [00:04<?, ?it/s]
Traceback (most recent call last):
  File "/home/xu/WakeNet/train.py", line 220, in <module>
    train_model(arg, hyps)
  File "/home/xu/WakeNet/train.py", line 126, in train_model
    losses = model(ims, gt_boxes, gt_landmarks, process=epoch / epochs)
  File "/home/xu/anaconda3/envs/wakenet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/xu/anaconda3/envs/wakenet/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 161, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/xu/anaconda3/envs/wakenet/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 171, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/xu/anaconda3/envs/wakenet/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/home/xu/anaconda3/envs/wakenet/lib/python3.6/site-packages/torch/_utils.py", line 428, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/xu/anaconda3/envs/wakenet/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/xu/anaconda3/envs/wakenet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/xu/WakeNet/models/WakeNet.py", line 94, in forward
    land_pred = torch.cat([self.ldm_head0(features[0]), self.ldm_head1(features[1]), self.ldm_head2(features[2]),
  File "/home/xu/anaconda3/envs/wakenet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/xu/WakeNet/models/MultiHeads.py", line 143, in forward
    x2 = self.convs2(x2)
  File "/home/xu/anaconda3/envs/wakenet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/xu/anaconda3/envs/wakenet/lib/python3.6/site-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/home/xu/anaconda3/envs/wakenet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/xu/anaconda3/envs/wakenet/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 423, in forward
    return self._conv_forward(input, self.weight)
  File "/home/xu/anaconda3/envs/wakenet/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 420, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: CUDA error: no kernel image is available for execution on the device
Below is some information about my environment. Have you run into this problem before? Many thanks!
OS: Ubuntu 16.04 LTS
GPU: Mon Nov 29 14:42:29 2021
nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Thu_Jun_11_22:26:38_PDT_2020
Cuda compilation tools, release 11.0, V11.0.194
Build cuda_11.0_bu.TC445_37.28540450_0
PyTorch was installed with the following command:
pip install torch==1.7.0+cu110 torchvision==0.8.0+cu110 torchaudio==0.7.0 -f https://download.pytorch.org/whl/torch_stable.html
I also tried installing with conda:
conda install pytorch==1.7.0 torchvision==0.8.0 torchaudio==0.7.0 cudatoolkit=11.0 -c pytorch
Both installations produce the same error.
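This error usually means the installed binaries contain no kernels built for the GPU's architecture, so one check that may help is to compare each card's compute capability with the architectures the installed PyTorch build was compiled for. The following is a minimal sketch, not a definitive diagnosis. It assumes `torch.cuda.get_arch_list()` exists in the installed version (as far as I know it does from PyTorch 1.7 onward). The helper `is_supported` is hypothetical, and its rule is a simplification of CUDA's compatibility rules: an `sm_XY` binary runs on devices with the same major version and an equal or higher minor version, and a `compute_XY` PTX entry can be JIT-compiled for equal or newer devices.

```python
import re

# Matches entries such as "sm_80" or "compute_80"; the last digit is the minor version.
ARCH_RE = re.compile(r"(sm|compute)_(\d+)(\d)")


def is_supported(capability, arch_list):
    """Return True if a device with this (major, minor) capability can run
    kernels from a build compiled for the architectures in arch_list.
    Hypothetical helper; simplified compatibility rule."""
    major, minor = capability
    for entry in arch_list:
        match = ARCH_RE.fullmatch(entry)
        if match is None:
            continue  # skip entries this sketch does not understand
        kind = match.group(1)
        arch = (int(match.group(2)), int(match.group(3)))
        if kind == "sm" and arch[0] == major and arch[1] <= minor:
            return True  # binary compatible within the same major version
        if kind == "compute" and arch <= (major, minor):
            return True  # PTX can be JIT-compiled for equal or newer devices
    return False


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment.")
    else:
        if not torch.cuda.is_available():
            print("CUDA is not available to PyTorch.")
        else:
            archs = torch.cuda.get_arch_list()  # assumed available (PyTorch >= 1.7)
            print("Compiled architectures:", archs)
            for i in range(torch.cuda.device_count()):
                cap = torch.cuda.get_device_capability(i)
                name = torch.cuda.get_device_name(i)
                print(i, name, cap, "supported:", is_supported(cap, archs))
```

If any card is reported as unsupported, the mismatch is between the GPU and the PyTorch build rather than anything in the WakeNet code.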
Could this be caused by having two GPUs? I would be very grateful for any reply. Thanks!
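One way to test the dual-GPU hypothesis is to expose only one card to the process at a time and see whether the error persists on each. Below is a minimal sketch, assuming train.py simply uses whichever devices are visible. The helper name `single_gpu_env` is hypothetical, and the launch line is left commented out because train.py's arguments are not shown here.

```python
import os


def single_gpu_env(gpu_index, base_env=None):
    """Return a copy of the environment in which only one GPU is visible to CUDA.
    Hypothetical helper for illustration."""
    env = dict(os.environ if base_env is None else base_env)
    # CUDA_VISIBLE_DEVICES must be set before the CUDA runtime initializes,
    # so pass this environment to a fresh process rather than changing it mid-run.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


env = single_gpu_env(0)
print("CUDA_VISIBLE_DEVICES =", env["CUDA_VISIBLE_DEVICES"])

# Example launch (assumption: train.py runs with its default arguments):
# import subprocess, sys
# subprocess.run([sys.executable, "train.py"], env=env, check=True)
```

If the same error appears with each card on its own, the cause is probably not the multi-GPU setup itself.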