Lilytopia / WakeNet

A CNN-based optical image ship wake detector.
MIT License

Some questions about the runtime environment #1

Closed libzzluo closed 2 years ago

libzzluo commented 2 years ago

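First of all, thank you very much for sharing the WakeNet code. I set up the environment following your tutorial, but ran into the following problem when running train.py: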

Fail to speed up training via apex.

Namespace(augment=True, backbone='fca34', dataset='SWIM', eval_path='/home/xu/WakeNet/SWIM_Dataset_1.0.0/test.txt', freeze_bn=False, hyp='hyp.py', load=False, multi_scale=False, resume=False, target_size=[768], train_path='/home/xu/WakeNet/SWIM_Dataset_1.0.0/train.txt', training_size=768, weight='/home/xu/WakeNet/models/pretrained/fca34.pth')
{'lr0': 0.0001, 'warmup_lr': 1e-05, 'warm_epoch': 1.0, 'num_classes': 1.0, 'epochs': 100.0, 'batch_size': 12.0, 'save_interval': 5.0, 'test_interval': 5.0, 'lambda1': 1.0, 'lambda2': 0.2}
Weight loaded.
Model Summary: 280 layers, 4.83088e+07 parameters, 4.83088e+07 gradients

 Epoch   gpu_mem       cls   reg_box   reg_ldm     total   targets  img_size

0%| | 0/580 [00:04<?, ?it/s]
Traceback (most recent call last):
  File "/home/xu/WakeNet/train.py", line 220, in <module>
    train_model(arg, hyps)
  File "/home/xu/WakeNet/train.py", line 126, in train_model
    losses = model(ims, gt_boxes, gt_landmarks, process=epoch / epochs)
  File "/home/xu/anaconda3/envs/wakenet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/xu/anaconda3/envs/wakenet/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 161, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/xu/anaconda3/envs/wakenet/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 171, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/xu/anaconda3/envs/wakenet/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/home/xu/anaconda3/envs/wakenet/lib/python3.6/site-packages/torch/_utils.py", line 428, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/xu/anaconda3/envs/wakenet/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/xu/anaconda3/envs/wakenet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/xu/WakeNet/models/WakeNet.py", line 94, in forward
    land_pred = torch.cat([self.ldm_head0(features[0]), self.ldm_head1(features[1]), self.ldm_head2(features[2]),
  File "/home/xu/anaconda3/envs/wakenet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/xu/WakeNet/models/MultiHeads.py", line 143, in forward
    x2 = self.convs2(x2)
  File "/home/xu/anaconda3/envs/wakenet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/xu/anaconda3/envs/wakenet/lib/python3.6/site-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/home/xu/anaconda3/envs/wakenet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/xu/anaconda3/envs/wakenet/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 423, in forward
    return self._conv_forward(input, self.weight)
  File "/home/xu/anaconda3/envs/wakenet/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 420, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: CUDA error: no kernel image is available for execution on the device
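A note on the final error line: "no kernel image is available for execution on the device" is commonly seen when the installed PyTorch binary contains no kernels built for the GPU's compute capability, although other causes are possible. The sketch below is one way to check this, not a confirmed diagnosis for this report. The helper names `parse_arch` and `binary_supports` are hypothetical. `torch.cuda.get_device_capability` is a standard PyTorch call; `torch.cuda.get_arch_list` is assumed to be present in the installed release, so the code guards for its absence.

```python
import re


def parse_arch(tag):
    """Split a tag such as 'sm_70' or 'compute_75' into (kind, (major, minor))."""
    m = re.fullmatch(r"(sm|compute)_(\d+)(\d)", tag)
    if m is None:
        raise ValueError(f"unrecognised arch tag: {tag!r}")
    return m.group(1), (int(m.group(2)), int(m.group(3)))


def binary_supports(capability, arch_list):
    """Return True if a build with `arch_list` can run on a GPU of `capability`.

    Rule used (CUDA binary compatibility): an 'sm_XY' kernel runs on devices
    with the same major version and a minor version >= Y; a 'compute_XY' PTX
    image can be JIT-compiled for any device with capability >= X.Y.
    """
    major, minor = capability
    for tag in arch_list:
        kind, (a, b) = parse_arch(tag)
        if kind == "sm" and a == major and b <= minor:
            return True
        if kind == "compute" and (a, b) <= (major, minor):
            return True
    return False


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None  # lets the helpers above be used without PyTorch

    if torch is not None and torch.cuda.is_available():
        # Assumption: get_arch_list exists in this PyTorch release.
        get_arch_list = getattr(torch.cuda, "get_arch_list", None)
        archs = get_arch_list() if get_arch_list is not None else []
        print("torch", torch.__version__, "built for CUDA", torch.version.cuda)
        print("architectures in this build:", archs)
        for i in range(torch.cuda.device_count()):
            cap = torch.cuda.get_device_capability(i)
            name = torch.cuda.get_device_name(i)
            print(i, name, cap, "supported:", binary_supports(cap, archs))
```

If a device reports `supported: False`, the usual remedy is a PyTorch build that targets that GPU's architecture, which may mean a different PyTorch/CUDA combination than the one installed here.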


Below is some information about my environment. Have you run into this problem before? Thank you very much!
OS: Ubuntu 16.04 LTS
GPU: Mon Nov 29 14:42:29 2021
[image]

nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Thu_Jun_11_22:26:38_PDT_2020
Cuda compilation tools, release 11.0, V11.0.194
Build cuda_11.0_bu.TC445_37.28540450_0

I installed PyTorch with the following command:
pip install torch==1.7.0+cu110 torchvision==0.8.0+cu110 torchaudio==0.7.0 -f https://download.pytorch.org/whl/torch_stable.html
I also tried installing with conda:
conda install pytorch==1.7.0 torchvision==0.8.0 torchaudio==0.7.0 cudatoolkit=11.0 -c pytorch
Both installations produce the same error.

Could this be caused by having two GPUs? I would be very grateful for any reply. Thanks!
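One way to test the dual-GPU hypothesis is to expose a single card to the process and repeat a small convolution, since the traceback shows the model running under `torch.nn.DataParallel`. The following is a minimal sketch under that assumption; `restrict_to_gpu` is a hypothetical helper, and the tiny `Conv2d` merely stands in for the failing layer. `CUDA_VISIBLE_DEVICES` is the standard CUDA environment variable and has to be set before CUDA is initialised, so setting it before importing torch is the safest order.

```python
import os


def restrict_to_gpu(index):
    """Expose only one physical GPU to CUDA libraries initialised afterwards."""
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)
    return os.environ["CUDA_VISIBLE_DEVICES"]


if __name__ == "__main__":
    restrict_to_gpu(0)  # try 0 first, then 1, to test each card on its own

    try:
        import torch
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        # With the variable set, only one device should be visible.
        print("visible devices:", torch.cuda.device_count())
        # A small convolution, the same kind of op that failed in the traceback.
        conv = torch.nn.Conv2d(1, 1, kernel_size=3).cuda()
        x = torch.ones(1, 1, 8, 8, device="cuda")
        print("conv output shape:", tuple(conv(x).shape))
```

If the convolution fails on each card individually, the number of GPUs is unlikely to be the cause, and the PyTorch build itself is the more probable suspect. If it fails on only one card, that card may differ in architecture from the other.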