facebookresearch / adaptive_teacher

This repo provides the source code for "Cross-Domain Adaptive Teacher for Object Detection".

RuntimeError: CUDA error: no kernel image is available for execution on the device #65

Open Manjuphoenix opened 1 year ago

Manjuphoenix commented 1 year ago

There was an error while reproducing the code on a machine with the following spec:

Ubuntu: 20.04
GPU: Nvidia A6000
Python version: 3.8.0

pip list:

nvcc --version:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Mon_Oct_24_19:12:58_PDT_2022
Cuda compilation tools, release 12.0, V12.0.76
Build cuda_12.0.r12.0/compiler.31968024_0

I didn't change much in the config for the Cityscapes to Foggy Cityscapes setting.

Error message:

DAobjTwoStagePseudoLabGeneralizedRCNN(
  (backbone): vgg_backbone(
    (vgg0): Sequential(
      (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU(inplace=True)
      (3): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (4): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (5): ReLU(inplace=True)
      (6): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    )
    (vgg1): Sequential(
      (0): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU(inplace=True)
      (3): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (4): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (5): ReLU(inplace=True)
      (6): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    )
    (vgg2): Sequential(
      (0): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU(inplace=True)
      (3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (4): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (5): ReLU(inplace=True)
      (6): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (7): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (8): ReLU(inplace=True)
      (9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    )
    (vgg3): Sequential(
      (0): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU(inplace=True)
      (3): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (4): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (5): ReLU(inplace=True)
      (6): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (7): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (8): ReLU(inplace=True)
      (9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    )
    (vgg4): Sequential(
      (0): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU(inplace=True)
      (3): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (4): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (5): ReLU(inplace=True)
      (6): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (7): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (8): ReLU(inplace=True)
      (9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    )
  )
  (proposal_generator): PseudoLabRPN(
    (rpn_head): StandardRPNHead(
      (conv): Conv2d(
        512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)
        (activation): ReLU()
      )
      (objectness_logits): Conv2d(512, 15, kernel_size=(1, 1), stride=(1, 1))
      (anchor_deltas): Conv2d(512, 60, kernel_size=(1, 1), stride=(1, 1))
    )
    (anchor_generator): DefaultAnchorGenerator(
      (cell_anchors): BufferList()
    )
  )
  (roi_heads): StandardROIHeadsPseudoLab(
    (box_pooler): ROIPooler(
      (level_poolers): ModuleList(
        (0): ROIAlign(output_size=(7, 7), spatial_scale=0.03125, sampling_ratio=0, aligned=True)
      )
    )
    (box_head): FastRCNNConvFCHead(
      (flatten): Flatten(start_dim=1, end_dim=-1)
      (fc1): Linear(in_features=25088, out_features=1024, bias=True)
      (fc_relu1): ReLU()
      (fc2): Linear(in_features=1024, out_features=1024, bias=True)
      (fc_relu2): ReLU()
    )
    (box_predictor): FastRCNNOutputLayers(
      (cls_score): Linear(in_features=1024, out_features=9, bias=True)
      (bbox_pred): Linear(in_features=1024, out_features=32, bias=True)
    )
  )
  (D_img): FCDiscriminator_img(
    (conv1): Conv2d(512, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (conv2): Conv2d(256, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (conv3): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (classifier): Conv2d(128, 1, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (leaky_relu): LeakyReLU(negative_slope=0.2, inplace=True)
  )
)
[06/13 22:42:17 fvcore.common.checkpoint]: No checkpoint found. Initializing model from scratch
Exception during training:
Traceback (most recent call last):
  File "/four_tb/manjunath/adaptive_teacher/adapteacher/engine/trainer.py", line 404, in train_loop
    self.run_step_full_semisup()
  File "/four_tb/manjunath/adaptive_teacher/adapteacher/engine/trainer.py", line 512, in run_step_full_semisup
    record_dict, _, _, _ = self.model(
  File "/home/user/anaconda3/envs/fbadapt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/four_tb/manjunath/adaptive_teacher/adapteacher/modeling/meta_arch/rcnn.py", line 207, in forward
    images = self.preprocess_image(batched_inputs)
  File "/home/user/anaconda3/envs/fbadapt/lib/python3.8/site-packages/detectron2/modeling/meta_arch/rcnn.py", line 225, in preprocess_image
    images = [(x - self.pixel_mean) / self.pixel_std for x in images]
  File "/home/user/anaconda3/envs/fbadapt/lib/python3.8/site-packages/detectron2/modeling/meta_arch/rcnn.py", line 225, in <listcomp>
    images = [(x - self.pixel_mean) / self.pixel_std for x in images]
RuntimeError: CUDA error: no kernel image is available for execution on the device
[06/13 22:42:18 d2.engine.hooks]: Total training time: 0:00:00 (0:00:00 on hooks)
[06/13 22:42:18 d2.utils.events]: iter: 0 lr: N/A max_mem: 368M
Traceback (most recent call last):
  File "train_net.py", line 73, in <module>
    launch(
  File "/home/user/anaconda3/envs/fbadapt/lib/python3.8/site-packages/detectron2/engine/launch.py", line 82, in launch
    main_func(*args)
  File "train_net.py", line 66, in main
    return trainer.train()
  File "/four_tb/manjunath/adaptive_teacher/adapteacher/engine/trainer.py", line 386, in train
    self.train_loop(self.start_iter, self.max_iter)
  File "/four_tb/manjunath/adaptive_teacher/adapteacher/engine/trainer.py", line 404, in train_loop
    self.run_step_full_semisup()
  File "/four_tb/manjunath/adaptive_teacher/adapteacher/engine/trainer.py", line 512, in run_step_full_semisup
    record_dict, _, _, _ = self.model(
  File "/home/user/anaconda3/envs/fbadapt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/four_tb/manjunath/adaptive_teacher/adapteacher/modeling/meta_arch/rcnn.py", line 207, in forward
    images = self.preprocess_image(batched_inputs)
  File "/home/user/anaconda3/envs/fbadapt/lib/python3.8/site-packages/detectron2/modeling/meta_arch/rcnn.py", line 225, in preprocess_image
    images = [(x - self.pixel_mean) / self.pixel_std for x in images]
  File "/home/user/anaconda3/envs/fbadapt/lib/python3.8/site-packages/detectron2/modeling/meta_arch/rcnn.py", line 225, in <listcomp>
    images = [(x - self.pixel_mean) / self.pixel_std for x in images]
RuntimeError: CUDA error: no kernel image is available for execution on the device

I thought it was a CUDA version issue, but running the same code in a Docker container (CUDA version 10.1) on the same machine gave the same error.

hellowangqian commented 1 year ago

Usually this error is caused by an incompatibility between the CUDA version used to compile PyTorch and your GPU. The A6000 may not be supported by a torch build with cu10.1; try installing a torch build with cu11.x.
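To check whether this applies to your environment, you can compare the GPU's compute capability with the architectures your PyTorch build was compiled for. The A6000 is an Ampere card (compute capability 8.6). Below is a minimal sketch, not an official diagnostic: `is_arch_supported` and `report` are hypothetical helper names, and it assumes your PyTorch version exposes `torch.cuda.get_arch_list()` (the script falls back gracefully if it does not). Note that the `nvcc` version on the system is not necessarily the CUDA version a prebuilt torch wheel was compiled with; `torch.version.cuda` reports the latter.

```python
def is_arch_supported(capability, arch_list):
    """Return True if a build listing `arch_list` has code usable on the GPU.

    capability: (major, minor) tuple, e.g. (8, 6) for an A6000.
    arch_list:  strings such as "sm_75" (binary code) or "compute_75" (PTX).
    """
    major, minor = capability
    for arch in arch_list:
        kind, _, digits = arch.partition("_")
        if not digits.isdigit():
            continue  # skip entries this sketch does not understand
        arch_major, arch_minor = divmod(int(digits), 10)
        # Binary (sm_XY) code runs on GPUs with the same major version
        # and an equal or higher minor version.
        if kind == "sm" and arch_major == major and arch_minor <= minor:
            return True
        # PTX (compute_XY) can be JIT-compiled for any equal or newer GPU.
        if kind == "compute" and (arch_major, arch_minor) <= (major, minor):
            return True
    return False


def report():
    """Print what the installed PyTorch build supports on GPU 0."""
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
        return
    print("torch:", torch.__version__, "| built with CUDA:", torch.version.cuda)
    if not torch.cuda.is_available():
        print("CUDA is not available to this PyTorch build")
        return
    capability = torch.cuda.get_device_capability(0)
    print("GPU:", torch.cuda.get_device_name(0), "| capability:", capability)
    get_arch_list = getattr(torch.cuda, "get_arch_list", None)
    if get_arch_list is None:
        print("This PyTorch version does not expose torch.cuda.get_arch_list()")
        return
    arch_list = get_arch_list()
    print("compiled architectures:", arch_list)
    print("GPU supported by this build:", is_arch_supported(capability, arch_list))


if __name__ == "__main__":
    report()
```

If the last line prints `False`, the build has no kernels for your GPU, which matches the "no kernel image is available" error; installing a torch build compiled against a newer CUDA toolkit (and a detectron2 build matching it) is the usual remedy.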