facebookresearch / MaskFormer

Per-Pixel Classification is Not All You Need for Semantic Segmentation (NeurIPS 2021, spotlight)

Unable to train the model #69

Open kagawa588 opened 2 years ago

kagawa588 commented 2 years ago

Hi,

Thanks for your great work! I recently tried to train the model myself, but moving the model from CPU to GPU takes a very long time (about an hour), and then training fails. Could you please give me some suggestions? Did I do something wrong?

Thanks in advance!

My environment is below:

sys.platform            linux
Python                  3.7.0 (default, Oct 9 2018, 10:31:47) [GCC 7.3.0]
numpy                   1.21.5
detectron2              0.6 @/home/mu/anaconda3/envs/maskformer/lib/python3.7/site-packages/detectron2
Compiler                GCC 7.3
CUDA compiler           CUDA 10.2
detectron2 arch flags   3.7, 5.0, 5.2, 6.0, 6.1, 7.0, 7.5
DETECTRON2_ENV_MODULE
PyTorch                 1.8.2 @/home/mu/anaconda3/envs/maskformer/lib/python3.7/site-packages/torch
PyTorch debug build     False
GPU available           Yes
GPU 0                   NVIDIA GeForce RTX 3080 Laptop GPU (arch=8.6)
Driver version          510.60.02
CUDA_HOME               /usr/local/cuda
Pillow                  9.2.0
torchvision             0.9.2 @/home/mu/anaconda3/envs/maskformer/lib/python3.7/site-packages/torchvision
torchvision arch flags  3.5, 5.0, 6.0, 7.0, 7.5
fvcore                  0.1.5.post20220512
iopath                  0.1.9
cv2                     4.6.0
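
The arch flags in this dump can be compared against the GPU's compute capability. Below is a minimal sketch of that comparison, assuming that a binary built for compute capability X.Y runs natively only on GPUs with the same major version and an equal or higher minor version. The helper names (`parse_arch`, `parse_arch_flags`, `has_native_kernels`) are hypothetical and are not part of detectron2 or PyTorch; the input strings are copied from the environment listing.

```python
from typing import List, Tuple


def parse_arch(text: str) -> Tuple[int, int]:
    """Turn a compute-capability string such as '8.6' into (8, 6)."""
    major, minor = text.strip().split(".")
    return int(major), int(minor)


def parse_arch_flags(flags: str) -> List[Tuple[int, int]]:
    """Parse a comma-separated arch-flag string as printed in the env dump."""
    return [parse_arch(part) for part in flags.split(",") if part.strip()]


def has_native_kernels(gpu_arch: str, arch_flags: str) -> bool:
    """Return True if any compiled arch shares the GPU's major version
    and has a minor version that does not exceed the GPU's.

    Assumption: this mirrors the usual CUDA binary-compatibility rule;
    it does not account for embedded PTX that may be JIT-compiled.
    """
    gpu_major, gpu_minor = parse_arch(gpu_arch)
    return any(
        major == gpu_major and minor <= gpu_minor
        for major, minor in parse_arch_flags(arch_flags)
    )


# Values copied from the environment listing in this report.
GPU_ARCH = "8.6"
DETECTRON2_FLAGS = "3.7, 5.0, 5.2, 6.0, 6.1, 7.0, 7.5"
TORCHVISION_FLAGS = "3.5, 5.0, 6.0, 7.0, 7.5"

print("detectron2 covers GPU: ", has_native_kernels(GPU_ARCH, DETECTRON2_FLAGS))
print("torchvision covers GPU:", has_native_kernels(GPU_ARCH, TORCHVISION_FLAGS))
```

Both checks print `False` for the values above, so it may be worth confirming that the installed PyTorch and CUDA build supports this GPU. Whether that explains the cuDNN error is not established here.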


The error is below:

res4.9.conv3.norm.num_batches_tracked
res5.0.conv1.norm.num_batches_tracked
res5.0.conv2.norm.num_batches_tracked
res5.0.conv3.norm.num_batches_tracked
res5.0.shortcut.norm.num_batches_tracked
res5.1.conv1.norm.num_batches_tracked
res5.1.conv2.norm.num_batches_tracked
res5.1.conv3.norm.num_batches_tracked
res5.2.conv1.norm.num_batches_tracked
res5.2.conv2.norm.num_batches_tracked
res5.2.conv3.norm.num_batches_tracked
stem.conv1.norm.num_batches_tracked
stem.conv2.norm.num_batches_tracked
stem.conv3.norm.num_batches_tracked
stem.fc.{bias, weight}
[08/21 20:18:39 d2.engine.train_loop]: Starting training from iteration 0
ERROR [08/21 20:20:24 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
  File "/cloud/maskformer/lib/python3.7/site-packages/detectron2/engine/train_loop.py", line 149, in train
    self.run_step()
  File "/cloud/maskformer/lib/python3.7/site-packages/detectron2/engine/defaults.py", line 494, in run_step
    self._trainer.run_step()
  File "/cloud/maskformer/lib/python3.7/site-packages/detectron2/engine/train_loop.py", line 285, in run_step
    losses.backward()
  File "/cloud/maskformer/lib/python3.7/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/cloud/maskformer/lib/python3.7/site-packages/torch/autograd/__init__.py", line 147, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution
[08/21 20:20:24 d2.engine.hooks]: Total training time: 0:01:45 (0:00:00 on hooks)
[08/21 20:20:24 d2.utils.events]: iter: 0 lr: N/A max_mem: 5604M
Traceback (most recent call last):
  File "train_net.py", line 270, in <module>
    args=(args,),
  File "/cloud/maskformer/lib/python3.7/site-packages/detectron2/engine/launch.py", line 82, in launch
    main_func(*args)
  File "train_net.py", line 258, in main
    return trainer.train()
  File "/cloud/maskformer/lib/python3.7/site-packages/detectron2/engine/defaults.py", line 484, in train
    super().train(self.start_iter, self.max_iter)
  File "/cloud/maskformer/lib/python3.7/site-packages/detectron2/engine/train_loop.py", line 149, in train
    self.run_step()
  File "/cloud/maskformer/lib/python3.7/site-packages/detectron2/engine/defaults.py", line 494, in run_step
    self._trainer.run_step()
  File "/cloud/maskformer/lib/python3.7/site-packages/detectron2/engine/train_loop.py", line 285, in run_step
    losses.backward()
  File "/cloud/maskformer/lib/python3.7/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/cloud/maskformer/lib/python3.7/site-packages/torch/autograd/__init__.py", line 147, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution