CoinCheung / BiSeNet

Add bisenetv2. My implementation of BiSeNet
MIT License

COCOStuff RuntimeError: CUDA error: an illegal memory access was encountered #248

Closed: wsy588 closed this issue 2 years ago

wsy588 commented 2 years ago

Hi, thanks for your great work. When I train BiSeNetv2 on COCOStuff there is no problem, but when I change the number of categories from 171 to 10, I get RuntimeError: CUDA error: an illegal memory access was encountered. The traceback is as follows:

Traceback (most recent call last):
  File "tools/train_amp.py", line 205, in <module>
    main()
  File "tools/train_amp.py", line 201, in main
    train()
  File "tools/train_amp.py", line 158, in train
    loss_pre = criteria_pre(logits, lb)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/BiSeNet/./lib/ohem_ce_loss.py", line 38, in forward
    loss_hard = loss[loss > self.thresh]
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::Error'
  what():  NCCL error in: /opt/conda/conda-bld/pytorch_1646755903507/work/torch/csrc/distributed/c10d/NCCLUtils.hpp:172, unhandled cuda error, NCCL version 21.0.3
Process Group destroyed on rank 0
Exception raised from ncclCommAbort at /opt/conda/conda-bld/pytorch_1646755903507/work/torch/csrc/distributed/c10d/NCCLUtils.hpp:172 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7f4b63a891bd in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x68 (0x7f4b63a85838 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0xb6528e (0x7f4ba193b28e in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #3: c10d::ProcessGroupNCCL::~ProcessGroupNCCL() + 0x123 (0x7f4ba191cd53 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #4: c10d::ProcessGroupNCCL::~ProcessGroupNCCL() + 0x9 (0x7f4ba191cfd9 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #5: <unknown function> + 0x7e4916 (0x7f4be6fc8916 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x7ca433 (0x7f4be6fae433 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x1e32c6 (0x7f4be69c72c6 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x1e488e (0x7f4be69c888e in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x10d098 (0x55b182896098 in /opt/conda/bin/python)
frame #10: <unknown function> + 0x10fbcc (0x55b182898bcc in /opt/conda/bin/python)
frame #11: PyDict_Clear + 0x14b (0x55b182899f6b in /opt/conda/bin/python)
frame #12: <unknown function> + 0x110ff9 (0x55b182899ff9 in /opt/conda/bin/python)
frame #13: <unknown function> + 0x130246 (0x55b1828b9246 in /opt/conda/bin/python)
frame #14: _PyGC_CollectNoFail + 0x2a (0x55b1829c3a2a in /opt/conda/bin/python)
frame #15: PyImport_Cleanup + 0x2ce (0x55b182974e4e in /opt/conda/bin/python)
frame #16: Py_FinalizeEx + 0x79 (0x55b1829db4f9 in /opt/conda/bin/python)
frame #17: Py_RunMain + 0x1bc (0x55b1829de87c in /opt/conda/bin/python)
frame #18: Py_BytesMain + 0x39 (0x55b1829dec69 in /opt/conda/bin/python)
frame #19: __libc_start_main + 0xe7 (0x7f4c22426c87 in /lib/x86_64-linux-gnu/libc.so.6)
frame #20: <unknown function> + 0x1f7427 (0x55b182980427 in /opt/conda/bin/python)

I changed n_cats in config/bisenetv2_coco.py as follows:

cfg = dict(
    model_type='bisenetv2',
    n_cats=10,
    num_aux_heads=4,
    lr_start=5e-3,
    weight_decay=1e-4,
    warmup_iters=1000,
    max_iter=180000,
    dataset='CocoStuff',
    im_root='./datasets/coco',
    train_im_anns='./datasets/coco/train.txt',
    val_im_anns='./datasets/coco/val.txt',
    scales=[0.75, 2.],
    cropsize=[480, 480],
    eval_crop=[480, 480],
    eval_scales=[0.5, 0.75, 1, 1.25, 1.5, 1.75],
    ims_per_gpu=2,
    eval_ims_per_gpu=1,
    use_fp16=True,
    use_sync_bn=True,
    respth='./res',
)

And I changed self.n_cats and remain in lib/data/coco.py as follows:

class CocoStuff(BaseDataset):

    def __init__(self, dataroot, annpath, trans_func=None, mode='train'):
        super(CocoStuff, self).__init__(
                dataroot, annpath, trans_func, mode)
        self.n_cats = 10 # 91 stuff, 91 thing, 11 of thing have no annos
        self.lb_ignore = 255

        ## label mapping, remove non-existing labels
        # missing = [11, 25, 28, 29, 44, 65, 67, 68, 70, 82, 90]
        # remain = [ind for ind in range(182) if not ind in missing]
        remain = [ind for ind in range(10)]
        self.lb_map = np.arange(256)
        for ind in remain:
            self.lb_map[ind] = remain.index(ind)
            print(self.lb_map[ind])

        self.to_tensor = T.ToTensor(
            mean=(0.46962251, 0.4464104,  0.40718787), # coco, rgb
            std=(0.27469736, 0.27012361, 0.28515933),
        )
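
For reference, here is a quick sanity check (just a sketch, not part of the repo; lbpth is a hypothetical path to one label png) that the remapped label values stay inside [0, n_cats) or equal lb_ignore, since out-of-range targets are a common cause of this kind of CUDA error:

import cv2
import numpy as np

def check_label_range(lbpth, lb_map, n_cats=10, lb_ignore=255):
    # load one label image and apply the same remapping as the dataset
    lb = cv2.imread(lbpth, cv2.IMREAD_GRAYSCALE)
    lb = lb_map[lb]
    # every value must be a valid class id or the ignore index,
    # otherwise the CUDA cross-entropy kernel indexes out of bounds
    bad = np.unique(lb[(lb >= n_cats) & (lb != lb_ignore)])
    if bad.size > 0:
        print(lbpth, 'has out-of-range label values:', bad.tolist())
    return bad.size == 0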

My Docker environment: Ubuntu 18.04, RTX 3060, driver version 510.73.05, PyTorch 1.11.0, CUDA 11.3, cuDNN 8, Python 3.8.

Thanks a lot if anyone can help me.

CoinCheung commented 2 years ago

Are you still using the coco dataset, or your own dataset with 10 categories?

wsy588 commented 2 years ago

Are you still using the coco dataset, or your own dataset with 10 categories?

I still use the coco dataset. There are some categories in the coco dataset that I don't need, and 171 categories are hard for me to train.

CoinCheung commented 2 years ago

So your label files contain categories from the full 171-class set, but your model only has 10 categories, which is the cause of your problem. You can change the label values of the categories you do not care about to 255, which will be ignored.
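
In case a concrete example helps, one way that remapping could look (only a sketch; keep_ids below are placeholder ids, pick whichever 10 coco-stuff categories you actually need):

import numpy as np

# the 10 original coco-stuff ids to keep, in the order they should
# become training classes 0..9 (placeholder ids, for illustration only)
keep_ids = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
lb_ignore = 255

# start with everything ignored, then map only the kept ids to 0..9
lb_map = np.full(256, lb_ignore, dtype=np.uint8)
for new_id, old_id in enumerate(keep_ids):
    lb_map[old_id] = new_id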

wsy588 commented 2 years ago

So your label files contain categories from the full 171-class set, but your model only has 10 categories, which is the cause of your problem. You can change the label values of the categories you do not care about to 255, which will be ignored.

Thank you for your reply to my question. Should I change coco.py like this?

class CocoStuff(BaseDataset):

    def __init__(self, dataroot, annpath, trans_func=None, mode='train'):
        super(CocoStuff, self).__init__(
                dataroot, annpath, trans_func, mode)
        self.n_cats = 10 # 91 stuff, 91 thing, 11 of thing have no annos
        self.lb_ignore = 255

        ## label mapping, remove non-existing labels
        missing = [11, 25, 28, 29, 44, 65, 67, 68, 70, 82, 90]
        remain = [ind for ind in range(182) if not ind in missing]
        self.lb_map = np.arange(256)
        for ind in remain:
            if ind > 9:
                self.lb_map[ind] = 255
            else:
                self.lb_map[ind] = remain.index(ind)

        self.to_tensor = T.ToTensor(
            mean=(0.46962251, 0.4464104,  0.40718787), # coco, rgb
            std=(0.27469736, 0.27012361, 0.28515933),
        )

But when I train BiSeNetv2 on coco with this change, the loss is NaN, as shown below:

iter: 100/180000, lr: 0.003454, eta: 3:42:53, time: 7.51, loss: nan, loss_pre: nan, loss_aux0: nan, loss_aux1: nan, loss_aux2: nan, loss_aux3: nan
iter: 200/180000, lr: 0.004348, eta: 3:30:10, time: 6.59, loss: nan, loss_pre: nan, loss_aux0: nan, loss_aux1: nan, loss_aux2: nan, loss_aux3: nan
iter: 300/180000, lr: 0.005474, eta: 3:25:49, time: 6.59, loss: nan, loss_pre: nan, loss_aux0: nan, loss_aux1: nan, loss_aux2: nan, loss_aux3: nan
iter: 400/180000, lr: 0.006892, eta: 3:23:39, time: 6.60, loss: nan, loss_pre: nan, loss_aux0: nan, loss_aux1: nan, loss_aux2: nan, loss_aux3: nan
iter: 500/180000, lr: 0.008676, eta: 3:22:14, time: 6.59, loss: nan, loss_pre: nan, loss_aux0: nan, loss_aux1: nan, loss_aux2: nan, loss_aux3: nan
iter: 600/180000, lr: 0.010923, eta: 3:21:34, time: 6.65, loss: nan, loss_pre: nan, loss_aux0: nan, loss_aux1: nan, loss_aux2: nan, loss_aux3: nan
iter: 700/180000, lr: 0.013751, eta: 3:20:54, time: 6.61, loss: nan, loss_pre: nan, loss_aux0: nan, loss_aux1: nan, loss_aux2: nan, loss_aux3: nan
iter: 800/180000, lr: 0.017311, eta: 3:20:28, time: 6.63, loss: nan, loss_pre: nan, loss_aux0: nan, loss_aux1: nan, loss_aux2: nan, loss_aux3: nan
iter: 900/180000, lr: 0.021794, eta: 3:20:03, time: 6.62, loss: nan, loss_pre: nan, loss_aux0: nan, loss_aux1: nan, loss_aux2: nan, loss_aux3: nan
iter: 1000/180000, lr: 0.027437, eta: 3:19:47, time: 6.65, loss: nan, loss_pre: nan, loss_aux0: nan, loss_aux1: nan, loss_aux2: nan, loss_aux3: nan

CoinCheung commented 2 years ago

Maybe you have too many ignored labels. There are 171 categories, but you ignored 161 of them. If you do not really care about the category meanings, you can merge them rather than ignore them.
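
If merging is an option, here is a sketch of how the 171 remaining ids could be folded into 10 coarse groups instead of being ignored (the modulo grouping below is arbitrary and only illustrates the idea; a real grouping should follow category semantics):

import numpy as np

missing = [11, 25, 28, 29, 44, 65, 67, 68, 70, 82, 90]
remain = [ind for ind in range(182) if ind not in missing]  # the 171 valid ids
lb_ignore = 255
n_groups = 10

lb_map = np.full(256, lb_ignore, dtype=np.uint8)
for i, old_id in enumerate(remain):
    # fold all 171 categories into 10 coarse groups rather than ignoring them
    lb_map[old_id] = i % n_groups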

wsy588 commented 2 years ago

Maybe you have too many ignored labels. There are 171 categories, but you ignored 161 of them. If you do not really care about the category meanings, you can merge them rather than ignore them.

I increased the number of categories and the loss became normal. Thanks a lot!