RuntimeError: merge_sort: failed to synchronize: device-side assert triggered

Muaz65 commented 3 years ago

I am trying to train FasterSeg on a custom dataset with six classes. I have formatted the annotations and written a dataset class modeled on cityscapes.py. I am running into an issue while pretraining the supernet (Section 1.1 in readMe.md).

CUDA Version: 10.2
torchvision: 0.3.0
torch: 1.1.0

Traceback (most recent call last):
  File "train_search.py", line 304, in <module>
    main(pretrain=config.pretrain)
  File "train_search.py", line 133, in main
    train(pretrain, train_loader_model, train_loader_arch, model, architect, ohem_criterion, optimizer, lr_policy, logger, epoch, update_arch=update_arch)
  File "train_search.py", line 243, in train
    loss = model._loss(imgs, target, pretrain)
  File "/home/soccer/Desktop/Muaz/FasterSeg/search/model_search.py", line 491, in _loss
    loss = loss + sum(self._criterion(logit, target) for logit in logits)
  File "/home/soccer/Desktop/Muaz/FasterSeg/search/model_search.py", line 491, in <genexpr>
    loss = loss + sum(self._criterion(logit, target) for logit in logits)
  File "/home/soccer/anaconda3/envs/pipeline_cloned/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/soccer/Desktop/Muaz/FasterSeg/tools/seg_opr/loss_opr.py", line 81, in forward
    index = mask_prob.argsort()
RuntimeError: merge_sort: failed to synchronize: device-side assert triggered

Before this error, I get a number of CUDA assertion failures, but they do not crash the code:

/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda ->auto::operator()(int)->auto: block: [2,0,0], thread: [127,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda ->auto::operator()(int)->auto: block: [2,0,0], thread: [61,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.

NOTE: I recreated the experiment on the Cityscapes dataset and I am still encountering the same issue.

Gaussianer commented 3 years ago

Did you create your training images with createTrainIdLabelImgs.py? I got the same error. When I set json2labelImg( f , dst , "trainIds" ) there, it works and the error is gone.
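
To illustrate why the encoding matters, here is a minimal sketch of remapping raw Cityscapes label ids to train ids. The table is only a three-entry excerpt (road, sidewalk, building), and `to_train_ids` and `IGNORE_LABEL` are hypothetical names used for illustration; this is not the cityscapesScripts implementation.

```python
# Partial excerpt of the Cityscapes id -> trainId table (road, sidewalk, building).
ID_TO_TRAIN_ID = {7: 0, 8: 1, 11: 2}

# Assumption: 255 marks pixels that should be ignored by the loss.
IGNORE_LABEL = 255


def to_train_ids(pixel_ids):
    """Map raw label ids to train ids; anything not in the table becomes IGNORE_LABEL."""
    return [ID_TO_TRAIN_ID.get(i, IGNORE_LABEL) for i in pixel_ids]


raw = [7, 7, 8, 11, 0]     # raw ids as stored with the "ids" encoding
print(to_train_ids(raw))   # [0, 0, 1, 2, 255]
```

With the raw ids, a value such as 11 can exceed the number of classes the model predicts, which is the kind of out-of-range index the assertion above complains about.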

chenwydj commented 3 years ago

Hi @Muaz65,

Thank you for your interest in our work!

Previously, when I faced this error, it was usually due to a mismatch between the model's output dimension for classes and the range of integers in the ground truth files.

Therefore, in your case, you could check 1) whether you set your model's output dimension to six; 2) whether your ground truth files contain only integers from 0 to 5.
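
The second check can be sketched roughly as follows. `find_invalid_labels` is a hypothetical helper, and the ignore value of 255 is an assumption; adjust it to whatever your config uses. In practice you would pass in the unique pixel values of each ground truth image (for example from `numpy.unique(mask).tolist()`).

```python
def find_invalid_labels(values, num_classes, ignore_label=255):
    """Return the sorted label values that fall outside [0, num_classes).

    ignore_label (assumed to be 255 here) is treated as valid; pass None
    if the dataset has no ignore value.
    """
    valid = set(range(num_classes))
    if ignore_label is not None:
        valid.add(ignore_label)
    return sorted(set(int(v) for v in values) - valid)


# Hypothetical pixel values from one ground truth mask of a six-class dataset.
mask_values = [0, 1, 5, 5, 7, 33, 255]
print(find_invalid_labels(mask_values, num_classes=6))  # [7, 33]
```

An empty list for every file means the labels are consistent with a six-class output; any other result points to the files that need remapping.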

Hope that helps.

Muaz65 commented 3 years ago

I think the range of integers in the ground truth files is the issue here. I'll confirm it and let you know. Thank you!

VITA-Group / FasterSeg

RuntimeError: merge_sort: failed to synchronize: device-side assert triggered #45