VITA-Group / FasterSeg

[ICLR 2020] "FasterSeg: Searching for Faster Real-time Semantic Segmentation" by Wuyang Chen, Xinyu Gong, Xianming Liu, Qian Zhang, Yuan Li, Zhangyang Wang
MIT License

RuntimeError: merge_sort: failed to synchronize: device-side assert triggered #45

Closed Muaz65 closed 3 years ago

Muaz65 commented 3 years ago

I am trying to train FasterSeg on a custom dataset with six classes. I have formatted the annotations and written a dataset class modeled on cityscapes.py. I am running into an issue while pretraining the supernet (Section 1.1 in readMe.md).

CUDA version: 10.2
torchvision: 0.3.0
torch: 1.1.0

Traceback (most recent call last):
  File "train_search.py", line 304, in <module>
    main(pretrain=config.pretrain)
  File "train_search.py", line 133, in main
    train(pretrain, train_loader_model, train_loader_arch, model, architect, ohem_criterion, optimizer, lr_policy, logger, epoch, update_arch=update_arch)
  File "train_search.py", line 243, in train
    loss = model._loss(imgs, target, pretrain)
  File "/home/soccer/Desktop/Muaz/FasterSeg/search/model_search.py", line 491, in _loss
    loss = loss + sum(self._criterion(logit, target) for logit in logits)
  File "/home/soccer/Desktop/Muaz/FasterSeg/search/model_search.py", line 491, in <genexpr>
    loss = loss + sum(self._criterion(logit, target) for logit in logits)
  File "/home/soccer/anaconda3/envs/pipeline_cloned/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/soccer/Desktop/Muaz/FasterSeg/tools/seg_opr/loss_opr.py", line 81, in forward
    index = mask_prob.argsort()
RuntimeError: merge_sort: failed to synchronize: device-side assert triggered

Before this error, I get a number of CUDA errors, but they don't crash the code.

NOTE: I reproduced the experiment on the Cityscapes dataset and I am still encountering the same issue.

Gaussianer commented 3 years ago

Did you create your training images with createTrainIdLabelImgs.py? I got the same error. Once I set json2labelImg( f , dst , "trainIds" ) there, it works and the error is gone.
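For a custom dataset without a Cityscapes-style conversion script, the same idea (turning raw annotation IDs into contiguous training IDs) can be sketched with a lookup table. This is only an illustration, not the cityscapesScripts code: `remap_labels`, the example `id_to_train_id` mapping, and the ignore value of 255 are all assumptions to adapt to your own data.

```python
import numpy as np

def remap_labels(label, id_to_train_id, ignore_label=255):
    """Map raw annotation IDs to contiguous training IDs (hypothetical helper).

    Any raw ID that is not listed in id_to_train_id becomes ignore_label.
    Assumes the raw IDs fit in 8 bits (0..255).
    """
    # Start with every raw ID mapped to the ignore label.
    lut = np.full(256, ignore_label, dtype=np.uint8)
    for raw_id, train_id in id_to_train_id.items():
        lut[raw_id] = train_id
    # Index the lookup table with the label image to remap every pixel.
    return lut[np.asarray(label, dtype=np.uint8)]

# Hypothetical mapping: three raw IDs become training IDs 0, 1, 2.
id_to_train_id = {7: 0, 8: 1, 11: 2}
raw = np.array([[7, 8], [11, 0]], dtype=np.uint8)
print(remap_labels(raw, id_to_train_id).tolist())  # [[0, 1], [2, 255]]
```

Using an ignore value for unmapped IDs only helps if the loss is configured to skip that same value.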

chenwydj commented 3 years ago

Hi @Muaz65,

Thank you for your interest in our work!

When I faced this error previously, it was usually due to a mismatch between the model's output dimension (the number of classes) and the range of integers in the ground truth files.

Therefore, in your case, you could check 1) whether you set your model's output dimension to six, and 2) whether your ground truth files contain only integers from 0 to 5.
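The second check can be run on the label arrays before training. Below is a minimal sketch, assuming labels are loaded as NumPy integer arrays and that an ignore value (255 here, an assumption) is allowed in addition to the class IDs. `find_invalid_labels` is a hypothetical helper, not part of FasterSeg.

```python
import numpy as np

def find_invalid_labels(label, num_classes, ignore_label=255):
    """Return the sorted label values that are neither a class ID nor ignore_label."""
    values = np.unique(np.asarray(label))        # unique values, sorted ascending
    valid = (values >= 0) & (values < num_classes)
    ignored = values == ignore_label
    return [int(v) for v in values[~(valid | ignored)]]

# Toy label maps for a six-class problem (class IDs 0..5).
good = np.array([[0, 1, 5], [255, 3, 2]], dtype=np.uint8)
bad = np.array([[0, 6, 5], [255, 33, 2]], dtype=np.uint8)

print(find_invalid_labels(good, num_classes=6))  # []
print(find_invalid_labels(bad, num_classes=6))   # [6, 33]
```

Any value reported here would index past the model's class dimension inside the loss, which is one common way to hit a device-side assert.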

Hope that helps.

Muaz65 commented 3 years ago

I think the range of integers in the ground truth files is the issue here. I'll confirm it and let you know. Thank you!