Error when I train my own dataset

HyuanTan commented 7 years ago

Hi, thanks for sharing. I trained the model on the Cityscapes dataset and got the results shown in the paper. I want to train the model on my own dataset, but ran into some issues. When I train on 2 classes (including background) and set NUM_CLASSES=20 (the same as the original code), training works fine but the predicted result looks strange: [image: arriveroom_000002_000030_leftimg8bit]

When I set NUM_CLASSES=20 and change the constructor in iouEval.py to def __init__(self, nClasses, ignoreIndex=0) (because my background is 0), I get this in the encoder validation stage:

----- VALIDATING - EPOCH 1 -----
Traceback (most recent call last):
  File "main.py", line 545, in <module>
    main(parser.parse_args())
  File "main.py", line 499, in main
    model = train(args, model, True) #Train encoder
  File "main.py", line 334, in train
    iouEvalVal.addBatch(outputs.max(1)[1].unsqueeze(1).data, targets.data)
  File "/media/holly/Code/Segmentation/ERFNet/erfnet_pytorch/train/iouEval.py", line 41, in addBatch
    x_onehot = x_onehot[:, :self.ignoreIndex]
ValueError: result of slicing is an empty tensor

When I set NUM_CLASSES=2 and def __init__(self, nClasses, ignoreIndex=19), I get this in the decoder stage:

========== DECODER TRAINING ===========
/DataSet/DSHolly/DataAll/SegmentationLikeCityScapes_room/leftImg8bit/train
/DataSet/DSHolly/DataAll/SegmentationLikeCityScapes_room/leftImg8bit/val
<class 'criterion.CrossEntropyLoss2d'>
----- TRAINING - EPOCH 1 -----
LEARNING RATE: 0.0005
THCudaCheck FAIL file=/pytorch/torch/lib/THCUNN/generic/Threshold.cu line=66 error=59 : device-side assert triggered
THCudaCheck FAIL file=/pytorch/torch/lib/THCUNN/generic/Threshold.cu line=66 error=59 : device-side assert triggered
Traceback (most recent call last):
  File "main.py", line 541, in <module>
    main(parser.parse_args())
  File "main.py", line 514, in main
    model = train(args, model, False) #Train decoder
  File "main.py", line 260, in train
    loss.backward()
  File "/media/holly/Code/.pyenv/versions/Python3.6.3ERFNet/lib/python3.6/site-packages/torch/autograd/variable.py", line 156, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
  File "/media/holly/Code/.pyenv/versions/Python3.6.3ERFNet/lib/python3.6/site-packages/torch/autograd/__init__.py", line 98, in backward
    variables, grad_variables, retain_graph)
  File "/media/holly/Code/.pyenv/versions/Python3.6.3ERFNet/lib/python3.6/site-packages/torch/autograd/function.py", line 91, in apply
    return self._forward_cls.backward(self, *args)
  File "/media/holly/Code/.pyenv/versions/Python3.6.3ERFNet/lib/python3.6/site-packages/torch/nn/_functions/thnn/auto.py", line 187, in backward
    return (backward_cls.apply(input, grad_output, ctx.additional_args, ctx._backend, ctx.buffers, *tensor_params) +
  File "/media/holly/Code/.pyenv/versions/Python3.6.3ERFNet/lib/python3.6/site-packages/torch/nn/_functions/thnn/auto.py", line 219, in backward_cls_forward
    update_grad_input_fn(ctx._backend.library_state, input, grad_output, grad_input, *gi_args)
RuntimeError: cuda runtime error (59) : device-side assert triggered at /pytorch/torch/lib/THCUNN/generic/Threshold.cu:66

Are there any tips for training the model on my own dataset? Thanks!!

Eromera commented 7 years ago

Hi! I think the problem is that the IoU estimation code assumes the ignored class is always the LAST one. That is: if you have 2 classes (background and something else), then something else should be label "0" and ignore should be label "1". This is the problem you are hitting in the line: x_onehot = x_onehot[:, :self.ignoreIndex] It assumes that the non-ignored classes come first and the ignore class comes last, so with ignoreIndex=0 the slice keeps no classes at all. If you want to train without an ignore class, you should be fine passing the argument "ignoreIndex=-1" when creating the iouEval object: iouEvalVal = iouEval(NUM_CLASSES, ignoreIndex=-1)

But if you want to train with an ignore class, then you should move that label to the last position when relabelling the loaded labels. With only 2 classes, though, I think it makes more sense to learn both classes (even if one of them is background) without ignoring either, rather than ignoring background and using only the other class in backpropagation.

So you should be fine using NUM_CLASSES=2 and "ignoreIndex=-1", I think. Can you confirm that this works for you?
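To make the slicing issue concrete, here is a minimal sketch using plain Python lists in place of tensors. The helper keep_non_ignored_channels is hypothetical and is not the code in iouEval.py; it only mimics the x_onehot[:, :self.ignoreIndex] slice for one pixel, and the special handling of -1 as "no ignore class" is an assumption based on the suggestion above.

```python
# Hypothetical stand-in for the class-channel slice discussed above.
# Plain lists replace tensors; this is NOT the repository's iouEval code.

def keep_non_ignored_channels(onehot_pixel, ignore_index):
    """Mimic x_onehot[:, :ignoreIndex] for one pixel's one-hot vector.

    Assumption: ignore_index == -1 means "no ignore class", so every
    channel is kept.
    """
    if ignore_index == -1:
        return list(onehot_pixel)
    return list(onehot_pixel[:ignore_index])


# Two real classes plus an ignore class stored last (index 2).
pixel = [0, 1, 0]
kept_when_ignore_last = keep_non_ignored_channels(pixel, ignore_index=2)
kept_when_ignore_first = keep_non_ignored_channels(pixel, ignore_index=0)
kept_without_ignore = keep_non_ignored_channels([0, 1], ignore_index=-1)

print(kept_when_ignore_last)   # both real-class channels survive
print(kept_when_ignore_first)  # nothing survives, analogous to the empty-tensor error
print(kept_without_ignore)     # no ignore class, so nothing is dropped
```

With the ignore class last, the slice keeps exactly the real classes; with ignore_index=0 it keeps none, which matches the "result of slicing is an empty tensor" error in the first traceback.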

HyuanTan commented 7 years ago

Hi, thanks for your response. I did as you suggested, using NUM_CLASSES=2 and "ignoreIndex=-1", and set the background to 1 and something else to 0. During decoder training, I got something like this: [image: erfnet-2]

Before loss.backward():

[image: before]

After loss.backward():

[image: erfnet]

One more question: I found that you use target = ImageOps.expand(target, border=(transX,transY,0,0), fill=255) #pad label filling with 255, but I still found 19 in targets: [image: erfnet_3]

Thanks!

Eromera commented 7 years ago

Thanks,

The first error in the loss is probably because the labels do not contain only classes 0 and 1 as expected, so I think it is related to the same problem you point out below.
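One way to confirm this kind of mismatch is to list the label values that fall outside the expected class range before starting training. The sketch below is a minimal illustration with a hypothetical helper, find_unexpected_labels, working on nested lists; it is not part of the repository, and the sample label map is made up.

```python
def find_unexpected_labels(label_rows, num_classes):
    """Return the sorted label values that fall outside 0..num_classes-1."""
    valid = set(range(num_classes))
    seen = {value for row in label_rows for value in row}
    return sorted(seen - valid)


# Made-up 2x3 label map: 19 and 255 stand in for leftover ignore/padding values.
labels = [[0, 1, 19], [1, 255, 0]]
unexpected = find_unexpected_labels(labels, num_classes=2)
print(unexpected)
```

Any value reported here would be out of range for a loss configured with only NUM_CLASSES outputs, which is a common cause of the "device-side assert triggered" error on the GPU.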

The code was only tested with Cityscapes. I tried to keep it general with some modifications (such as allowing other datasets), but it is always difficult to do extensive debugging with other datasets. So the problem here is that the code was prepared for Cityscapes and some lines contain hardcoded assumptions, for example that you will use 255 or 19 as the ignore label.

You are seeing 19s in the targets because MyCoTransform in main.py, which is used when loading the data, contains this line: target = Relabel(255, 19)(target)

You should either remove this line (if you don't use an ignore label) or change it to your background class. If you remove it, then you should change the fill=255 in the translation operation to your background class.
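The effect of that relabelling step can be sketched as follows, assuming that Relabel(old, new) replaces every occurrence of the old label value with the new one. The function relabel below is a hypothetical list-based stand-in, not the repository's transform, and the choice of 1 as the background class follows the setup described earlier in this thread.

```python
def relabel(label_rows, old_label, new_label):
    """Replace every occurrence of old_label with new_label (assumed Relabel behavior)."""
    return [[new_label if value == old_label else value for value in row]
            for row in label_rows]


# Made-up label map whose top row and left column were padded with 255.
padded = [[255, 255, 255], [255, 0, 1]]

cityscapes_style = relabel(padded, 255, 19)  # hardcoded Cityscapes-style ignore label
two_class_style = relabel(padded, 255, 1)    # padding mapped to background (class 1 here)

print(cityscapes_style)
print(two_class_style)
```

With the Cityscapes-style mapping the padded pixels become 19, which is why 19 shows up in the targets; mapping the padding to the background class keeps every value inside the two-class range.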

I will try to change the code to be more general as soon as I have time!

HyuanTan commented 7 years ago

OK, I got it.

Thanks for your response!!

qixuxiang commented 6 years ago

Hi, HyuanTan! Recently I have been doing research on real-time semantic segmentation and I have to train the model on my own dataset. I would really appreciate it if you could share code for training on a custom dataset! My email address is qixuxiang@126.com. Thanks a lot and happy New Year! @HyuanTan @Eromera

HyuanTan commented 6 years ago

@qixuxiang Hi, the training code is already in main.py; what you need to do is convert your own dataset into Cityscapes format. Some useful APIs can be found in cityscapesScripts.
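As a rough illustration of what "Cityscapes format" means for file layout, the sketch below builds the image and label paths for one frame following the Cityscapes naming scheme (leftImg8bit/<split>/<city>/... and gtFine/<split>/<city>/...). The helper cityscapes_style_paths and the root folder "my_dataset" are hypothetical, and the use of the _gtFine_labelTrainIds.png suffix (rather than _gtFine_labelIds.png) is an assumption; check the repository's data loader for the suffix it actually expects.

```python
from pathlib import PurePosixPath


def cityscapes_style_paths(root, split, city, sequence, frame):
    """Build (image_path, label_path) following the Cityscapes naming scheme.

    Assumption: labels use the "_gtFine_labelTrainIds.png" suffix.
    """
    stem = f"{city}_{sequence:06d}_{frame:06d}"
    base = PurePosixPath(root)
    image = base / "leftImg8bit" / split / city / f"{stem}_leftImg8bit.png"
    label = base / "gtFine" / split / city / f"{stem}_gtFine_labelTrainIds.png"
    return image, label


# "my_dataset" is a placeholder root; "arriveroom" plays the role of the city folder.
image_path, label_path = cityscapes_style_paths("my_dataset", "train", "arriveroom", 2, 30)
print(image_path)
print(label_path)
```

Keeping image and label names aligned this way lets a Cityscapes-oriented loader pair each image with its annotation by swapping the folder and suffix.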
