Error during training for custom dataset

bao18 commented 5 years ago

When I tried to train the model with the command below, an error occurred. It seems to be a problem with the GPUs (four GPUs).

the command I run:

python train.py --gpus 0,1,2,3 --cfg $cfg

Error:

[2019-10-06 08:56:13,423 INFO train.py line 246 3390] Outputing checkpoints to: ckpt/test-resnet50dilated-ppm_deepsup
# samples: 7296
1 Epoch = 5000 iters
Traceback (most recent call last):
  File "train.py", line 273, in <module>
    main(cfg, gpus)
  File "train.py", line 200, in main
    train(segmentation_module, iterator_train, optimizers, history, epoch+1, cfg)
  File "train.py", line 32, in train
    batch_data = next(iterator)
  File "/home/bruno/apps/intelpython3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 637, in __next__
    return self._process_next_batch(batch)
  File "/home/bruno/apps/intelpython3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 658, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
AssertionError: Traceback (most recent call last):
  File "/home/bruno/apps/intelpython3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/bruno/apps/intelpython3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/bruno/xView2/semantic-segmentation-pytorch/dataset.py", line 162, in __getitem__
    assert(segm.mode == "L")
AssertionError

hangzhaomit commented 5 years ago

Your label image should be a single channel image, instead of 3-channel.
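
A minimal sketch of that fix, assuming the mask was saved as RGB (or RGBA) with the same class id in every channel, so that any one channel already holds the labels. It uses Pillow; `to_single_channel` is a hypothetical helper name, not something from the repository. A color-coded mask would need an explicit color-to-id mapping instead.

```python
from PIL import Image


def to_single_channel(mask: Image.Image) -> Image.Image:
    """Return a mode "L" version of a label mask.

    Assumption: for multi-channel masks, all channels hold the same
    class ids, so the first channel is enough.
    """
    if mask.mode == "L":
        return mask
    # getchannel(0) returns the first band as a single-channel "L" image.
    return mask.getchannel(0)


# In-memory example: a 4x3 RGB mask where every pixel has class id 2.
rgb_mask = Image.new("RGB", (4, 3), (2, 2, 2))
fixed = to_single_channel(rgb_mask)
print(fixed.mode)  # "L", which is what the assertion in dataset.py checks
```

In practice the result would be written back over the annotation file with `fixed.save(path)`, where `path` is whichever PNG failed the `segm.mode == "L"` assertion.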

bao18 commented 5 years ago

@hangzhaomit, thanks for your quick reply. Yes, I also figured it out in the end. Now I get the following error; this time it looks like a GPU problem:

Traceback (most recent call last):
  File "train.py", line 273, in <module>
    main(cfg, gpus)
  File "train.py", line 200, in main
    train(segmentation_module, iterator_train, optimizers, history, epoch+1, cfg)
  File "train.py", line 42, in train
    loss = loss.mean()
RuntimeError: CUDA error: device-side assert triggered
/opt/conda/conda-bld/pytorch_1549628766161/work/aten/src/THCUNN/SpatialClassNLLCriterion.cu:99: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [0,0,0], thread: [388,0,0] Assertion `t >= 0 && t < n_classes` failed.

This last part repeats many times:

/opt/conda/conda-bld/pytorch_1549628766161/work/aten/src/THCUNN/SpatialClassNLLCriterion.cu:99: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [0,0,0], thread: [388,0,0] Assertion `t >= 0 && t < n_classes` failed.

hangzhaomit commented 5 years ago

In the default setup, label=0 is ignored. So if you have two classes, please set them as 1 and 2.
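
A minimal sketch of that relabeling, assuming the masks currently use 0-based class ids (0 and 1 for two classes) and that every pixel belongs to a real class. `shift_labels`, `labels_in_range`, and the `num_class` parameter are hypothetical names for illustration, not functions from the repository.

```python
import numpy as np


def shift_labels(mask: np.ndarray, offset: int = 1) -> np.ndarray:
    """Shift 0-based class ids up so that 0 is left free for the ignored label."""
    shifted = mask.astype(np.int64) + offset
    if shifted.max() > 255:
        raise ValueError("shifted labels no longer fit in an 8-bit mask")
    return shifted.astype(np.uint8)


def labels_in_range(mask: np.ndarray, num_class: int) -> bool:
    """True if every label is 0 (ignored) or one of 1..num_class."""
    return bool(mask.min() >= 0 and mask.max() <= num_class)


# Two-class mask stored with labels 0 and 1.
mask = np.array([[0, 1], [1, 0]], dtype=np.uint8)
fixed = shift_labels(mask)  # labels become 1 and 2
print(labels_in_range(fixed, num_class=2))
```

Running a check like `labels_in_range` over all annotation files before training is one way to catch out-of-range labels on the CPU, rather than through the device-side assert shown above.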

bao18 commented 5 years ago

Thanks a lot! The code is running now.

jeewa985 commented 5 years ago

@bao18 Jeewa here. Were you able to train and validate the model on your own dataset successfully? I encountered a dimension mismatch error during the validation phase. I checked the shapes of img, seg_color and pred_color and found them to be as follows: img shape: (512, 512, 3), seg_color shape: (1, 512, 3), pred_color shape: (1, 512, 3).

I have only amended the odgt files and config files according to my own dataset, along with the GPU configuration. Could you please comment on where I made the mistake? I would highly appreciate it if you could let me know of any other changes I should make for a custom dataset.

bao18 commented 5 years ago

@jeewa985 Yes, I was able to train and validate on my own dataset. Basically, I was doing two things wrong: 1) I was saving the mask images (data/../annotations/training/***.png) as 3-channel images. These files should be 1-channel images. 2) In the same images, I was using 0 and 1 as labels. Since 0 is ignored, the labels should be 1 and 2 for a two-class problem. I hope these tips help you run the model. Best.

jeewa985 commented 5 years ago

@bao18 Thank you very much for your quick response. In fact, I did run the model on my custom dataset, but I encountered a problem during the evaluation phase, as follows:

ValueError: all the input array dimensions except for the concatenation axis must match exactly

I checked the shapes of img, seg_color and pred_color and found that they do not match each other.

I would like to know whether you made any changes to the model when you used it on your own dataset.

My dataset has 3 classes (assigned indexes 1, 2, 3), with 100 images for training and 20 images for evaluation.

Looking forward to hearing from you. Best regards
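
The ValueError quoted above is the message NumPy raises when arrays passed to `np.concatenate` differ on an axis other than the one being joined. The sketch below reproduces the shapes reported in this thread and flags the arrays that do not fit. It assumes the three images are joined side by side (`axis=1`), which is an assumption about the visualization code; `mismatched_shapes` is a hypothetical helper, and the sketch only locates the mismatch rather than explaining its cause.

```python
import numpy as np


def mismatched_shapes(arrays, axis=1):
    """Return names of arrays that cannot be concatenated with the first along `axis`."""

    def other_dims(shape):
        # Every dimension except the concatenation axis must match.
        return shape[:axis] + shape[axis + 1:]

    names = list(arrays)
    ref = arrays[names[0]]
    return [
        name
        for name in names[1:]
        if arrays[name].ndim != ref.ndim
        or other_dims(arrays[name].shape) != other_dims(ref.shape)
    ]


# Shapes as reported in the thread.
panels = {
    "img": np.zeros((512, 512, 3), dtype=np.uint8),
    "seg_color": np.zeros((1, 512, 3), dtype=np.uint8),
    "pred_color": np.zeros((1, 512, 3), dtype=np.uint8),
}
bad = mismatched_shapes(panels)
if bad:
    print("shape mismatch in:", bad)
else:
    im_vis = np.concatenate(list(panels.values()), axis=1)
```

Here `seg_color` and `pred_color` have height 1 instead of 512, so the place to look is wherever the label and prediction arrays are produced, before they are colorized.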

DecentMakeover commented 5 years ago

@bao18 Out of curiosity, was your 0 label background, or a class for a specific object?

mdt48 commented 4 years ago

@bao18 how did you change the config options to solve this problem?

CSAILVision / semantic-segmentation-pytorch

Error during training for custom dataset #197