NVIDIA / semantic-segmentation

Nvidia Semantic Segmentation monorepo
BSD 3-Clause "New" or "Revised" License

Centroid file causing issues with 2 GPUs but not with 1 GPU #127

Closed rod409 closed 3 years ago

rod409 commented 3 years ago

I am trying to train this model on the IDD dataset (https://idd.insaan.iiit.ac.in/). I wrote an idd.py and idd_labels.py to correspond to the cityscapes dataset files. With 1 GPU I can train the model, but with 2 GPUs the program fails after building the centroid file and before training begins. If I disable centroid-based uniform sampling by setting class_uniform_pct=0, training works with 2 GPUs.

What could be causing things to work on 1 GPU but not 2 with the centroid file? Thanks!

ajtao commented 3 years ago

It's probably a good idea to run the 1-GPU training first to generate the centroid file. The 2-GPU run will then simply re-use the existing centroid file.

If you could share the failure information (what error message?) that'd help.

rod409 commented 3 years ago

This is the output of trying to train on 2 GPUs after letting the centroid file be built with 1 GPU. There is a KeyError in this case. Previously, when trying to build the centroid file with 2 GPUs, the program would just stall before training with no error message.

mode train found 7034 images
cn num_classes 19
Loading centroid file /home/test/large_assets/uniform_centroids/idd_cv0_tile1024.json
Found 11 centroids
Class Uniform Percentage: 0.5
Class Uniform items per Epoch: 7034
cls 0 len 6797
cls 1 len 3230
Traceback (most recent call last):
  File "train.py", line 602, in <module>
    main()
  File "train.py", line 340, in main
    datasets.setup_loaders(args)
  File "/home/test/semantic-segmentation/datasets/__init__.py", line 182, in setup_loaders
    label_transform=target_train_transform)
  File "/home/test/semantic-segmentation/datasets/idd.py", line 155, in __init__
    self.build_epoch()
  File "/home/test/semantic-segmentation/datasets/base_loader.py", line 71, in build_epoch
    self.train)
  File "/home/test/semantic-segmentation/datasets/uniform.py", line 308, in build_epoch
    msg = "cls {} len {}".format(class_id, len(centroids[class_id]))
KeyError: 2

(The second GPU worker prints an identical traceback, then the launcher fails.)

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 253, in <module>
    main()
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 249, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-u', 'train.py', '--local_rank=1', '--dataset', 'idd', '--cv', '0', '--syncbn', '--apex', '--fp16', '--crop_size', '1080,1920', '--bs_trn', '1', '--poly_exp', '2', '--lr', '5e-3', '--max_epoch', '25', '--n_scales', '0.5,1.0,2.0', '--supervised_mscale_loss_wt', '0.05', '--snapshot', 'ASSETS_PATH/seg_weights/ocrnet.HRNet_industrious-chicken.pth', '--arch', 'ocrnet.HRNet_Mscale', '--class_uniform_pct', '0.5', '--result_dir', 'logs/train_idd2/ocrnet.HRNet_Mscale_manipulative-marten_2021.03.08_18.36']' returned non-zero exit status 1.

rod409 commented 3 years ago

This may be relevant information: during training it appears many classes are not captured. I checked the validation image labels and the formatting appears correct and matches cityscapes.

 Id  label                   iU_1.0     TP     FP     FN  Precision  Recall
---- ---------------------  -------  -----  -----  -----  ---------  ------
  0  road                     88.26  82.38   0.00   0.13       1.00    0.88
  1  sidewalk                 24.66   1.15   2.83   0.22       0.26    0.82
  2  building                   nan   0.00    nan    nan        nan     nan
  3  wall                       nan   0.00    nan    nan        nan     nan
  4  fence                      nan   0.00    nan    nan        nan     nan
  5  pole                       nan   0.00    nan    nan        nan     nan
  6  traffic light              nan   0.00    nan    nan        nan     nan
  7  traffic sign               nan   0.00    nan    nan        nan     nan
  8  vegetation                 nan   0.00    nan    nan        nan     nan
  9  non-drivable fallback      nan   0.00    nan    nan        nan     nan
 10  sky                        nan   0.00    nan    nan        nan     nan
 11  person                     nan   0.00    nan    nan        nan     nan
 12  rider                      nan   0.00    nan    nan        nan     nan
 13  car                        nan   0.00    nan    nan        nan     nan
 14  truck                      nan   0.00    nan    nan        nan     nan
 15  bus                        nan   0.00    nan    nan        nan     nan
 16  train                    38.28   5.11   1.58   0.04       0.39    0.96
 17  motorcycle                 nan   0.00    nan    nan        nan     nan
 18  bicycle                    nan   0.00    nan    nan        nan     nan
Mean                          50.40
-----------------------------------------------------------------------------------------------------------
this : [epoch 9], [val loss 1.32507], [acc 0.88645], [acc_cls 0.54950], [mean_iu 0.50398], [fwavacc 0.78879]
best : [epoch 6], [val loss 0.08727], [acc 0.97358], [acc_cls 0.90800], [mean_iu 0.84893], [fwavacc 0.95049]
-----------------------------------------------------------------------------------------------------------
Class Uniform Percentage: 0.5
Class Uniform items per Epoch: 7034
cls 0 len 6797
cls 1 len 3230
cls 2 len 0
cls 3 len 0
cls 4 len 0
cls 5 len 0
cls 6 len 0
cls 7 len 0
cls 8 len 0
cls 9 len 6586
cls 10 len 0
cls 11 len 185
cls 12 len 6874
cls 13 len 4573
cls 14 len 4594
cls 15 len 2848
cls 16 len 4964
cls 17 len 3426
cls 18 len 6895

ajtao commented 3 years ago

It seems problematic that your log shows 19 classes, yet the centroid data structure thinks there are 11 classes ("Found 11 centroids"). You've got to square that away. You must have defined the number of classes as 19 in one place and 11 somewhere else.

rod409 commented 3 years ago

I am setting the number of classes to 19 in the idd.py file; these labels match cityscapes. I am not setting 11 classes anywhere. I also tested with a different set of category labels, which has 26 total classes; in that case 20 centroids are found. I checked the ground-truth label images and confirmed that every class is actually present. Are there constraints on finding centroids, e.g. does a class have to appear a minimum number of times?

ajtao commented 3 years ago

I can see now that the centroids dict has no entry for classes 2, 3, 4, 5, 6, 7, 8, and 10, so the code crashes when it hits uniform.py:308.
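The failure mode can be sketched in isolation (a hypothetical minimal reproduction, not the repo's actual code): build_epoch iterates over every class id, but the centroids dict only has keys for classes the centroid builder actually found, so a missing class raises KeyError.

```python
# Hypothetical minimal reproduction of the KeyError at uniform.py:308.
num_classes = 19
# Suppose centroid building only found entries for classes 0 and 1:
centroids = {0: ["c"] * 6797, 1: ["c"] * 3230}  # classes 2..18 absent

def summarize(centroids, num_classes):
    msgs = []
    for class_id in range(num_classes):
        # centroids[class_id] raises KeyError: 2 on the first missing key;
        # .get(class_id, []) would report "len 0" instead of crashing.
        n = len(centroids.get(class_id, []))
        msgs.append("cls {} len {}".format(class_id, n))
    return msgs

print(summarize(centroids, num_classes)[:3])
```

Note that swapping in `.get` would only mask the symptom; the underlying problem (classes with no centroids at all) still needs fixing.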

The uniform sampling code will not work if you have a dataset with classes that don't have any pixels.

I would suggest changing the dataset definition to include only classes that truly exist.

rod409 commented 3 years ago

All the defined classes do exist and I confirmed there are pixels labeled for each class.

ajtao commented 3 years ago

If that's the case, then we shouldn't be seeing this message:

cls 2 len 0
cls 3 len 0
cls 4 len 0
...

I would rename the current centroid file for safekeeping, then run a single-GPU training job, which will rebuild the centroid file when it calls uniform.build_centroids() in the __init__() routine of your dataloader. Confirm that when it runs datasets.uniform.build_epoch() it prints non-zero lengths for all the classes.

rod409 commented 3 years ago

The output I showed above with length-0 class centroids is from 1 GPU. Both 1 and 2 GPUs show several classes with len 0.

ajtao commented 3 years ago

For whatever reason, the centroid building code doesn't believe any pixels of those classes exist. That's causing the code to break.

This is the code that decides whether a given file has pixels of a given class; you could instrument it to print some debug info: https://github.com/NVIDIA/semantic-segmentation/blob/main/datasets/uniform.py#L84-L135.
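A quick standalone cross-check for that debug output is to tally per-class pixel counts directly from the label masks. This is a sketch, not part of the repo; it assumes single-channel uint8 masks whose pixel values are class ids (loading from PNGs is shown in the comment).

```python
import numpy as np

def class_pixel_counts(masks, num_classes=19):
    """Total pixels per class id across label masks (debug helper).
    `masks` is an iterable of 2-D uint8 arrays of class ids, e.g.
    (np.array(Image.open(p)) for p in glob.glob('.../*.png')).
    """
    totals = np.zeros(256, dtype=np.int64)  # 256 covers ignore ids like 255
    for mask in masks:
        totals += np.bincount(np.asarray(mask).flatten(), minlength=256)
    return {c: int(totals[c]) for c in range(num_classes)}

# Any class whose count is 0 here is one the centroid builder cannot find.
```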

I'm wondering if somehow you have an id2trainid mapping. That translates the native label ids to the training class ids. Such a mapping is used with cityscapes, for example here: https://github.com/NVIDIA/semantic-segmentation/blob/main/datasets/cityscapes.py#L123.
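For reference, such a mapping is applied to the mask before any per-class bookkeeping. A sketch with made-up id values (the real table lives in the dataset's labels file, e.g. your idd_labels.py):

```python
import numpy as np

# Hypothetical id2trainid table in the style of cityscapes.py;
# the native ids and train ids below are made up for illustration.
id_to_trainid = {7: 0, 8: 1, 11: 2}   # native id -> training id

def map_to_trainids(mask, id_to_trainid, ignore_label=255):
    """Translate native label ids in `mask` to training class ids;
    anything without a mapping becomes ignore_label."""
    out = np.full_like(mask, ignore_label)
    for native_id, train_id in id_to_trainid.items():
        out[mask == native_id] = train_id
    return out
```

If the PNGs on disk already contain training ids, running them through this mapping a second time scrambles them, which is one way classes can "disappear".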

rod409 commented 3 years ago

Thanks for the info. I do have an id2trainid mapping; I will debug that code and see what I can find.

rod409 commented 3 years ago

I resolved the problem. I needed to change the label PNG files to use the native id for each label name rather than the training id. All centroids now build correctly. Thanks for the help!
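For anyone hitting the same thing: the PNGs on disk must contain native label ids, because the dataloader applies id2trainid itself. A hedged sketch of rewriting trainId-valued masks back to native ids (the mapping values are illustrative, not IDD's real table):

```python
import numpy as np

# Illustrative inverse mapping; substitute your dataset's real table.
trainid_to_id = {0: 7, 1: 8, 2: 11}   # training id -> native id

def trainids_to_ids(mask, trainid_to_id, fallback_id=0):
    """Rewrite a trainId-valued mask into native ids so the dataloader's
    own id2trainid step works as intended; unmapped values (e.g. the
    ignore label 255) fall back to fallback_id."""
    out = np.full_like(mask, fallback_id)
    for train_id, native_id in trainid_to_id.items():
        out[mask == train_id] = native_id
    return out
```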

hnsywangxin commented 2 years ago

@rod409 Hello, I have the same problem. Can you describe your solution in detail? Thanks. My label PNGs were generated by json_to_dataset.py.