VITA-Group / FasterSeg

[ICLR 2020] "FasterSeg: Searching for Faster Real-time Semantic Segmentation" by Wuyang Chen, Xinyu Gong, Xianming Liu, Qian Zhang, Yuan Li, Zhangyang Wang
MIT License

Custom Data Resolution for Training #67

Open DarkMythosIOTA opened 3 years ago

DarkMythosIOTA commented 3 years ago

Hello @chenwydj,

We have already asked how to train FasterSeg with custom data (see here). However, we still have a question regarding the image resolution and the adjustments it requires in the code. We have found several places that reference the image resolution or are at least correlated with it; see here, here, here, here, here, here, here, here, here, here and here.

Do all these values need to be adjusted to the resolution of the data set?
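For illustration, from the config files the resolution-related entries seem to be num_classes, down_sampling, image_height, image_width, gt_down_sampling, eval_height, eval_width and train_scale_array. Below is a minimal sketch of how these might be set in config_search.py for a hypothetical 1280x960 custom dataset; the values are placeholders and the comments reflect our reading of the field names, not anything confirmed by the authors:

from easydict import EasyDict as edict

C = edict()
# Hypothetical 1280x960 custom dataset (placeholder values, not a recommendation).
C.num_classes = 3            # number of classes in the custom dataset
C.down_sampling = 2          # images appear to be downsampled by this factor before training
C.image_height = 240         # training input height (Cityscapes default is 256)
C.image_width = 320          # training input width (Cityscapes default is 512)
C.gt_down_sampling = 8       # downsampling factor applied to the ground-truth maps
C.eval_height = 960          # evaluation seems to run at the full dataset resolution
C.eval_width = 1280
C.train_scale_array = [0.75, 1, 1.25]  # multi-scale augmentation factors

Whether all of these actually have to change, or only a subset, is exactly what we would like to know.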

Thank you for providing FasterSeg and the support from your side.

ogkdmr commented 3 years ago

I'm very interested in hearing what @chenwydj has to say about this.

i-am-nut commented 3 years ago

Hey @Gaussianer, did you manage to train FasterSeg with a custom dataset following the guidelines in #46?

DarkMythosIOTA commented 3 years ago

Hey @emersonjr, yes, we have provided a repo for this as well. Have a look here: https://github.com/Gaussianer/FasterSeg

DarkMythosIOTA commented 3 years ago

However, we cannot yet provide any information in the repo about how much of the code has to be adapted to the image resolution. We have trained several models, but we wonder whether the resolution settings need to be adjusted to improve the results.

i-am-nut commented 2 years ago

No worries @Gaussianer, thanks for replying here :) I'm also a master's student working with real-time image segmentation; in my case it's aimed at images containing sugar cane and weeds.

I have some questions you could probably help with, since you did custom training. I'm not sure this is the best place to ask, but anyway...

I basically want to train FasterSeg with a custom dataset as well, but my classes have nothing to do with any of the Cityscapes classes. My classes are Sugar Cane and Weeds (should I count Background toward the number of classes as well?). I'm encoding them in the ground-truth images (annotations) as follows: sugar cane pixels are [0,0,0], weed pixels are [1,1,1], and everything else (background) is [255,255,255]. Here's an example image (the image is 1024x2048 by mistake; I know I'll need to generate 2048x1024 instead):

[example annotation image]

What should I change in your repo's code to train with a dataset containing these images? Thanks in advance, mate!

DarkMythosIOTA commented 2 years ago

First of all, you have to create the dataset according to the description. For this, you have to generate the labelDefinitions.csv file according to the provided template; there you can also see the corresponding attributes for the background (unlabeled). Just try to work through our description. Some parts may not be fully documented yet, so if you run into problems, please contact me; then I can improve the documentation so that others can benefit from it as well.

i-am-nut commented 2 years ago

Thanks @Gaussianer. So, I've followed the description and also created my own labelDefinitions.csv. Here it is:

name,id,trainId,category,catId,hasInstances,ignoreInEval,color_r,color_g,color_b
unlabeled,0,255,void,0,False,False,0,0,0
sugar cane,1,0,void,0,False,False,100,50,15
weeds,2,1,void,0,False,False,247,103,0

I created it that way because my background (unlabeled) pixels in _labelTrainIds.png are [255,255,255], sugar cane pixels are [0,0,0], and weed pixels are [1,1,1].
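In case it helps, here is a minimal sketch of how such RGB-coded annotations could be converted into single-channel trainId masks matching the labelDefinitions.csv above. It assumes the data loader expects one-channel *_labelTrainIds.png files with the trainId as the pixel value; the file names are made up for illustration:

import numpy as np
from PIL import Image

def rgb_to_train_ids(rgb_label_path, out_path):
    # Convert an RGB-coded annotation into a single-channel trainId mask.
    rgb = np.array(Image.open(rgb_label_path).convert("RGB"))
    train_ids = np.full(rgb.shape[:2], 255, dtype=np.uint8)   # everything else -> unlabeled (255)
    train_ids[np.all(rgb == (0, 0, 0), axis=-1)] = 0          # sugar cane
    train_ids[np.all(rgb == (1, 1, 1), axis=-1)] = 1          # weeds
    Image.fromarray(train_ids, mode="L").save(out_path)

# Example usage (hypothetical file names):
rgb_to_train_ids("frame_0001_color.png", "frame_0001_labelTrainIds.png")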

I also edited config_search.py and config_train.py to set C.num_classes = 3 for my case. However, when I run CUDA_VISIBLE_DEVICES=0 python train_search.py, I get the error shown below:

root@5be7442709af:/home/FasterSeg/search# CUDA_VISIBLE_DEVICES=0 python train_search.py
use TensorRT for latency test
use TensorRT for latency test
Experiment dir : search-pretrain-256x512_F12.L16_batch3-20211202-161628
02 16:16:28 args = {'seed': 12345, 'repo_name': 'FasterSeg', 'abs_dir': '/home/FasterSeg/search', 'this_dir': 'search', 'root_dir': '/home/FasterSeg', 'dataset_path': '/home/FasterSeg/dataset', 'img_root_folder': '/home/FasterSeg/dataset', 'gt_root_folder': '/home/FasterSeg/dataset', 'train_source': '/home/FasterSeg/dataset/train_mapping_list.txt', 'eval_source': '/home/FasterSeg/dataset/val_mapping_list.txt', 'num_classes': 3, 'background': -1, 'image_mean': array([0.485, 0.456, 0.406]), 'image_std': array([0.229, 0.224, 0.225]), 'down_sampling': 2, 'image_height': 256, 'image_width': 512, 'gt_down_sampling': 8, 'num_train_imgs': 50, 'num_eval_imgs': 25, 'bn_momentum': 0.1, 'bn_eps': 1e-05, 'lr': 0.02, 'momentum': 0.9, 'weight_decay': 0.0005, 'num_workers': 4, 'train_scale_array': [0.75, 1, 1.25], 'eval_stride_rate': 0.8333333333333334, 'eval_scale_array': [1], 'eval_flip': False, 'eval_height': 1024, 'eval_width': 2048, 'grad_clip': 5, 'train_portion': 0.5, 'arch_learning_rate': 0.0003, 'arch_weight_decay': 0, 'layers': 16, 'branch': 2, 'pretrain': True, 'prun_modes': ['max', 'arch_ratio'], 'Fch': 12, 'width_mult_list': [0.3333333333333333, 0.5, 0.6666666666666666, 0.8333333333333334, 1.0], 'stem_head_width': [(1, 1), (0.6666666666666666, 0.6666666666666666)], 'FPS_min': [0, 155.0], 'FPS_max': [0, 175.0], 'batch_size': 3, 'niters_per_epoch': 400, 'latency_weight': [0, 0], 'nepochs': 20, 'save': 'search-pretrain-256x512_F12.L16_batch3-20211202-161628', 'unrolled': False}
02 16:16:36 params = 2.568351MB, FLOPs = 71.064453GB
architect initialized!
using downsampling: 2
Found 25 images
using downsampling: 2
Found 25 images
using downsampling: 2
Found 25 images
  0%|                                                    | 0/20 [00:00<?, ?it/s]02 16:25:11 True
02 16:25:11 search-pretrain-256x512_F12.L16_batch3-20211202-161628
02 16:25:11 lr: 0.02
02 16:25:11 update arch: False
[Epoch 1/20][trTraceback (most recent call last):        | 0/20 [00:00<?, ?it/s]
  File "train_search.py", line 307, in <module>
    main(pretrain=config.pretrain) 
  File "train_search.py", line 137, in main
    train(pretrain, train_loader_model, train_loader_arch, model, architect, ohem_criterion, optimizer, lr_policy, logger, epoch, update_arch=update_arch)
  File "train_search.py", line 246, in train
    loss = model._loss(imgs, target, pretrain)
  File "/home/FasterSeg/search/model_search.py", line 489, in _loss
    logits = self(input)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/FasterSeg/search/model_search.py", line 287, in forward
    out_prev = [[stem(input), None]] # stem: one cell
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/FasterSeg/search/operations.py", line 127, in forward
    x = self.conv(x)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/batchnorm.py", line 83, in forward
    exponential_average_factor, self.eps)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 1697, in batch_norm
    training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/pin_memory.py", line 21, in _pin_memory_loop
    r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/reductions.py", line 276, in rebuild_storage_fd
    fd = df.detach()
  File "/usr/lib/python3.6/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/usr/lib/python3.6/multiprocessing/resource_sharer.py", line 87, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 493, in Client
    answer_challenge(c, authkey)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 732, in answer_challenge
    message = connection.recv_bytes(256)         # reject large message
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError

Btw, the container I'm running was built via the Dockerfile installation process. If I follow the same steps and run the training command above in your image provided on Docker Hub, it doesn't detect that TensorRT is installed, and I get this error:

root@be035b6f0647:/home/FasterSeg/search# CUDA_VISIBLE_DEVICES=0 python train_search.py
/home/FasterSeg/tools/utils/darts_utils.py:179: UserWarning: TensorRT (or pycuda) is not installed. compute_latency_ms_tensorrt() cannot be used.
  warnings.warn("TensorRT (or pycuda) is not installed. compute_latency_ms_tensorrt() cannot be used.")
use PyTorch for latency test
use PyTorch for latency test
Experiment dir : search-pretrain-256x512_F12.L16_batch3-20211202-152200
02 15:22:00 args = {'seed': 12345, 'repo_name': 'FasterSeg', 'abs_dir': '/home/FasterSeg/search', 'this_dir': 'search', 'root_dir': '/home/FasterSeg', 'dataset_path': '/home/FasterSeg/dataset', 'img_root_folder': '/home/FasterSeg/dataset', 'gt_root_folder': '/home/FasterSeg/dataset', 'train_source': '/home/FasterSeg/dataset/train_mapping_list.txt', 'eval_source': '/home/FasterSeg/dataset/val_mapping_list.txt', 'num_classes': 3, 'background': -1, 'image_mean': array([0.485, 0.456, 0.406]), 'image_std': array([0.229, 0.224, 0.225]), 'down_sampling': 2, 'image_height': 256, 'image_width': 512, 'gt_down_sampling': 8, 'num_train_imgs': 0, 'num_eval_imgs': 0, 'bn_momentum': 0.1, 'bn_eps': 1e-05, 'lr': 0.02, 'momentum': 0.9, 'weight_decay': 0.0005, 'num_workers': 4, 'train_scale_array': [0.75, 1, 1.25], 'eval_stride_rate': 0.8333333333333334, 'eval_scale_array': [1], 'eval_flip': False, 'eval_height': 1024, 'eval_width': 2048, 'grad_clip': 5, 'train_portion': 0.5, 'arch_learning_rate': 0.0003, 'arch_weight_decay': 0, 'layers': 16, 'branch': 2, 'pretrain': True, 'prun_modes': ['max', 'arch_ratio'], 'Fch': 12, 'width_mult_list': [0.3333333333333333, 0.5, 0.6666666666666666, 0.8333333333333334, 1.0], 'stem_head_width': [(1, 1), (0.6666666666666666, 0.6666666666666666)], 'FPS_min': [0, 155.0], 'FPS_max': [0, 175.0], 'batch_size': 3, 'niters_per_epoch': 400, 'latency_weight': [0, 0], 'nepochs': 20, 'save': 'search-pretrain-256x512_F12.L16_batch3-20211202-152200', 'unrolled': False}
02 15:22:09 params = 2.568351MB, FLOPs = 71.064453GB
Traceback (most recent call last):
  File "train_search.py", line 306, in <module>
    main(pretrain=config.pretrain) 
  File "train_search.py", line 69, in main
    model = model.cuda()
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 265, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 193, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 193, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 205, in _apply
    self._buffers[key] = fn(buf)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 265, in <lambda>
    return self._apply(lambda t: t.cuda(device))
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 162, in _lazy_init
    _check_driver()
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 82, in _check_driver
    http://www.nvidia.com/Download/index.aspx""")
AssertionError: 
Found no NVIDIA driver on your system. Please check that you
have an NVIDIA GPU and installed a driver from
http://www.nvidia.com/Download/index.aspx

So I'm sticking with the first container. Do you have any idea what's happening in this case?

DarkMythosIOTA commented 2 years ago

@emersonjr Did you install the NVIDIA container runtime for Docker as described in the installation instructions?

Regarding TensorRT: yes, we had to remove TensorRT from the environment because it always led to errors during training.

i-am-nut commented 2 years ago

@Gaussianer I noticed that, by mistake, I hadn't. I've installed it now and retried training, but it's still giving the same error (yes, I restarted the Docker service, rebooted, and even started a new container). Any ideas?

DarkMythosIOTA commented 2 years ago

@emersonjr Have you installed the appropriate graphics card driver as well as CUDA 10.1 and cuDNN? We have provided a guide for CentOS 7 for the setup with Podman.
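One quick way to narrow this down, besides running nvidia-smi on the host, is to check inside the running container what PyTorch itself sees. This is only a sketch using standard torch APIs, nothing FasterSeg-specific:

import torch

print("CUDA available:    ", torch.cuda.is_available())
print("CUDA (torch build):", torch.version.cuda)
print("cuDNN enabled:     ", torch.backends.cudnn.enabled)
print("cuDNN version:     ", torch.backends.cudnn.version())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    # A tiny conv + batchnorm forward pass exercises the same cuDNN path
    # that raises CUDNN_STATUS_EXECUTION_FAILED in train_search.py.
    x = torch.randn(1, 3, 64, 64, device="cuda")
    layer = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3, padding=1),
                                torch.nn.BatchNorm2d(8)).cuda()
    print("Forward pass OK:", layer(x).shape)

If this small forward pass already fails, the problem is most likely the driver/CUDA/cuDNN combination in the container rather than anything related to the custom dataset.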