confusion on field of view and model inference time

LinaShanghaitech commented 2 years ago

Hi RolandGao, nice work! I see you've done a lot of experiments on the backbone settings, but I still have some questions after reading your published paper.

First, you calculate that a fov of 4095 is needed to see the bottom-right pixel when training on Cityscapes (1024x2048), and you settle on the exp48 backbone [(1,1) + (1,2) + 4(1,4) + 7(1,14)], which has a fov of 3807. But I see the same backbone is used when training on CamVid (720x960). Why not use a shallower backbone there? I am training on my own dataset with an image resolution of 512 x 512. Do I need to modify the backbone architecture? Can you give some advice?
Second, I tested the inference time of RegSeg. Its speed is no better than that of other real-time architectures, apparently because of the split and dilated convolutions, even though the model has low GFLOPs. In our application, speed is what matters, so is there any strategy to improve it?

RolandGao commented 2 years ago

My Cityscapes model ranked No. 1 on CamVid without extra tuning, so I left it at that. I could have tried reducing the fov for CamVid, and it might have reached even better accuracy.

If you're training on images with resolution (512 x 512), you need a fov of about 1023 (2 x 512 - 1, the same reasoning that gives 4095 for a width of 2048) for every output pixel to see every input pixel. You should decrease the dilation rates of the backbone to roughly match that fov requirement. For example, [(1,1) + 5(1,2) + 7(1,4)] has a fov of 1311, so this configuration could work pretty well.
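
The fov figures quoted in this thread can be reproduced with the standard receptive-field recurrence. The sketch below is a reconstruction under stated assumptions, not code from the RegSeg repo: it assumes every relevant layer is a 3x3 convolution, that the stem and the first block of each stage use stride 2 (reaching stride 16), that the 1/4 stage has one block and the 1/8 stage has three, and that only the largest dilation in each block matters. `receptive_field`, `regseg_like_fov`, and `required_fov` are hypothetical helper names.

```python
def receptive_field(layers):
    """Receptive field of a conv chain given (kernel, stride, dilation) tuples."""
    rf, jump = 1, 1  # jump: spacing, in input pixels, between adjacent feature cells
    for kernel, stride, dilation in layers:
        rf += (kernel - 1) * dilation * jump
        jump *= stride
    return rf


def regseg_like_fov(stride16_dilations):
    """fov of an ASSUMED RegSeg-style backbone.

    stride16_dilations: one tuple of branch dilations per block at stride 16,
    written the same way as in the thread, e.g. [(1, 1), (1, 2), (1, 4)].
    """
    # Assumed layout before the dilated blocks (all 3x3 convs).
    layers = [
        (3, 2, 1),                        # stem, stride 2
        (3, 2, 1),                        # 1/4 stage: one block
        (3, 2, 1), (3, 1, 1), (3, 1, 1),  # 1/8 stage: three blocks
        (3, 2, 1),                        # first 1/16 block (downsampling)
    ]
    # The widest branch of each block determines the fov.
    layers += [(3, 1, max(d)) for d in stride16_dilations]
    return receptive_field(layers)


def required_fov(size):
    """fov a corner pixel needs to reach the opposite end of a `size`-pixel side."""
    return 2 * size - 1


exp48 = [(1, 1), (1, 2)] + 4 * [(1, 4)] + 7 * [(1, 14)]
small = [(1, 1)] + 5 * [(1, 2)] + 7 * [(1, 4)]

print(regseg_like_fov(exp48))  # 3807
print(regseg_like_fov(small))  # 1311
print(required_fov(2048))      # 4095
```

Under these assumptions the recurrence matches both fov values mentioned above, so it can be used to compare other dilation lists against `required_fov` for a given crop size before committing to a training run.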

If your dataset is big enough, you can try training the model from scratch. If the dataset is not that big, pretraining the model on a larger dataset will help. Since decreasing the dilation rates does not change the number of parameters, you can load the Cityscapes pretrained model with the above configuration and then fine-tune it on your dataset. This should work, although it might not achieve the best accuracy possible. A better but possibly more time-consuming way is to train on COCO, which has images with resolution around 512x512, and then fine-tune it on your dataset. This codebase already supports training on COCO. Check configs/coco_100epoch.yaml and datasets/coco.py for details. Since COCO is large, training for longer will usually lead to better accuracy. It depends on how much compute and effort you want to spend.
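
The point that dilation rates do not affect the parameter count follows from how a 2D convolution's weight tensor is shaped. A minimal sketch, assuming the common (out_channels, in_channels / groups, kH, kW) weight layout; `conv_weight_shape` is a hypothetical helper, and the channel and group numbers are illustrative rather than taken from the repo:

```python
from math import prod


def conv_weight_shape(in_ch, out_ch, kernel, groups=1, dilation=1):
    """Weight shape of a 2D conv: (out_ch, in_ch // groups, kernel, kernel)."""
    # `dilation` is accepted but deliberately unused: it changes where the
    # kernel taps sample the input, not how many weights there are.
    return (out_ch, in_ch // groups, kernel, kernel)


pretrained = conv_weight_shape(128, 128, 3, groups=8, dilation=14)
modified = conv_weight_shape(128, 128, 3, groups=8, dilation=4)

print(pretrained == modified)  # True
print(prod(pretrained))        # 18432 weights either way
```

Because the shapes are identical, a checkpoint trained with one set of dilation rates can be copied tensor by tensor into a model built with another. The weights were still learned at the original dilations, which is why fine-tuning afterwards is needed.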

The easy way to improve the speed is to decrease the resolution of the image, although that could lead to lower accuracy. You can also try building a new model that reduces the number of channels and the number of blocks, but that requires more work on your side.
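
For a rough sense of the resolution trade-off: the multiply-accumulate count of a convolution scales with the number of output pixels, so halving each side of the image cuts it to about a quarter. A minimal sketch with a hypothetical helper `conv_macs` and illustrative layer sizes. As the question in this thread points out, FLOPs are only a proxy, and measured latency on real hardware can behave differently:

```python
def conv_macs(out_h, out_w, in_ch, out_ch, kernel, groups=1):
    """Multiply-accumulates of one 2D conv layer (bias ignored)."""
    return out_h * out_w * out_ch * (in_ch // groups) * kernel * kernel


full = conv_macs(512, 512, 32, 32, 3)
half = conv_macs(256, 256, 32, 32, 3)

print(full / half)  # 4.0
```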

LinaShanghaitech commented 2 years ago

Thanks for your timely reply. My resources are limited, so I cannot pretrain on Cityscapes, and my dataset consists of parking lots, which differ from the autonomous driving scenario. So I have to train the model from scratch with the configuration [(1,1) + 5(1,2) + 7(1,4)], but it still does not work well, especially for some classes. The performance is similar to the original setting, though I thought there might be a large difference. I use this model to segment the parking lines and solid lines in parking lots. The minimum downsampling factor in the network is 4, but a parking line is only about 6 pixels wide in a 512 x 512 image. My guess is that bilinear interpolation alone in the last layer cannot decode such thin lines well when upsampling to the input size. Is that right?

RolandGao commented 2 years ago

A pixel width of 6 is actually not too bad. Most semantic segmentation models today have output stride 4 or 8.

If you can't pretrain on Cityscapes, you can use one of the already pretrained models like this one (https://github.com/RolandGao/RegSeg/releases/download/v1.0-alpha/cityscapes_exp48_decoder26_trainval_1000_epochs_1024_crop_bootstrapped_run1). Your dilation rates would differ from the original pretrained model's, but you can still load the model weights. Also, make sure to tune your learning rate. The optimal learning rate depends on whether you use pretrained weights. For example, we used a learning rate of 0.005 for CamVid when starting from the Cityscapes pretrained model.

LinaShanghaitech commented 2 years ago

Thanks a lot. I will try it and hope it gives good results :)

Thyhyh99 commented 2 years ago

Thank you very much. I was looking for this pretrained model.

RolandGao / RegSeg

confusion on field of view and model inference time #8