NVIDIA / semantic-segmentation

Nvidia Semantic Segmentation monorepo

Finetuning on KITTI #23

Closed · HanqingXu closed this issue 5 years ago

HanqingXu commented 5 years ago

Hi, I am impressed by the outcome of your work on multiple datasets. However, I wonder if you can offer some training details about finetuning on KITTI to achieve 72.8% on the test set, since the amount of data in KITTI is highly limited. Sincere thanks.

HanqingXu commented 5 years ago

To further explain my confusion, I wonder if you can clarify a few training details:

  1. Do you pretrain the model on Cityscapes or Mapillary?

  2. During training on KITTI, the original paper says you only train for 90 epochs. Does that number refer to 90 epochs on KITTI? If so, the number of iterations would be quite small. Looking forward to your reply.

bryanyzhu commented 5 years ago

Sure, thank you for your interest in our work.

  1. Yes, we pretrain our model on Cityscapes. We found it very effective, because both datasets are captured in German cities and the data distributions may be similar.

  2. Yes, the 90 epochs are for KITTI, so the fine-tuning procedure is very fast. As we said in the paper (and also in the code), we use a 90/10 train/val split to fine-tune on KITTI. As I vaguely recall, our best model on KITTI was the 66th-epoch model trained on the val_2 split.
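
For concreteness, here is a minimal sketch of what such a 90/10 cross-validation split over KITTI's 200 training images could look like (the file layout and fold logic are hypothetical, not the repo's actual split code):

    import glob

    # Hypothetical layout for the 200 labeled KITTI training images.
    images = sorted(glob.glob("kitti/training/image_2/*.png"))

    def kitti_split(images, cv_split=2, n_folds=10):
        """Hold out one fold (10%) for validation, train on the remaining 90%."""
        val = [img for i, img in enumerate(images) if i % n_folds == cv_split]
        train = [img for i, img in enumerate(images) if i % n_folds != cv_split]
        return train, val

    train_list, val_list = kitti_split(images, cv_split=2)
    print(len(train_list), len(val_list))  # 180 / 20 for 200 images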

HanqingXu commented 5 years ago

Thank you for your prompt reply, it helps a lot. But I still have a few more questions:

  1. Since you didn't publish a baseline result on KITTI, I wonder whether the biggest reason your model performs so well is that it starts from an extremely strong baseline (which surely drops a little on KITTI due to the distribution difference).

  2. Do you employ exactly the same video propagation and label relaxation on KITTI as you did on Cityscapes?

  3. I'm currently working on the problem of getting good generalization performance on small datasets (exactly like KITTI), so any advice from your previous experience or general ideas would be greatly helpful.

Again, genuinely thank you for the reply.

HanqingXu commented 5 years ago

And one last question: you said the model performed best at the 66th epoch when trained on the val_2 split. What do you mean by that? How can you tell it would be the best one? Thank you.

bryanyzhu commented 5 years ago

  1. Yes, I'm sure my baseline is high. But due to the submission policy of the KITTI benchmark, I can only submit once per publication, so I didn't try my baseline on KITTI. You can try it and submit it yourself.

  2. I didn't use video propagation for KITTI for two reasons. First, KITTI is too small and there is no diversity, so there is no need to propagate. Second, although KITTI has raw videos, we don't know which raw frames correspond to the labeled images, so there is no way (at least to my understanding) to easily use the video information. I did use label relaxation, just as I did for Cityscapes.

  3. From my side, I would suggest two things: (1) try some self-supervised or weakly-supervised pre-training on a large driving dataset, since the more data you see, the more robust your model is; and (2) try not to overfit.

  4. I test many checkpoints on the whole training dataset to see which model has the highest mIoU. Because KITTI has only 200 images, it's pretty fast to do this kind of evaluation to pick the "best" one. It may not actually be the best due to the train/test distribution difference; it just seems to perform the best to me.
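
As a reference, a rough sketch of this kind of checkpoint sweep (the paths and the evaluate_miou helper below are hypothetical, not the repo's actual evaluation script):

    import glob

    # Hypothetical helper: runs inference with one checkpoint over the 200
    # KITTI training images and returns the mean IoU. Not part of this repo.
    from my_eval_utils import evaluate_miou

    best_ckpt, best_miou = None, -1.0
    for ckpt in sorted(glob.glob("./logs/kitti_ft/*.pth")):
        miou = evaluate_miou(ckpt,
                             image_dir="kitti/training/image_2",
                             label_dir="kitti/training/semantic")
        print(f"{ckpt}: mIoU = {miou:.2f}")
        if miou > best_miou:
            best_ckpt, best_miou = ckpt, miou

    print("best checkpoint:", best_ckpt, best_miou)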

Hope this helps.

HanqingXu commented 5 years ago

Thank you again for the generous help!

HanqingXu commented 5 years ago

I wonder if you still have the configuration script for finetuning the model on KITTI, because I have tried many configurations and the mIoU always drops dramatically (compared with Cityscapes).

bryanyzhu commented 5 years ago

I don't have it at hand, but as far as I can recall, the recipe should be something like the one below:

python -m torch.distributed.launch --nproc_per_node=8 train.py \
        --dataset kitti \
        --cv 2 \
        --arch network.deepv3.DeepWV3Plus \
        --snapshot ./pretrained_models/YOUR_TRAINED_CITYSCAPES_MODEL \
        --class_uniform_pct 0.5 \
        --class_uniform_tile 360 \
        --lr 0.001 \
        --lr_schedule scl-poly \
        --poly_exp 1.0 \
        --repoly 1.5  \
        --rescale 1.0 \
        --sgd \
        --crop_size 360 \
        --scale_min 1.0 \
        --scale_max 2.0 \
        --color_aug 0.25 \
        --max_epoch 90 \
        --jointwtborder \
        --strict_bdr_cls 5,6,7,11,12,17,18 \
        --rlx_off_epoch 20 \
        --wt_bound 1.0 \
        --bs_mult 2 \
        --syncbn \
        --apex \
        --exp kitti_ft \
        --ckpt ./logs/ \
        --tb_path ./logs/

The training mIoU at the end is around 65.

HanqingXu commented 5 years ago

That's really helpful. Just out of curiosity, if the training mIoU is around 65, how come it reaches 72.82 on the final test set?

bryanyzhu commented 5 years ago

Two things:

kwea123 commented 5 years ago

Hi, I have a question concerning your answer here:

I didn't use video propagation for KITTI for two reasons. First, KITTI is too small and there is no diversity, so there is no need to propagate. Second, although KITTI has raw videos, we don't know which raw frames correspond to the labeled images.

  1. What do you mean by "there is no diversity"? In my understanding, your propagation is just a way to increase diversity, so it would be better to apply it, wouldn't it?

  2. I'm sure that most of the frames (or maybe all) can be found in the raw data; you just need to do an exhaustive frame-by-frame search to find the exact frame. It is easy to do with a simple script; it is certainly time-consuming but totally possible. Maybe it's worth a try for those who are interested (see the sketch below).
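
For those who want to try, a minimal sketch of such an exhaustive search (the directory layout is hypothetical, and the exact pixel-equality check may need to be relaxed to a tolerance or a crop-aware comparison in practice):

    import glob
    import numpy as np
    from PIL import Image

    def find_matching_frame(labeled_img_path, raw_frame_paths):
        """Return the raw-sequence frame that is pixel-identical to the labeled image."""
        target = np.asarray(Image.open(labeled_img_path))
        for frame_path in raw_frame_paths:
            frame = np.asarray(Image.open(frame_path))
            if frame.shape == target.shape and np.array_equal(frame, target):
                return frame_path
        return None

    # Hypothetical layout: KITTI semantic training images vs. raw-sequence frames.
    raw_frames = sorted(glob.glob("kitti_raw/*/*/image_02/data/*.png"))
    for labeled in sorted(glob.glob("kitti_semantics/training/image_2/*.png")):
        print(labeled, "->", find_matching_frame(labeled, raw_frames))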

bryanyzhu commented 5 years ago

@kwea123

  1. Yes, video propagation is a way to increase diversity. But for KITTI, since we only have 200 training images, even performing video propagation would not bring in much diversity (in my opinion). For example, if you have a model pretrained on ImageNet and your dataset only has several hundred images, people usually just fine-tune the model a little bit (e.g., early stopping, freezing most of the weights, etc.). So I didn't want to use too many tricks and overfit to the KITTI training set. But maybe adding video propagation would give you some extra benefit; I just didn't try it back then.

  2. Yes, you are right. You can do an exhaustive search to find the exact frame. Actually, I did this kind of exhaustive search on the CamVid data. But the KITTI raw dataset is quite large, so finding the correspondences is very time-consuming. Thank you for the clarification; it's worth a try for those who are interested.

karansapra commented 5 years ago

@kwea123 @bryanyzhu Please let me know if the issue has been resolved.

bryanyzhu commented 5 years ago

@karansapra I think the issue has been resolved. You can close it for now. They can reopen it if something comes up. Thanks.

resha1417 commented 3 years ago

Hello @bryanyzhu, I am trying to fine-tune this code on the KITTI dataset. However, while dumping the results I noticed that it dumps the original segmentation mask (binary). For Cityscapes it dumps a colorized mask, but it does not work that way for the KITTI dataset, even though I used the same kind of data loader as for Cityscapes. Can you please suggest what the issue could be?

Thanks in advance.
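
If the goal is simply to get colorized predictions for KITTI, one possible workaround, sketched here independently of the repo's own dump code and assuming the saved masks contain the 19 Cityscapes trainIds, is to apply the standard Cityscapes palette yourself:

    import numpy as np
    from PIL import Image

    # Standard Cityscapes 19-class palette (R, G, B), in trainId order.
    PALETTE = [
        (128, 64, 128), (244, 35, 232), (70, 70, 70), (102, 102, 156),
        (190, 153, 153), (153, 153, 153), (250, 170, 30), (220, 220, 0),
        (107, 142, 35), (152, 251, 152), (70, 130, 180), (220, 20, 60),
        (255, 0, 0), (0, 0, 142), (0, 0, 70), (0, 60, 100),
        (0, 80, 100), (0, 0, 230), (119, 11, 32),
    ]

    def colorize(mask_path, out_path):
        """Map a trainId label mask to an RGB color image (ignore label stays black)."""
        mask = np.asarray(Image.open(mask_path))
        color = np.zeros((*mask.shape, 3), dtype=np.uint8)
        for train_id, rgb in enumerate(PALETTE):
            color[mask == train_id] = rgb
        Image.fromarray(color).save(out_path)

    # Hypothetical paths.
    colorize("pred/000000_10.png", "pred_color/000000_10.png")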