Hi! Thanks for your attention to our work!
The NYUDepthV2-trained DFormer-L is used to generate the 40-category segmentation maps. Your 5 classes may not belong to the 40 classes from NYUDepthV2, so directly using this model on your dataset causes two problems: (1) the size mismatch in the weights that you mentioned, i.e., 'size mismatch for decode_head.conv_seg.weight: copying a param with shape torch.Size([40, 512, 1, 1]) from checkpoint, the shape in current model is torch.Size([5, 512, 1, 1]).' (2) the model cannot segment the image into your 5 classes.
To solve these problems, you need to fine-tune the trained model on your dataset. If your dataset is large enough, you can directly train the ImageNet-pretrained DFormer-L instead of the NYU-trained DFormer-L. If not, you can set the number of classes to 5 and re-initialize 'decode_head.conv_seg'. That is to say, change the number of predicted classes, initialize 'decode_head.conv_seg', and load the trained weights of the NYUDepth-trained or SUNRGBD-trained DFormer-L for the other parts.
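For concreteness, here is a minimal sketch of that partial loading in PyTorch (the function name and the assumption that the weights may be nested under a 'model' key are mine, not the repository's exact loading code):

```python
import torch

def load_pretrained_except_head(model, ckpt_path):
    """Load NYUDepth/SUNRGBD-trained weights into a model built with
    num_classes=5, skipping the 40-class 'decode_head.conv_seg' layer."""
    checkpoint = torch.load(ckpt_path, map_location="cpu")
    # Some checkpoints nest the weights under a 'model' key (assumption).
    state_dict = checkpoint.get("model", checkpoint)
    # Drop the segmentation head so it keeps its fresh random initialization.
    filtered = {k: v for k, v in state_dict.items()
                if not k.startswith("decode_head.conv_seg")}
    missing, unexpected = model.load_state_dict(filtered, strict=False)
    # 'missing' should list only decode_head.conv_seg.weight/bias.
    return missing, unexpected
```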
Thanks for the reply. I read your solution several times but couldn't fully understand it. As I'm currently working with a small amount of data, I don't think I can easily fine-tune the model.
If not, you can set the number of classes to 5 and re-initialize 'decode_head.conv_seg'. That is to say, change the number of predicted classes, initialize 'decode_head.conv_seg', and load the trained weights of the NYUDepth-trained or SUNRGBD-trained DFormer-L for the other parts.
Could you explain these sentences a little more? AFAIK, if my dataset is small, I should fine-tune the model's decode_head.conv_seg (first layer) with zero-initialization and use the other layers' weights as in the NYUv2 or SUNRGBD version?
Also, I'm trying to deploy your model on mobile, so I'm confused: if I want to use your model in the real world, where the number of objects in a scene changes a lot, is the only solution to pretrain the model with many possible numbers of objects and adapt it to the current scene? Would there be a better way to solve this problem? I'd like to hear your opinion :)
Sure! To generate the final prediction, the features with shape (B, C, H, W) are sent to the last layer to obtain the final segmentation maps with shape (B, N, H, W), where N is the number of classes. However, in the trained DFormer, N = 40, so the weights trained on NYUDepthV2 (40 classes) are not suitable for your dataset. At a minimum, you need to change the last layer to produce a 5-class output and randomly initialize it. The initialization is handled by the framework; you don't need to do anything except skip loading the last layer's weights into the model. You can then split your dataset and train the model on the training set with the 5 classes.
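To make the shapes concrete, the last layer is essentially a 1x1 convolution from the feature channels to the class channels. A toy sketch (the batch size and spatial size are illustrative; the 512 channels match the error message):

```python
import torch
import torch.nn as nn

features = torch.randn(2, 512, 120, 160)        # (B, C, H, W) decoder features

head_nyu = nn.Conv2d(512, 40, kernel_size=1)    # NYUDepthV2 head: N = 40
head_custom = nn.Conv2d(512, 5, kernel_size=1)  # your dataset: N = 5

print(head_nyu(features).shape)     # torch.Size([2, 40, 120, 160])
print(head_custom(features).shape)  # torch.Size([2, 5, 120, 160])
# head_nyu.weight is (40, 512, 1, 1) while head_custom.weight is (5, 512, 1, 1),
# exactly the mismatch reported when loading the NYU checkpoint.
```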
Semantic segmentation is constrained to a fixed number of classes. That is to say, the training data shapes the model: if your training data only contains 5 classes, the model may perform well on those five classes but not on others. It is better for the training data and the application scenes to be consistent. I think this can be overcome by: (1) more training data that matches the real-world application scenes; (2) combining DFormer with SAM-style data (Segment Anything masks, depth estimated with Depth Anything) to train an RGB-D model for class-agnostic segmentation.
I think the key to applying the model to real-world scenes is suitable, large-scale training data.
If you just want to target a specific application scene, a relatively small-scale training set that is consistent with that scene may be enough.
Yes, you're right. I have to change the last layer; your answer makes sense. To test it easily: (1) decide the exact number of classes, (2) fine-tune the model with that number of classes, (3) test it. These three steps will be the most direct way. Thanks for sharing! I will post an update as I make progress. Have a nice day!
Hello! Thanks for the nice work. I have a question about using your pretrained network for evaluation. I'm trying to run your model on my custom dataset, so I slightly changed the code in infer.sh and local_configs/template/DFormer_Large.py. However, when I try to change the number of classes (https://github.com/VCIP-RGBD/DFormer/blob/2aa25e362807b1027bddb3046a96cf1c8ec89cbf/local_configs/template/DFormer_Large.py#L34), I get an error because the pretrained model was trained on NYUv2, which has 40 classes. Is there any way to use a custom dataset with a different number of classes?
Error Log:
RuntimeError: Error(s) in loading state_dict for EncoderDecoder:
size mismatch for decode_head.conv_seg.weight: copying a param with shape torch.Size([40, 512, 1, 1]) from checkpoint, the shape in current model is torch.Size([5, 512, 1, 1]).
size mismatch for decode_head.conv_seg.bias: copying a param with shape torch.Size([40]) from checkpoint, the shape in current model is torch.Size([5]).
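A quick way to see which checkpoint tensors clash is to print the head shapes (a minimal sketch; the 'model' nesting key is an assumption about the checkpoint layout):

```python
import torch

# Inspect the segmentation-head tensors in the NYUv2 checkpoint.
ckpt = torch.load("NYUv2_DFormer_Large.pth", map_location="cpu")
state_dict = ckpt.get("model", ckpt)  # weights may be nested under 'model' (assumption)
for name, tensor in state_dict.items():
    if "conv_seg" in name:
        print(name, tuple(tensor.shape))
# Expected output:
#   decode_head.conv_seg.weight (40, 512, 1, 1)
#   decode_head.conv_seg.bias (40,)
```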
My Settings: In local_configs/template/DFormer_Large.py and infer.sh, I'm currently using my own dataset with 5 classes. Keeping 40 classes could be one workaround, but it produces many errors, so I want to solve this issue properly. I'm confused about whether using a different number of classes requires some fine-tuning or something similar. Also, do you have any idea which model would work well? I'm currently using NYUv2_DFormer_Large.pth, but I wonder whether it is the best choice. Thanks :)