aau-cns / poet

PoET: Pose Estimation Transformer for Single-View, Multi-Object 6D Pose Estimation

Train using custom dataset #17

Open z7r7y7 opened 8 months ago

z7r7y7 commented 8 months ago

Thank you for providing this excellent project! I would like to train on my custom dataset. In the backbone.py file, I noticed that setting self[0].train_backbone to True results in a NotImplementedError. Does this mean that the backbone is currently not trainable? If I want to use Mask R-CNN as the backbone for training on my custom dataset, what steps should I follow?

tgjantos commented 8 months ago

Yes, PoET currently does not support training the backbone. We intended PoET to be an extension on top of any pre-trained backbone.

If you want to use Mask R-CNN as the backbone for training on your custom dataset, you should first pre-train it separately on your dataset for object detection. Once the network is trained, you can include the pre-trained weights in the PoET training with the argument --backbone_weights.
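For reference, a minimal sketch of such a pre-training setup with torchvision's Mask R-CNN could look like the following. `NUM_CLASSES`, `data_loader`, and the output filename are placeholders for your own setup, and the loop follows the standard torchvision detection recipe:

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

NUM_CLASSES = 5  # placeholder: number of object classes in your dataset + 1 for background

# Start from COCO-pretrained weights and swap both heads for the custom classes.
model = maskrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, NUM_CLASSES)
in_channels = model.roi_heads.mask_predictor.conv5_mask.in_channels
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_channels, 256, NUM_CLASSES)

optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)
model.train()
for images, targets in data_loader:  # placeholder: your detection DataLoader
    # In training mode the torchvision model returns a dict of losses.
    loss_dict = model(images, targets)
    loss = sum(loss_dict.values())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Save the weights; this file can then be passed to PoET via --backbone_weights.
torch.save(model.state_dict(), "maskrcnn_custom.pth")
```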

Hope this helps you!

Best, Thomas

z7r7y7 commented 8 months ago

Thank you for your reply! I noticed that there are many versions of Mask R-CNN on GitHub. May I ask which version's weights can be directly loaded with the argument --backbone_weights?

tgjantos commented 8 months ago

You can check the details in the backbone_maskrcnn.py file. As of now, you can use the model as it is provided by PyTorch (torchvision's Mask R-CNN with a ResNet-50 backbone).

However, you can extend the code to use any object detector backbone you want, as long as you return the necessary feature maps and detected objects; see the sketch below.
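For illustration, a hypothetical wrapper around a pre-trained detector could look like this. The class name and the `(features, detections)` return convention are illustrative assumptions, not PoET's actual interface:

```python
from typing import List
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

class DetectorBackbone(torch.nn.Module):
    """Illustrative wrapper: exposes multi-scale feature maps and
    per-image detections from a frozen, pre-trained detector."""

    def __init__(self):
        super().__init__()
        self.detector = maskrcnn_resnet50_fpn(weights="DEFAULT").eval()

    @torch.no_grad()
    def forward(self, images: List[torch.Tensor]):
        # Resize and normalize the images exactly as the detector expects.
        image_list, _ = self.detector.transform(images)
        # Multi-scale FPN feature maps (keys '0'..'3' and 'pool').
        features = self.detector.backbone(image_list.tensors)
        # Per-image detections: dicts with 'boxes', 'labels', 'scores', 'masks'.
        # (Re-running the full model here is redundant but kept for clarity.)
        detections = self.detector(images)
        return features, detections
```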

z7r7y7 commented 8 months ago

Thank you for your reply! I want to incorporate depth information into the input. Can I use the detection results of an object detection model and fuse them with the output of another backbone network that contains depth information?

tgjantos commented 8 months ago

Does the backbone network also contain RGB information? In general, you can do that. The object detections do not have to come from the same network that provides the feature maps.

However, I think 6D relative object pose estimation purely based on depth images might be difficult.

On the other hand, combining RGB information with depth information should improve the performance.

z7r7y7 commented 8 months ago

You're right. Due to the presence of objects with similar shapes but varying sizes in my custom dataset, and the uncertainty of scale in monocular RGB images, relying solely on RGB images may yield suboptimal results. Therefore, I intend to fuse depth information with RGB information as input to the network.

tgjantos commented 8 months ago

I don't see any limitation preventing the transformer part from processing a combination of RGB and depth feature maps. Therefore, if you have a backbone network that produces such feature maps, it should work out!
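As a starting point, a per-level fusion could be as simple as channel-wise concatenation followed by a 1x1 projection. This is a minimal sketch assuming you already have RGB and depth feature maps at the same spatial resolution; the module name and channel sizes are placeholders:

```python
import torch
import torch.nn as nn

class RGBDFusion(nn.Module):
    """Minimal sketch: fuse RGB and depth feature maps of the same
    spatial size via concatenation and a 1x1 projection."""

    def __init__(self, rgb_channels: int, depth_channels: int, out_channels: int):
        super().__init__()
        self.proj = nn.Conv2d(rgb_channels + depth_channels, out_channels, kernel_size=1)

    def forward(self, rgb_feat: torch.Tensor, depth_feat: torch.Tensor) -> torch.Tensor:
        # (B, C_rgb + C_depth, H, W) -> (B, C_out, H, W)
        return self.proj(torch.cat([rgb_feat, depth_feat], dim=1))

# Usage: fuse one feature pyramid level before handing it to the transformer.
fusion = RGBDFusion(rgb_channels=256, depth_channels=256, out_channels=256)
fused = fusion(torch.randn(1, 256, 64, 64), torch.randn(1, 256, 64, 64))
```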

Let me know how it goes!

Best, Thomas

z7r7y7 commented 8 months ago

Thank you so much for your response. I will definitely try incorporating the RGB and depth feature maps into the Transformer model and see how it performs. I'm excited about the possibilities!

If I make any progress or have any further questions, I would be delighted to continue the conversation with you. Your support and interest mean a lot to me.

Best, Ruiyun

tgjantos commented 8 months ago

Definitely, I would be happy to continue the discussion and help you out whenever needed!

Best, Thomas