DLR-RM / AugmentedAutoencoder

Official Code: Implicit 3D Orientation Learning for 6D Object Detection from RGB Images
MIT License

Cannot reproduce the detection result #60

Closed · pengsida closed this issue 4 years ago

pengsida commented 4 years ago

Hi, dear author,

In Section 3.6.1 of your paper, the RetinaNet achieves 0.73 mAP@0.5IoU on the T-LESS dataset. However, my best detection model only reaches 0.599 mAP@0.5IoU when trained on your training data.

Could you please tell me the training details? Did you use only the training data provided in this code? I would really appreciate it if a pretrained model could be provided!

DateBro commented 4 years ago

@pengsida Hi, is your 0.599 mAP@0.5IoU the best per-class result or the average mAP over all classes? The best AP of my resnet50_csv_50 model is ('1008 instances of class', 'obj_25', 'with average precision: 0.5708'), while the average mAP over classes is just mAP: 0.2543.
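To be clear about what I mean by the average, here is a minimal sketch (the per-class AP values below are made up, not my actual numbers): keras-retinanet prints one AP per class, and the dataset-level mAP I quote is the mean over classes, not the best class.

```python
# Hypothetical per-class APs, only to illustrate "best class" vs "mean over classes".
per_class_ap = {'obj_05': 0.61, 'obj_25': 0.57, 'obj_30': 0.42}

mean_ap = sum(per_class_ap.values()) / len(per_class_ap)  # the dataset-level mAP
best_ap = max(per_class_ap.values())                       # best single class only
print('mAP over classes: %.4f, best single-class AP: %.4f' % (mean_ap, best_ap))
```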

pengsida commented 4 years ago

Yes. Actually, I got results similar to yours when using exactly the training data provided by the author. I added extra data augmentation to improve the detection performance.
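For context, a rough sketch of the kind of extra augmentation I mean, using imgaug (which this repo already depends on); the exact operators and magnitudes below are only illustrative, not necessarily what I trained with.

```python
import numpy as np
from imgaug import augmenters as iaa

# Illustrative photometric + mild geometric jitter for the detector training crops.
augmenter = iaa.Sequential([
    iaa.Sometimes(0.5, iaa.GaussianBlur(sigma=(0.0, 1.5))),
    iaa.Add((-20, 20), per_channel=0.3),           # brightness jitter
    iaa.Multiply((0.8, 1.2), per_channel=0.3),     # contrast jitter
    iaa.Affine(scale=(0.9, 1.1), rotate=(-5, 5)),  # mild geometric jitter
], random_order=True)

dummy_batch = np.zeros((4, 128, 128, 3), dtype=np.uint8)  # placeholder images
augmented = augmenter.augment_images(dummy_batch)
```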

DateBro commented 4 years ago

By the way, have you evaluated the 6D pose estimation of AAE with ground-truth bounding boxes following the README? I ran into some problems with the translation error and would like to compare notes with someone who has completed it without issues. After all, we can't do much until the author tells us the training details.

pengsida commented 4 years ago

No, I did not run the pose estimation part.

DateBro commented 4 years ago

Okay. I'm looking forward to your pose estimation results with ground-truth bounding boxes, which might give me some hints about what I am doing wrong during training or evaluation.

pengsida commented 4 years ago

I have no plans to run the pose estimation part of this code. I have been comparing PVNet with this paper on the T-LESS dataset.

DateBro commented 4 years ago

Excuse me, doesn't your comparison include metrics like the 2D projection error or translation error, which require 6D pose estimation first?
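For reference, the standard definitions of the two metrics I mean, as a rough NumPy sketch (variable names are assumed; this is not code from this repo or from PVNet):

```python
import numpy as np

def projection_2d_error(model_pts, K, R_est, t_est, R_gt, t_gt):
    """Mean 2D reprojection distance (pixels) of the (N, 3) model points
    under the estimated vs. ground-truth pose, with camera intrinsics K."""
    def project(R, t):
        cam = R @ model_pts.T + t.reshape(3, 1)   # 3 x N points in camera frame
        uvw = K @ cam
        return (uvw[:2] / uvw[2]).T               # N x 2 pixel coordinates
    return np.linalg.norm(project(R_est, t_est) - project(R_gt, t_gt), axis=1).mean()

def translation_error(t_est, t_gt):
    """Euclidean distance between estimated and ground-truth translation."""
    return np.linalg.norm(np.asarray(t_est) - np.asarray(t_gt))
```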

pengsida commented 4 years ago

The pose estimation results are already listed in the paper, so I do not need to run the pose estimation code.

DateBro commented 4 years ago

emmm, so you just want to get a 2D detector with high mAP?

pengsida commented 4 years ago

Yes, I want to reproduce the detection results of this paper on the T-LESS dataset.

DateBro commented 4 years ago

Oh, I just found that the arXiv version of the paper reports its detector performance in Section 3.6.1, while the version downloaded from the ECCV website doesn't contain it.

DateBro commented 4 years ago

Maybe you can contact @qyz55 in issue #51. It seems that he has reproduced some of the paper's results.

pengsida commented 4 years ago

@DateBro Thank you. I believe that the author @MartinSmeyer will give a reply.

MartinSmeyer commented 4 years ago

Hi @pengsida @DateBro, the 0.73 mAP@0.5IoU is from the IJCV version. My colleague trained and evaluated the RetinaNet, but this number might be higher because it refers to the SISO setting from the SIXD Challenge (i.e., object localization), since all the other works we compare to also use that setting.

pengsida commented 4 years ago

@MartinSmeyer Thanks for your explanation. Does SISO mean 6D localization of a single instance of a single object? Could you provide me with the pretrained model?

MartinSmeyer commented 4 years ago

> Does SISO mean 6D localization of a single instance of a single object?

Yes, that would at least explain the difference. I would need to ask my co-author again to be sure; evaluating existing 2D detectors was not really the focus of the paper.

Unfortunately, the checkpoint does not work with the current open-source RetinaNet project, since the code was adapted for internal use from an old version of the Fizyr repo. However, I have a MaskRCNN model that we trained on T-LESS for the recent BOP Challenge using maskrcnn-benchmark, which gives similar results. I could send you an email with that one if you wish.

pengsida commented 4 years ago

Ok, thank you. I have been comparing my paper with yours, so I have to ensure that the metrics we use are the same. My email is pengsida@zju.edu.cn. Please send me the MaskRCNN model.

One more question: is the VSD metric also computed for a single instance of a single object in the scene? Is it the code you provide in this repository? I am very grateful to you for being so responsive.

MartinSmeyer commented 4 years ago

Yes, I strictly followed the evaluation rules from the SIXD Challenge 2017, with a single instance for each object in the scene and the VSD metric. Still, it was not just localization, because if the detection score of an object in the scene was under a certain threshold, the detection was not taken into account. And yes, ae_eval.py uses the sixd_toolkit to evaluate the SISO VSD error.
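To make that concrete, a rough sketch of the filtering just described (field names are assumed; the actual logic lives in eval_loc.py and the sixd_toolkit):

```python
def select_siso_estimates(estimates, score_thresh):
    """Keep the single highest-scoring estimate per (image, target object),
    dropping estimates whose detection score is below the threshold.
    `estimates` is assumed to be a list of dicts with 'im_id', 'obj_id',
    'score' and 'pose' keys."""
    best = {}
    for est in estimates:
        if est['score'] < score_thresh:
            continue  # low-confidence detections are not taken into account
        key = (est['im_id'], est['obj_id'])
        if key not in best or est['score'] > best[key]['score']:
            best[key] = est
    return list(best.values())
```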

Please note that there is now a successor challenge in which we also participated. You have probably heard of the BOP Challenge, which first took place this year at ICCV. It has an automatic evaluation system that ensures everybody uses the same evaluation metrics, and some datasets have private test labels. The task there is VIVO (varying #instances and #objects), meaning that you have to detect and estimate the pose of all instances in a scene. It uses three different metrics, among them VSD. We used the MaskRCNN that I will send you for that challenge. Submission is always open, and your results stay private until you decide to make them public.
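And in case it helps, a simplified sketch of the VSD error used in both challenges, assuming you already have rendered depth maps and visibility masks for the estimated and ground-truth poses (the official implementation is in the sixd_toolkit):

```python
import numpy as np

def vsd_error(depth_est, depth_gt, visib_est, visib_gt, tau=20.0):
    """Step-cost VSD: fraction of pixels in the union of the two visibility
    masks that are either not shared or differ in depth by more than tau (mm)."""
    union = visib_est | visib_gt
    if not union.any():
        return 1.0
    inter = visib_est & visib_gt
    ok = inter & (np.abs(depth_est - depth_gt) < tau)  # matching pixels cost 0
    return 1.0 - ok.sum() / union.sum()
```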

pengsida commented 4 years ago

Actually, I have been reading your code carefully these days and found an important parameter, n_top, at https://github.com/DLR-RM/AugmentedAutoencoder/blob/6046e9e2963c559fdc0f5fb2be99c18add909ff0/sixd_toolkit_extensions/eval_loc.py#L196
When n_top = 0, it seems to output the VIVO results, as described at https://github.com/DLR-RM/AugmentedAutoencoder/blob/6046e9e2963c559fdc0f5fb2be99c18add909ff0/sixd_toolkit_extensions/eval_loc.py#L109, and when n_top = 1, it seems to output the SISO results.
The default value of n_top is 0 at https://github.com/DLR-RM/AugmentedAutoencoder/blob/0eace23185249916977f001a2aa16fad3c098fc7/auto_pose/ae/cfg_eval/eval_template.cfg#L21

So the results in your paper are for SISO VSD, and I need to set n_top = 1? In Section 4.2 of your paper, you also say that the VSD metric is the same as in the SIXD Challenge.

Yes, I was previously invited to the BOP Challenge, but I was busy at that time. I will report our results on the BOP Challenge if possible.

You are very conscientious about your work. That is so cool!

MartinSmeyer commented 4 years ago

Yes, you are right. For the SISO case, top_n_eval should be set to 1.

I have to update the default eval config, otherwise people will evaluate in the VIVO setting and get worse results if there is more than one instance in the image. Thanks for pointing this out.

Lynne-Zheng-Linfang commented 4 years ago

> Does SISO mean 6D localization of a single instance of a single object? Yes, that would at least explain the difference. I would need to ask my co-author again to be sure; evaluating existing 2D detectors was not really the focus of the paper.
>
> Unfortunately, the checkpoint does not work with the current open-source RetinaNet project, since the code was adapted for internal use from an old version of the Fizyr repo. However, I have a MaskRCNN model that we trained on T-LESS for the recent BOP Challenge using maskrcnn-benchmark, which gives similar results. I could send you an email with that one if you wish.

Hi, Martin, I am also trying to compare the results. Could you please share the pretrained model with me? My email is LXZ948@student.bham.ac.uk. Thank you very much!