KaihuaTang / Scene-Graph-Benchmark.pytorch

A new codebase for popular Scene Graph Generation methods (2020). Visualization & scene graph extraction on custom images/datasets are provided. It is also a PyTorch implementation of the CVPR 2020 paper "Unbiased Scene Graph Generation from Biased Training".
MIT License

Using a detector model pretrained on the 1750-700-400 VG dataset, the mAP is relatively low when testing #84

Open Einstone-rose opened 4 years ago

Einstone-rose commented 4 years ago

❓ Questions and Help

Hi, I generated a new VG dataset (1750-700-400) following this repo: https://github.com/danfeiX/scene-graph-TF-release, and used the generate_attribute_labels.py script you provide. Then we trained the detector model on this new dataset (1750-700-400) and tested it, but the evaluation result is relatively low: we only get 0.06 mAP. I wonder whether or not this is a normal phenomenon.

I also observed that in https://github.com/peteanderson80/bottom-up-attention the performance is similarly low when trained on the 1600-400-20 VG dataset (see https://github.com/peteanderson80/bottom-up-attention#expected-detection-results-for-the-pretrained-model; it only achieves 0.102 mAP). The author says: "mAP is relatively low because many classes overlap (e.g. person / man / guy), some classes can't be precisely located (e.g. street, field) and separate classes exist for singular and plural objects (e.g. person / people). We focus on performance in downstream tasks (e.g. image captioning, VQA) rather than detection performance."

So far, I have two questions: (1) How do you explain this phenomenon? (2) Does low detection performance have a negative effect on downstream tasks (e.g. image captioning, VQA)?
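For reference, one way to sanity-check the overlapping-classes explanation is to look at per-class AP instead of only the mean. Below is a minimal sketch (using torchmetrics, not this repo's evaluator; the boxes and labels are toy placeholders) showing how an almost-perfect box with a synonym label (person vs. man) is scored as both a false positive and a false negative, which drags mAP down:

```python
# Minimal sketch: per-class AP with torchmetrics (NOT this repo's evaluator).
# The boxes below are toy placeholders; in practice you would feed your
# detector's outputs and the VG ground truth.
import torch
from torchmetrics.detection import MeanAveragePrecision

metric = MeanAveragePrecision(box_format="xyxy", class_metrics=True)

# The detector localizes the object almost perfectly, but predicts class 0
# ("person") while the annotation says class 1 ("man"): one FP plus one FN.
preds = [{
    "boxes": torch.tensor([[10.0, 10.0, 100.0, 120.0]]),
    "scores": torch.tensor([0.9]),
    "labels": torch.tensor([0]),
}]
targets = [{
    "boxes": torch.tensor([[12.0, 8.0, 98.0, 118.0]]),
    "labels": torch.tensor([1]),
}]

metric.update(preds, targets)
result = metric.compute()
print(result["map"], result["map_per_class"])  # mAP is 0 despite a good box
```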

bibekyess commented 2 years ago

Hi @Einstone-rose, did you figure out why the performance was so low? I also trained on 1000 custom images and the performance is very bad.

narchitect commented 11 months ago

Hi, I'm having the same issue. Did you solve this problem?

bibekyess commented 11 months ago

Hi @narchitect! As far as I remember, I was getting low performance because of an issue in my custom dataset: the bounding boxes were not scaled to match the image sizes in the training images. Once that was sorted, the results were good. Thank you! :)
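For anyone running into the same thing, here is a minimal sanity-check sketch for that kind of scaling mismatch (the function and its arguments are hypothetical placeholders for your own data pipeline, not part of this repo):

```python
# Minimal sketch: verify that annotated boxes fit their actual images and
# rescale them when the annotation was made at a different resolution.
# `ann_w`/`ann_h` are the width/height the boxes were annotated against.
from PIL import Image

def check_and_rescale(img_path, boxes, ann_w, ann_h):
    """boxes: list of [x1, y1, x2, y2] in the annotation's coordinate frame."""
    with Image.open(img_path) as im:
        real_w, real_h = im.size
    sx, sy = real_w / ann_w, real_h / ann_h
    fixed = []
    for x1, y1, x2, y2 in boxes:
        box = [x1 * sx, y1 * sy, x2 * sx, y2 * sy]
        # A box outside the image (or degenerate) signals a scaling bug.
        assert 0 <= box[0] < box[2] <= real_w and 0 <= box[1] < box[3] <= real_h, \
            f"bad box {box} in {img_path}"
        fixed.append(box)
    return fixed
```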

narchitect commented 11 months ago

@bibekyess Thanks for your reply! I will double-check my custom VG dataset. Did you also use https://github.com/danfeiX/scene-graph-TF-release to convert the dataset? As far as I checked, their data tool already resizes the bboxes to match the resized images, so I suspected the cause was that I removed ROI-head layers from the pretrained Faster R-CNN weights. Did you also remove some ROI-head layers when fine-tuning the model?

Actually, I removed these layers from the Faster R-CNN weights: `roi_heads.box.predictor.cls_score.weight`, `roi_heads.box.predictor.cls_score.bias`, `roi_heads.box.predictor.bbox_pred.weight`, and `roi_heads.box.predictor.bbox_pred.bias`.
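For reference, a minimal sketch of dropping exactly those keys from a checkpoint before fine-tuning (the file names are placeholders; the key list is the one above):

```python
# Minimal sketch: strip the ROI-head predictor weights listed above from a
# Faster R-CNN checkpoint so they are freshly initialized for a new label set.
# File names are placeholders for your own checkpoint paths.
import torch

ckpt = torch.load("faster_rcnn_pretrained.pth", map_location="cpu")
state_dict = ckpt.get("model", ckpt)  # some checkpoints nest under "model"

for key in [
    "roi_heads.box.predictor.cls_score.weight",
    "roi_heads.box.predictor.cls_score.bias",
    "roi_heads.box.predictor.bbox_pred.weight",
    "roi_heads.box.predictor.bbox_pred.bias",
]:
    state_dict.pop(key, None)

torch.save({"model": state_dict}, "faster_rcnn_headless.pth")
# When loading, use model.load_state_dict(state_dict, strict=False) so the
# missing predictor layers keep their fresh initialization.
```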

Thank you a lot!

bibekyess commented 11 months ago

Hi @narchitect! Yeah, I also used that repository to convert the dataset. I think I changed something in the Faster R-CNN architecture too, but I am not sure what. It has been more than a year since I did that project, and I have already left the lab where I was working, so I cannot check the code either! :(

narchitect commented 11 months ago

@bibekyess Oh, I see. It already helped a lot regardless! Thanks :)

bibekyess commented 11 months ago

@narchitect My pleasure! :)

narchitect commented 10 months ago

Sorry to bother you again, @bibekyess, but I was wondering if you could share the mAP (mean Average Precision) value of your final SGGen model.

No matter how much we tweak the dataset and fine-tune the pretrained Faster R-CNN model provided in this repository, we ultimately have to remove the bbox (bounding box) prediction layers. As a result, when training SGGen, the bbox detection layers are trained solely on our data, without the benefit of pretrained values. Consequently, our mAP doesn't seem to surpass 10%. For context, our dataset consists of 377 similar images and 23 classes, which isn't particularly ideal.

Therefore, my conclusion is that the best SGGen model one can obtain from this repository has an mAP of around 25%, and that, given the subpar quality of our data, an mAP of 12% is the best we can achieve with fine-tuned models that require bbox detection, like SGGen.

Out of curiosity, did the SGGen model you fine-tuned achieve an mAP higher than 25%?

bibekyess commented 10 months ago

Hi @narchitect! Sorry to hear about the performance issue in your setup. I don't remember my exact metric values, but I can say that they were definitely good. We trained on the Hope dataset and synthetic block datasets, and the Faster R-CNN detection results were very good. For scene graph detection, my only predicates were 'on' and 'clear', so it was not too complicated. You can see my demo scene-graph results on the block dataset here: https://drive.google.com/file/d/1sSWC129c15ZNmSCHixfx1hb_Kzl7wCW6/view?usp=sharing. It may be that my environment was simple (only blocks), which is why the results were good. But based on the values you shared, it seems something is wrong. Maybe you can try with another, better-quality dataset?

Thank you and I hope you solve your issue soon. :)
