Einstone-rose opened this issue 4 years ago
Hi @Einstone-rose, did you figure out why the performance was so low? I also trained on 1000 custom images and the performance is very bad.
Hi, I'm having the same issue. Did you solve this problem?
Hi @narchitect! As far as I remember, I was getting low performance because of an issue in my custom dataset: there was a bounding-box scaling mismatch with respect to the image size in the training images. Once that was sorted, the results were good. Thank you! :)
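For anyone debugging the same thing, a rough sanity check along these lines might help. This is only a sketch and it assumes the VG-SGG.h5 layout produced by the scene-graph-TF-release data tools (a `boxes_1024` dataset holding `cx, cy, w, h` scaled so the longer image side is 1024, plus an `image_data.json` with each image's original width/height); adjust the keys and file names to your own setup:

```python
import json
import h5py
import numpy as np

BOX_SCALE = 1024  # scale the boxes are assumed to be stored at

roi = h5py.File("VG-SGG.h5", "r")              # assumed output of the data tools
img_info = json.load(open("image_data.json"))  # assumed per-image width/height

boxes = roi["boxes_%i" % BOX_SCALE][:]         # (num_boxes, 4) as cx, cy, w, h
first_box = roi["img_to_first_box"][:]
last_box = roi["img_to_last_box"][:]

bad_images = 0
for i, info in enumerate(img_info):
    if first_box[i] < 0:                       # image without any boxes
        continue
    w, h = info["width"], info["height"]
    scale = BOX_SCALE / max(w, h)
    b = boxes[first_box[i]:last_box[i] + 1].astype(np.float64) / scale
    # convert cx, cy, w, h back to corner coordinates in original pixels
    x1, y1 = b[:, 0] - b[:, 2] / 2, b[:, 1] - b[:, 3] / 2
    x2, y2 = b[:, 0] + b[:, 2] / 2, b[:, 1] + b[:, 3] / 2
    if (x1 < -1).any() or (y1 < -1).any() or (x2 > w + 1).any() or (y2 > h + 1).any():
        bad_images += 1

print("images with out-of-bounds boxes:", bad_images)
```

If many images report out-of-bounds boxes, the annotations were probably not rescaled consistently with the images.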
@bibekyess Thanks for your reply! I will double-check my custom VG dataset. Did you also use this repository, https://github.com/danfeiX/scene-graph-TF-release, to convert the dataset? When I checked before, their data tool also rescales the bbox coordinates to match the resized image, so I suspected the problem was instead that I removed the ROI-head layers from the pretrained Faster R-CNN weights. Did you also remove some ROI-head layers when fine-tuning the model?
Actually, I removed these layers from the Faster R-CNN weights: `roi_heads.box.predictor.cls_score.weight`, `roi_heads.box.predictor.cls_score.bias`, `roi_heads.box.predictor.bbox_pred.weight`, `roi_heads.box.predictor.bbox_pred.bias`.
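For reference, a minimal sketch of how one could drop those keys from the checkpoint before loading. It assumes a maskrcnn-benchmark-style checkpoint saved as a dict with a `model` state dict, and the file names are just placeholders:

```python
import torch

# load the pretrained detector checkpoint (file names are placeholders)
ckpt = torch.load("pretrained_faster_rcnn.pth", map_location="cpu")
state_dict = ckpt["model"] if "model" in ckpt else ckpt

# the ROI-head keys listed above; they must be re-initialized when the
# number of object classes changes
drop_keys = [
    "roi_heads.box.predictor.cls_score.weight",
    "roi_heads.box.predictor.cls_score.bias",
    "roi_heads.box.predictor.bbox_pred.weight",
    "roi_heads.box.predictor.bbox_pred.bias",
]
for k in list(state_dict.keys()):
    # some checkpoints prefix keys with "module.", so match by suffix
    if any(k.endswith(d) for d in drop_keys):
        del state_dict[k]

if "model" in ckpt:
    ckpt["model"] = state_dict
torch.save(ckpt, "pretrained_faster_rcnn_no_box_head.pth")
```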
Thank you a lot!
Hi @narchitect! Yeah, I also used that repository to convert the dataset. I think I changed something in the Faster R-CNN architecture too, but I'm not sure what. It has been more than a year since I did that project and I have already left the lab where I was working, so I can't check the code either! :(
@bibekyess oh I see. But it already really helped! Thanks a lot :)
@narchitect My Pleasure! :)
Sorry to bother you again, @bibekyess, but I was wondering if you could share the mAP (mean Average Precision) value for your final SGGen model.
No matter how much we tweak the dataset and fine-tune the pretrained Faster R-CNN model provided in this repository, we ultimately have to remove the bbox (bounding box) layers. As a result, when training SGGen, the bbox detection layers are trained solely on our data, without the benefit of pretrained values. Consequently, our mAP doesn't seem to surpass 10%. Just to give you some context, our dataset consists of 377 similar images and 23 classes, which isn't particularly ideal.
Therefore, my conclusion is that the best SGGen model we can obtain from this repository has an mAP of around 25%. Given the subpar quality of our data, I believe achieving an mAP of 12% in fine-tuned models that require bbox detection, like SGGen, is the best we can do.
Out of curiosity, did the SGGen model you fine-tuned achieve an mAP higher than 25%?
Hi @narchitect! Sorry to hear about the performance issue on your setup. I don't remember my performance metric values, but I can say that they were definitely good. We trained on the Hope dataset and synthetic block datasets, and the Faster R-CNN detection results were very good. For scene-graph detection, my predicates were only 'on' and 'clear', so it was not too complicated. You can see my demo scene-graph results on the block datasets here: https://drive.google.com/file/d/1sSWC129c15ZNmSCHixfx1hb_Kzl7wCW6/view?usp=sharing. It may be that my environment was simple (only blocks), which is why the results were good. But based on the values you shared, it seems something is wrong. Maybe you can try with other, better-quality datasets?
Thank you, and I hope you solve your issue soon. :)
❓ Questions and Help
Hi, I generated a new VG dataset (1750-700-400) following this repo https://github.com/danfeiX/scene-graph-TF-release and used the script generate_attribute_labels.py you provide. Then we trained the detector model on this new dataset (1750-700-400) and tested it, but the evaluation result is relatively low: we only get 0.06 mAP. I wonder whether this is a normal phenomenon or not.

I also observed similarly low performance in this repo https://github.com/peteanderson80/bottom-up-attention when trained on the 1600-400-20 VG dataset (see https://github.com/peteanderson80/bottom-up-attention#expected-detection-results-for-the-pretrained-model); it only achieves 0.102 mAP. The author says: "mAP is relatively low because many classes overlap (e.g. person / man / guy), some classes can't be precisely located (e.g. street, field) and separate classes exist for singular and plural objects (e.g. person / people). We focus on performance in downstream tasks (e.g. image captioning, VQA) rather than detection performance."

So far, I have two questions:
1. How do you explain this phenomenon?
2. Does low detection performance have a negative effect on downstream tasks (e.g. image captioning, VQA)?