cdjkim / PRS

Official Repository for our ECCV2020 paper: Imbalanced Continual Learning with Partitioning Reservoir Sampling
MIT License

large performance gap between paper and code #15

Closed valencebond closed 2 years ago

valencebond commented 2 years ago

Thanks for releasing your code. I tried to reproduce the performance reported in the paper, but after running it 5 times the results are still far from the reported numbers. Is the "tally_major" result the same as the majority result in Table 1? Please correct me if I've misunderstood something. Here is the output I get:

```
(summary after training task 2; tasks listed in the original log order)

major result
          CF1     CP      CR      OF1     OP      OR      mAP
  task0   0.3003  1.0000  0.1767  0.3003  1.0000  0.1767  0.6256
  task3   0.7421  1.0000  0.5900  0.7421  1.0000  0.5900  0.9426
  task1   0.5238  0.8078  0.3875  0.5326  0.8516  0.3875  0.7208
  task2   0.7672  0.7852  0.7500  0.7543  0.7587  0.7500  0.8588

moderate result
          CF1     CP      CR      OF1     OP      OR      mAP
  task0   0.2722  0.7930  0.1643  0.2706  0.7667  0.1643  0.5187
  task3   0.4882  0.9807  0.3250  0.4887  0.9848  0.3250  0.7482
  task1   0.3689  0.8067  0.2391  0.3779  0.9007  0.2391  0.5945
  task2   0.7332  0.9186  0.6100  0.7318  0.9143  0.6100  0.8392

minor result
          CF1     CP      CR      OF1     OP      OR      mAP
  task0   0.2349  0.5814  0.1471  0.2455  0.7410  0.1471  0.3489
  task3   0.1962  0.7674  0.1125  0.1978  0.8182  0.1125  0.4524
  task1   0.4559  0.7925  0.3200  0.4434  0.7218  0.3200  0.6103
  task2   NaN     NaN     NaN     NaN     NaN     NaN     NaN

OF1 result  task0 0.2669  task3 0.5040  task1 0.4271  task2 0.7440
mAP result  task0 0.4825  task3 0.7192  task1 0.6252  task2 0.8490

CF1 result (per class)
  task0 (0.2653): apple 0.4557, baseball bat 0.0198, bear 0.2456, bench 0.1651, bicycle 0.4545, bird 0.2759, book 0.0000, bottle 0.0000, bowl 0.2203, couch 0.2381, dining table 0.2113, donut 0.0769, fork 0.0917, horse 0.0392, kite 0.3443, orange 0.4559, parking meter 0.4776, potted plant 0.0385, refrigerator 0.3750, sandwich 0.2564, sink 0.3780, spoon 0.2836, teddy bear 0.3710, train 0.5185
  task3 (0.4973): bed 0.4580, broccoli 0.3548, bus 0.6667, carrot 0.1239, clock 0.7879, elephant 0.7730, handbag 0.0571, hot dog 0.1481, scissors 0.1818, stop sign 0.8506, suitcase 0.1651, surfboard 0.3740, toothbrush 0.4186, zebra 0.9189
  task1 (0.4215): backpack 0.0189, baseball glove 0.4733, cake 0.2301, car 0.1239, chair 0.0381, cow 0.3871, fire hydrant 0.6443, laptop 0.4586, motorcycle 0.5915, oven 0.5767, remote 0.2105, sports ball 0.3051, toilet 0.8136, traffic light 0.5211, truck 0.7459, tv 0.2344, umbrella 0.2609, vase 0.4000
  task2 (0.7563): airplane 0.8877, banana 0.6207, boat 0.8000, cat 0.6456, dog 0.5875, frisbee 0.7463, giraffe 0.9263, pizza 0.8588, sheep 0.8543, skateboard 0.6603, skis 0.7000, snowboard 0.5625, tennis racket 0.8229, tie 0.6667

tally (overall up to task 2)
            CF1     CP      CR      OF1     OP      OR      mAP
  major     0.6022  0.6831  0.5383  0.5147  0.4931  0.5383  0.6377
  moderate  0.4120  0.6912  0.2934  0.3925  0.5925  0.2934  0.4552
  minor     0.2642  0.5459  0.1743  0.2658  0.5596  0.1743  0.2868
  tally_CF1 0.4423   tally_OF1 0.4127   tally_mAP 0.4685

forget result
  overall            CF1 0.3537   OF1 0.3498   mAP 0.0789
  task0              CF1 0.4905   OF1 0.4872   mAP 0.1326
  task1              CF1 0.2764   OF1 0.2714   mAP 0.0456
  task3              CF1 0.2943   OF1 0.2908   mAP 0.0586
  major overall      CF1 0.3408   OF1 0.3296   mAP 0.0889
  moderate overall   CF1 0.3968   OF1 0.3986   mAP 0.0876
  minor overall      CF1 0.1521   OF1 0.1426   mAP 0.0519
  major    per task  task0 CF1 0.5716 OF1 0.5655 mAP 0.2036 | task1 CF1 0.2516 OF1 0.2252 mAP 0.0404 | task3 CF1 0.1993 OF1 0.1980 mAP 0.0226
  moderate per task  task0 CF1 0.5292 OF1 0.5265 mAP 0.1526 | task1 CF1 0.3417 OF1 0.3520 mAP 0.0478 | task3 CF1 0.3193 OF1 0.3174 mAP 0.0625
  minor    per task  task0 CF1 0.0805 OF1 0.0392 mAP -0.0073 | task1 CF1 -0.1711 OF1 -0.1465 mAP 0.0456 | task3 CF1 0.5469 OF1 0.5349 mAP 0.1173
  total_task2        0.2309

Progressing to Task 4
```
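For context on the forget numbers in the log above: a common continual-learning definition of forgetting (the repository's implementation may differ in details) is the gap between a task's best earlier score and its final score, averaged over all previously seen tasks. A minimal sketch:

```python
def forgetting(scores_over_time):
    """scores_over_time: one dict per checkpoint mapping task -> metric.

    Forgetting of a task = (best score at any earlier checkpoint) minus
    (score at the last checkpoint); overall forgetting averages this over
    every task that has at least one earlier checkpoint.
    """
    final = scores_over_time[-1]
    per_task = {}
    for task, last in final.items():
        past = [snap[task] for snap in scores_over_time[:-1] if task in snap]
        if past:  # the task learned last has nothing earlier to forget from
            per_task[task] = max(past) - last
    overall = sum(per_task.values()) / len(per_task)
    return per_task, overall

# Hypothetical CF1 history over three checkpoints (one per trained task).
history = [
    {"task0": 0.90},
    {"task0": 0.60, "task1": 0.80},
    {"task0": 0.50, "task1": 0.70, "task2": 0.85},
]
per_task, overall = forgetting(history)  # task0 forgot ~0.40, task1 ~0.10
```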

valencebond commented 2 years ago

What is the difference between the test and teorg settings? Thanks.

cdjkim commented 2 years ago

Hello!

Firstly, thanks for your keen interest in our work. We ran the experiments again and were able to reproduce the result you enquired about (to be specific, we got results even better than those reported in the paper!).

Here are some things we ask you to carefully try:

  1. Please refer to our paper and try running the experiments again, keeping in mind Figure 6 for the choice of the q_poa hyperparameter.
  2. Make sure to follow the environment setup instructions as detailed here.

For your second question, the difference between "test" and "teorg" is that while "test" uses a balanced test dataset (this is the evaluation set used in Table 1), "teorg" maintains the imbalanced distribution in the test setting as well.

Sincerely, Chris
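To make the balanced-vs-imbalanced distinction concrete, here is a minimal sketch of how a balanced test split could be built (not the repository's actual code; the single-label grouping and `per_class` cap are simplifying assumptions — the real data is multi-label, which needs per-label quotas):

```python
import random
from collections import defaultdict

def balanced_test_split(samples, per_class=50, seed=0):
    """Subsample at most `per_class` examples per label.

    `samples` is a list of (item, label) pairs. A "teorg"-style
    evaluation would use `samples` unchanged, preserving the original
    long-tailed class distribution.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item, label in samples:
        by_label[label].append(item)
    balanced = []
    for label, items in by_label.items():
        rng.shuffle(items)
        balanced.extend((item, label) for item in items[:per_class])
    return balanced

# A toy long-tailed set: 100 "cat" images vs 5 "zebra" images.
imbalanced = [(i, "cat") for i in range(100)] + [(i, "zebra") for i in range(5)]
balanced = balanced_test_split(imbalanced, per_class=5)
```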

valencebond commented 2 years ago

Thanks for your quick reply! I followed the README exactly. I notice that the best performance is achieved when q_poa = 0.1, as shown in Fig. 6; however, the q_poa in the config file is -0.03: https://github.com/cdjkim/PRS/blob/136cee1863af03cc914dc05dfd41bda8b7bc0bf2/code/configs/mlab_prs-coco.yaml#L35

Should I change this setting?
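If you end up sweeping q_poa, one low-tech option is a textual override of the config (a sketch assuming the entry stays a flat `key: value` YAML line; `override_q_poa` is a hypothetical helper, not part of the repository):

```python
import re

def override_q_poa(config_text, value):
    """Replace the `q_poa:` entry in a flat YAML config string."""
    new_text, n = re.subn(r"(?m)^(\s*q_poa:\s*).*$", rf"\g<1>{value}", config_text)
    if n != 1:
        raise ValueError(f"expected exactly one q_poa entry, found {n}")
    return new_text

cfg = "lr: 0.001\nq_poa: -0.03\nbatch_size: 10\n"
print(override_q_poa(cfg, 0.1))  # q_poa line becomes "q_poa: 0.1"
```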

valencebond commented 2 years ago

It would be very kind of you to provide the config.yaml used for your experiments.

valencebond commented 2 years ago

One more question: what is the difference between results_tally_map and results_map? If I understand correctly, the reported performance is based on results_xx, not results_tally_xx.

cdjkim commented 2 years ago

results_{mAP}.csv contains the performance on the previous tasks after each consecutive task finishes training. results_tally_{mAP}.csv contains the tally (overall) performance up to each finished training task.

If the files are causing confusion, we recommend looking at the summary.txt only.

The Overall performance reported in Table 1 corresponds to the tally_{CF1, OF1, mAP} results in the summary.txt, and the same goes for the majority, moderate, minority performances (data.py could help you further clarify this).

Chris
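As background on the metric names in these files (my reading of the usual multi-label conventions; the repository's data.py is authoritative): C-P/C-R/C-F1 average precision and recall per class first, while O-P/O-R/O-F1 pool true/false positives across all classes before computing F1. A sketch:

```python
def f1(p, r):
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

def cf1_of1(per_class_counts):
    """per_class_counts: list of (tp, fp, fn) tuples, one per class."""
    # C-F1: average per-class precision and recall, then take F1 (macro style).
    cps = [tp / (tp + fp) if tp + fp else 0.0 for tp, fp, fn in per_class_counts]
    crs = [tp / (tp + fn) if tp + fn else 0.0 for tp, fp, fn in per_class_counts]
    cp, cr = sum(cps) / len(cps), sum(crs) / len(crs)
    # O-F1: pool counts over all classes first (micro style).
    tp = sum(c[0] for c in per_class_counts)
    fp = sum(c[1] for c in per_class_counts)
    fn = sum(c[2] for c in per_class_counts)
    op, orec = tp / (tp + fp), tp / (tp + fn)
    return f1(cp, cr), f1(op, orec)

# One frequent, well-learned class and one rare, poorly-learned class:
# the micro O-F1 is dominated by the frequent class, the macro C-F1 is not.
cf1, of1 = cf1_of1([(90, 10, 10), (1, 9, 9)])
```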

valencebond commented 2 years ago

> results_{mAP}.csv contains the performance on the previous tasks after each consecutive task finishes training. results_tally_{mAP}.csv contains the tally (overall) performance up to each finished training task.
>
> If the files are causing confusion, we recommend looking at the summary.txt only.
>
> The Overall performance reported in Table 1 corresponds to the tally_{CF1, OF1, mAP} results in the summary.txt, and the same goes for the majority, moderate, minority performances (data.py could help you further clarify this).
>
> Chris

Thanks for your suggestion!