ajabri / videowalk

Repository for "Space-Time Correspondence as a Contrastive Random Walk" (NeurIPS 2020)
http://ajabri.github.io/videowalk
MIT License

Low performance with pretrained.pth #32

Closed lorenmt closed 3 years ago

lorenmt commented 3 years ago

Hi,

I recently ran your pre-trained model on DAVIS 2017 with the exact same command listed in the README:

python test.py --filelist /path/to/davis/vallist.txt \
  --model-type scratch --resume ../pretrained.pth --save-path /save/path \
  --topk 10 --videoLen 20 --radius 12 --temperature 0.05 --cropSize -1

However, the final performance based on the official DAVIS evaluation script is not as good as the one claimed in the paper: I got around 61 J&F-Mean. The detailed performance is listed below:

--------------------------- Global results for val ---------------------------
 J&F-Mean   J-Mean  J-Recall  J-Decay   F-Mean  F-Recall  F-Decay
 0.614429 0.584634  0.686656 0.225137 0.644223  0.763603 0.256438

---------- Per sequence results for val ----------
            Sequence   J-Mean   F-Mean
      bike-packing_1 0.496049 0.711096
      bike-packing_2 0.685996 0.752332
         blackswan_1 0.934492 0.973339
         bmx-trees_1 0.301675 0.770057
         bmx-trees_2 0.644392 0.845591
        breakdance_1 0.666383 0.676260
             camel_1 0.747073 0.855923
    car-roundabout_1 0.852337 0.714172
        car-shadow_1 0.807822 0.778809
              cows_1 0.920527 0.956957
       dance-twirl_1 0.549648 0.593753
               dog_1 0.851405 0.867017
         dogs-jump_1 0.302670 0.435166
         dogs-jump_2 0.536664 0.599638
         dogs-jump_3 0.788082 0.822245
     drift-chicane_1 0.729466 0.786235
    drift-straight_1 0.526541 0.528944
              goat_1 0.800556 0.734920
         gold-fish_1 0.721810 0.717445
         gold-fish_2 0.659471 0.700005
         gold-fish_3 0.820182 0.845394
         gold-fish_4 0.848312 0.915238
         gold-fish_5 0.879084 0.878996
    horsejump-high_1 0.773536 0.888244
    horsejump-high_2 0.723407 0.944909
             india_1 0.631993 0.592968
             india_2 0.567645 0.560544
             india_3 0.629983 0.627841
              judo_1 0.760509 0.765048
              judo_2 0.749010 0.756075
         kite-surf_1 0.270090 0.267305
         kite-surf_2 0.004306 0.062131
         kite-surf_3 0.093566 0.127047
          lab-coat_1 0.000000 0.000000
          lab-coat_2 0.000000 0.000300
          lab-coat_3 0.000000 0.000000
          lab-coat_4 0.000000 0.000000
          lab-coat_5 0.000000 0.000000
             libby_1 0.803691 0.920149
           loading_1 0.900133 0.875399
           loading_2 0.383891 0.567959
           loading_3 0.682442 0.716217
       mbike-trick_1 0.571612 0.743456
       mbike-trick_2 0.639744 0.669962
    motocross-jump_1 0.340788 0.395740
    motocross-jump_2 0.519756 0.554731
paragliding-launch_1 0.819913 0.923513
paragliding-launch_2 0.645564 0.885479
paragliding-launch_3 0.034370 0.137811
           parkour_1 0.805982 0.893970
              pigs_1 0.812613 0.764461
              pigs_2 0.617975 0.750136
              pigs_3 0.906452 0.882834
     scooter-black_1 0.389385 0.669319
     scooter-black_2 0.722495 0.675855
          shooting_1 0.270579 0.454346
          shooting_2 0.747166 0.661882
          shooting_3 0.753406 0.872043
           soapbox_1 0.785921 0.778360
           soapbox_2 0.647941 0.710407
           soapbox_3 0.586195 0.741657

I am wondering whether this is the expected performance without test-time adaptation? Or could you list a detailed step-by-step procedure so we can reproduce the results more easily?

Thanks.

ajabri commented 3 years ago

Hi @lorenmt,

The sequence of commands is provided in https://github.com/ajabri/videowalk#davis. After you run test.py, you need to run the following commands, the latter of which calls the official davis2017-evaluation repository:

# Convert
python eval/convert_davis.py --in_folder /save/path/ --out_folder /converted/path --dataset /davis/path/

# Compute metrics
python /path/to/davis2017-evaluation/evaluation_method.py \
--task semi-supervised   --results_path /converted/path --set val \
--davis_path /path/to/davis/

Is this what you did to obtain your results above?

ajabri commented 3 years ago

Also, IIRC, the davis-2017 and davis2017-evaluation repositories expect the inference output file names to be indexed differently (0-indexed vs. 1-indexed).

So, if you use davis-2017, you should change line 63 in my convert_davis.py script from 'j' to 'j+1'.
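
Schematically, the naming logic looks something like the sketch below; the function and variable names here are illustrative, not the actual source of convert_davis.py:

import os

# Sketch (not the actual convert_davis.py code): save per-frame masks with
# the names the evaluator expects. davis2017-evaluation wants 0-indexed
# names (00000.png, 00001.png, ...); the older davis-2017 repo wants
# 1-indexed names, which is why 'j' becomes 'j + 1' there.
def save_masks(pred_masks, out_folder, seq_name, one_indexed=False):
    os.makedirs(os.path.join(out_folder, seq_name), exist_ok=True)
    for j, mask in enumerate(pred_masks):  # mask: a PIL.Image
        idx = j + 1 if one_indexed else j
        mask.save(os.path.join(out_folder, seq_name, '{:05d}.png'.format(idx)))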

lorenmt commented 3 years ago

Hi Allan,

Thanks for your quick reply. I revised the raw results with my own code below:

import glob
import os

import numpy as np
from PIL import Image

# Read the list of val sequences.
file_list = os.path.join('dataset/DAVIS', 'ImageSets', '2017/val.txt')
videos = []
with open(file_list, 'r') as frame_set:
    for f in frame_set:
        videos.append(f.rstrip('\n'))

palette = np.loadtxt('palette.txt', dtype=np.uint8).reshape(-1, 3)

for i in range(30):  # 30 sequences in the DAVIS 2017 val set
    a = glob.glob('dataset/davis_corr/{}_*_mask.png'.format(i))
    os.makedirs('dataset/davis_videowalk/{}'.format(videos[i]), exist_ok=True)

    for k in range(len(a)):
        im = Image.open('dataset/davis_corr/{}_{}_mask.png'.format(i, k))
        im = np.array(im)
        # Map each RGB color found in this frame to an object index.
        label = np.unique(im.reshape(-1, 3), axis=0)

        im_ = np.zeros((im.shape[0], im.shape[1]), dtype=np.uint8)
        for kk in range(len(label)):
            mask = (im == label[kk]).all(-1)
            im_[mask] = kk

        # Save as a palettized PNG in the DAVIS naming convention.
        im = Image.fromarray(im_)
        im.putpalette(palette.ravel())
        im.save('dataset/davis_videowalk/{}/{:05d}.png'.format(videos[i], k))

That reorganizes the raw results into the DAVIS format. Here is what I obtained, which you can download here: https://www.dropbox.com/sh/1cr85dyxeeptk0k/AACBoXYIo2noUMFWBHZ8HD0-a?dl=0

Finally, I ran

python evaluation_method.py --task semi-supervised --results_path ../../dataset/davis_reco

which produces the performance I provided in the first comment.

From the visual results I obtained, the outputs do indeed look worse than the results you show in the video. I would be really grateful if you could look further into this and check whether the generated results are OK.

Best,

Additional note: I am quite confident that the evaluation script is correct, since I used the same script to evaluate STM and got its reported performance.

lorenmt commented 3 years ago

Further note: I am sorry, I found that the lab-coat index is wrong (after 8x down-sampling, two of the objects completely disappear). I will fix the issue, rerun the script, and update the results here.
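
As a quick sanity check, something like the following sketch shows the effect; the annotation path is just an example:

import numpy as np
from PIL import Image

# Quick check of which object labels survive 8x down-sampling.
# DAVIS annotations are palettized PNGs, so np.array yields an (H, W)
# map of object indices; the path below is just an example.
ann = Image.open('dataset/DAVIS/Annotations/480p/lab-coat/00000.png')
small = ann.resize((ann.width // 8, ann.height // 8), Image.NEAREST)
print(np.unique(np.array(ann)))    # all object IDs at full resolution
print(np.unique(np.array(small)))  # small objects can vanish at 1/8 scale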

lorenmt commented 3 years ago

Hello, here are the updated results:

--------------------------- Global results for val ---------------------------
 J&F-Mean  J-Mean  J-Recall  J-Decay   F-Mean  F-Recall  F-Decay
 0.657716 0.62911  0.735837 0.223777 0.686321  0.812783 0.269499

---------- Per sequence results for val ----------
            Sequence   J-Mean   F-Mean
      bike-packing_1 0.496049 0.711096
      bike-packing_2 0.685996 0.752332
         blackswan_1 0.934492 0.973339
         bmx-trees_1 0.301675 0.770057
         bmx-trees_2 0.644392 0.845591
        breakdance_1 0.666383 0.676260
             camel_1 0.747073 0.855923
    car-roundabout_1 0.852337 0.714172
        car-shadow_1 0.807822 0.778809
              cows_1 0.920527 0.956957
       dance-twirl_1 0.549648 0.593753
               dog_1 0.851405 0.867017
         dogs-jump_1 0.302670 0.435166
         dogs-jump_2 0.536664 0.599638
         dogs-jump_3 0.788082 0.822245
     drift-chicane_1 0.729466 0.786235
    drift-straight_1 0.526541 0.528944
              goat_1 0.800556 0.734920
         gold-fish_1 0.721810 0.717445
         gold-fish_2 0.659471 0.700005
         gold-fish_3 0.820182 0.845394
         gold-fish_4 0.848312 0.915238
         gold-fish_5 0.879084 0.878996
    horsejump-high_1 0.773536 0.888244
    horsejump-high_2 0.723407 0.944909
             india_1 0.631993 0.592968
             india_2 0.567645 0.560544
             india_3 0.629983 0.627841
              judo_1 0.760509 0.765048
              judo_2 0.749010 0.756075
         kite-surf_1 0.270090 0.267305
         kite-surf_2 0.004306 0.062131
         kite-surf_3 0.093566 0.127047
          lab-coat_1 0.000000 0.000000
          lab-coat_2 0.000000 0.000000
          lab-coat_3 0.932124 0.895322
          lab-coat_4 0.914726 0.837048
          lab-coat_5 0.866172 0.835881
             libby_1 0.803691 0.920149
           loading_1 0.900133 0.875399
           loading_2 0.383891 0.567959
           loading_3 0.682442 0.716217
       mbike-trick_1 0.571612 0.743456
       mbike-trick_2 0.639744 0.669962
    motocross-jump_1 0.340788 0.395740
    motocross-jump_2 0.519756 0.554731
paragliding-launch_1 0.819913 0.923513
paragliding-launch_2 0.645564 0.885479
paragliding-launch_3 0.034370 0.137811
           parkour_1 0.805982 0.893970
              pigs_1 0.812613 0.764461
              pigs_2 0.617975 0.750136
              pigs_3 0.906452 0.882834
     scooter-black_1 0.389385 0.669319
     scooter-black_2 0.722495 0.675855
          shooting_1 0.270579 0.454346
          shooting_2 0.747166 0.661882
          shooting_3 0.753406 0.872043
           soapbox_1 0.785921 0.778360
           soapbox_2 0.647941 0.710407
           soapbox_3 0.586195 0.741657

Now the results look similar to your reported ones, about 2 points lower in J&F-Mean. So is this the pre-trained performance without online adaptation? And with online adaptation, should we expect to reach 67 J&F-Mean? Thanks!

ajabri commented 3 years ago

No, the result reported (67.6 J&F-Mean) is without online adaptation. So there still seems to be a gap...

I am not sure where it is coming from, but I will have a chance to investigate in the next few days; the only difference seems to be using your raw conversion code vs. the code I provided in convert_davis.py.

lorenmt commented 3 years ago

Hi Allan,

I think I found the mistake. Again, it is an object-index error: when some frames only predict a subset of the objects, the indices are not mapped correctly. As a sanity check, I re-evaluated with your convert_davis.py code and got 67.4 J&F-Mean, which is within a reasonable range of uncertainty. Really sorry for my mistake, and thank you for your time and replies.
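
In case anyone hits the same issue: the idea behind the fix is to build the color-to-index mapping once, from a frame that contains every object, instead of calling np.unique per frame. A minimal sketch, with an example path:

import numpy as np
from PIL import Image

# Build the color -> object-index mapping once, from a frame known to
# contain every object (e.g. the first mask of the sequence), so indices
# stay stable even when an object is missing from a later frame.
# The path below is just an example.
ref = np.array(Image.open('dataset/davis_corr/0_0_mask.png'))
colors = np.unique(ref.reshape(-1, 3), axis=0)

def to_index_mask(rgb_mask):
    out = np.zeros(rgb_mask.shape[:2], dtype=np.uint8)
    for kk, color in enumerate(colors):
        out[(rgb_mask == color).all(-1)] = kk
    return out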