EGO4D / forecasting


Future Hand Prediction: Is the mask multiplied by the prediction we submit? #19

Open masashi-hatano opened 2 years ago

masashi-hatano commented 2 years ago

I tried submitting a JSON file that follows the specified format, and I obtained the following quantitative result:

{"L_MDisp": 211.9670281732144, "R_MDisp": 276.70152097706693, "L_CDisp": 207.87675657595398, "R_CDisp": 271.26115030262525, "Total": 967.8064560288606}

However, even though the results we obtained on the validation dataset were better than the baseline, the results from the actual submission show very large errors. This is probably because the predictions we submit are not multiplied by the mask. The mask is used so that the error is zero on frames in which the hand is not visible.

To demonstrate that the quantitative results presented above are anomalous, here is a prediction list (part of my submission.json file) along with its visualization.

As you can see from these figures, the quantitative results obtained from the actual submission seem to be incorrect, and the likely reason is that the loss is calculated without multiplying the predictions by the masks.

@VJWQ Could you please confirm that the loss calculation is done correctly? In particular, I would appreciate it if you could check whether the error is set to zero when the hands are not in the frame.

"2152_3837": [120.74557495117188, 84.73670959472656, 235.1125030517578, 93.4263687133789, 118.04257202148438, 86.06185150146484, 230.081787109375, 91.89846801757812, 125.53624725341797, 88.14488220214844, 230.46359252929688, 94.43958282470703, 122.34292602539062, 88.79545593261719, 225.5665740966797, 91.5564193725586, 122.0747299194336, 94.3060531616211, 217.82423400878906, 99.28343963623047]

[Figures 1-5: visualizations at frame 003838 (pre_45 frame), 003853 (pre_30 frame), 003868 (pre_15 frame), 003883 (pre_frame), and 003898 (contact_frame)]
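
As a side note, here is one quick way to inspect such a 1x20 prediction vector. The per-frame layout assumed below, (left_x, left_y, right_x, right_y) for each of the five key frames pre_45, pre_30, pre_15, pre_frame, and contact_frame, is my assumption for illustration only and is not stated in the thread.

    # Sketch only: the (left_x, left_y, right_x, right_y)-per-frame ordering is
    # an assumed layout for illustration, not confirmed by the challenge docs.
    import numpy as np

    FRAME_NAMES = ["pre_45", "pre_30", "pre_15", "pre_frame", "contact_frame"]

    pred = np.array([
        120.74557495117188, 84.73670959472656, 235.1125030517578, 93.4263687133789,
        118.04257202148438, 86.06185150146484, 230.081787109375, 91.89846801757812,
        125.53624725341797, 88.14488220214844, 230.46359252929688, 94.43958282470703,
        122.34292602539062, 88.79545593261719, 225.5665740966797, 91.5564193725586,
        122.0747299194336, 94.3060531616211, 217.82423400878906, 99.28343963623047,
    ])

    # one row of four values per key frame under the assumed layout
    for name, (lx, ly, rx, ry) in zip(FRAME_NAMES, pred.reshape(5, 4)):
        print(f"{name}: left=({lx:.1f}, {ly:.1f})  right=({rx:.1f}, {ry:.1f})")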

VJWQ commented 2 years ago

hi @masashi-hatano happy to have you as our participant! Things run as expected on my side. For your information, our baseline also gives 1x20 non-zero prediction results for sample "2152_3837", which means it is fine to have non-zero predictions on frames without hands. As we mentioned on the challenge page, **"Our evaluation script won't penalize your algorithm if it gives predictions on frames without hands."** I suggest you revisit our sample evaluation code to understand how our metrics work. Specifically, you'll see how we filter out the out-of-frame hand situation in L80 to make sure it won't influence the submission. Please also make sure to use our script generate_submission.py to generate the submission file, and don't forget to take care of the **num_clips=30** argument. Feel free to ask if you are still blocked! Happy to help :)
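
As a rough illustration of the out-of-frame filtering described above (a minimal sketch with my own naming, not the repository's actual eval.py), a masked displacement metric simply drops frames whose ground-truth hand is absent before averaging:

    # Minimal sketch of a masked displacement error: predictions on frames whose
    # ground-truth hand is absent are filtered out before the error is computed,
    # so non-zero predictions there cannot hurt the score.
    import numpy as np

    def mean_displacement(preds, gts, visible):
        """preds, gts: (N, 2) predicted / ground-truth hand positions (pixels).
        visible: (N,) boolean mask, True where the hand is annotated in frame."""
        if not visible.any():
            return 0.0
        diffs = preds[visible] - gts[visible]   # out-of-frame hands are skipped
        return float(np.linalg.norm(diffs, axis=1).mean())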

masashi-hatano commented 2 years ago

@VJWQ Thanks for your reply! I solved this problem by using num_clips=30, and the evaluation now runs correctly.

However, I don't really understand why num_clips is needed. According to the sample evaluation code, num_clips is used only to divide the predicted values. I would appreciate it if you could explain this.

VJWQ commented 2 years ago

> @VJWQ Thanks for your reply! I solved this problem by using num_clips=30, and the evaluation now runs correctly.
>
> However, I don't really understand why num_clips is needed. According to the sample evaluation code, num_clips is used only to divide the predicted values. I would appreciate it if you could explain this.

Sure, here is the explanation: the number 30 comes from cfg.TEST.NUM_ENSEMBLE_VIEWS * cfg.TEST.NUM_SPATIAL_CROPS, an ensemble of test views used to better evaluate the robustness of the model. In short, we need to divide by 30 when generating the submission file to obtain the average performance on each test clip.
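
For intuition, here is a minimal sketch of the sum-then-divide averaging over the 30 test views (illustrative names and structure, not the exact repository code):

    # Sketch of multi-view averaging: each test clip is predicted
    # num_clips = NUM_ENSEMBLE_VIEWS * NUM_SPATIAL_CROPS times; summing the
    # 1x20 outputs and dividing once by num_clips gives the averaged prediction.
    from collections import defaultdict
    import numpy as np

    def average_over_views(per_view_preds, num_clips=30):
        """per_view_preds: list of {clip_id: 1x20 prediction array}, one dict per view."""
        accumulated = defaultdict(lambda: np.zeros(20))
        for view in per_view_preds:
            for clip_id, pred in view.items():
                accumulated[clip_id] += np.asarray(pred)   # sum across all views
        return {clip_id: total / num_clips                 # the single /num_clips
                for clip_id, total in accumulated.items()}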

takfate commented 2 years ago

@VJWQ In generate_submission.py, num_clips does not seem to be used.

takfate commented 2 years ago

@masashi-hatano @VJWQ Hello, I also have some questions. I would like to know what the validation loss is when you train the baseline code. I appended the following code

    # average the accumulated multi-view predictions over num_clips
    for key in pred_dict:
        pred_dict[key] = pred_dict[key] / num_clips

after the multi-view accumulation, but I still get evaluation results similar to the ones in your first comment. I suspect there may be a problem with my data sampling.

masashi-hatano commented 2 years ago

@takfate Hello, this might help you: if you evaluate your model using both generate_submission.py and eval.py, the predicted values will be divided by num_clips twice, so removing either division may solve your problem.

takfate commented 2 years ago

@masashi-hatano I use generate_submission.py to generate a submission file for the test set and submit it to the EvalAI evaluation system. Will the EvalAI evaluation system do another division by 30?

VJWQ commented 2 years ago

> @masashi-hatano I use generate_submission.py to generate a submission file for the test set and submit it to the EvalAI evaluation system. Will the EvalAI evaluation system do another division by 30?

hi @takfate, your results will not be divided twice. In generate_submission.py, num_clips is just a placeholder and does not actually divide your results; the script serves to sum all 30 prediction results for each clip. The actual division happens in our evaluation script after you submit your results JSON file, where the /30 gives the average result for each clip. However, you still need to run `python tools/generate_submission.py /path/to/output.pkl 30` to generate the submission file correctly. @masashi-hatano @takfate Would you mind posting the commands you used to generate the submission file? I can take a look to see why you are receiving similar results and adjust our guidance accordingly.
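
To make the division of labour above concrete, here is a rough sketch with hypothetical helper names (not the actual scripts): the local generation step only sums the 30 per-view predictions per clip, and the single division by 30 happens in the server-side evaluation.

    # Rough sketch of the two stages; helper names are illustrative.

    def build_submission(per_view_preds):
        """Local stage: sum the 30 per-view 1x20 predictions for each clip
        (no division here, mirroring what generate_submission.py is said to do)."""
        submission = {}
        for view in per_view_preds:                 # one dict of predictions per view
            for clip_id, pred in view.items():
                acc = submission.setdefault(clip_id, [0.0] * 20)
                submission[clip_id] = [a + b for a, b in zip(acc, pred)]
        return submission

    def server_side_average(submission, num_clips=30):
        """Server stage: the single /30 that turns summed predictions into averages."""
        return {cid: [v / num_clips for v in vec] for cid, vec in submission.items()}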

masashi-hatano commented 2 years ago

If so, that's fine with me. Thanks anyway.

takfate commented 2 years ago

@VJWQ @masashi-hatano Our eval results are normal now. Thank you for your help.