SilvioGiancola / SoccerNet-code

SoccerNet: A Scalable Dataset for Action Spotting in Soccer Videos

Evaluation code #6

Closed: Kanav123 closed this issue 4 years ago

Kanav123 commented 5 years ago

Hi,

Could you please provide the code for the three spotting baselines (watershed segment-based and NMS-based) and also the tolerance-based mAP evaluation code?

Thanks!

SilvioGiancola commented 5 years ago

Hi @Kanav123, check the following Jupyter notebook for the spotting baselines and the mAP evaluation code. Sorry if the code is messy; it was not meant to be shared, but let me know if it helps you. Note that you need to run the detection first (classification on sliding windows) and export the results in numpy format.
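
Roughly, the NMS baseline is a greedy temporal non-maximum suppression on the per-class confidence curves; here is a minimal sketch (the function name, file name, window size, and threshold are illustrative, not the notebook's exact code):

    import numpy as np

    def temporal_nms(scores, window=20, threshold=0.5):
        """Greedy temporal NMS over a 1D array of per-second class confidences:
        pick the highest remaining score as a spot, suppress everything within
        +/- `window` seconds of it, and repeat until no score exceeds `threshold`."""
        scores = scores.astype(float)
        spots = []
        while scores.size and scores.max() > threshold:
            t = int(scores.argmax())
            spots.append((t, float(scores[t])))  # (time in seconds, confidence)
            lo, hi = max(0, t - window), min(len(scores), t + window + 1)
            scores[lo:hi] = -np.inf
        return spots

    # e.g., on per-class scores exported from the sliding-window classifier:
    # goal_scores = np.load("scores_goal.npy")   # hypothetical file name
    # goal_spots = temporal_nms(goal_scores, window=20)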

baraldilorenzo commented 4 years ago

Dear Silvio,

many thanks for sharing your code! Could you please clarify how the tolerance-based mAP is computed in the code? For example, where in the notebook is Evaluation/Spotting_NMS_10.npy generated?

Thanks for your support, Lorenzo.

SilvioGiancola commented 4 years ago

Dear Lorenzo,

I honestly do not remember that detail. I released all the details of my spotting code in a recent push; it contains a couple more Jupyter notebooks that should help you find a solution.

Also, please consider this new baseline for better-maintained code, which will be released sometime this month.

Cheers,

matteot11 commented 4 years ago

Hi,

thanks for the great work and for sharing it. I noticed the recent push, which added an "Evaluation" folder inside the "Detection" one. Since there are several .py files starting with "get_detection_performance_" in the "Evaluation" dir, I was wondering which one should be used for the spotting mAP. Moreover, all of them make use of "labels_Delta_<delta>.json" files (which I could not find in the repo) and of the "predictions_<center/argmax/NMS>.json" files (available in the "Results_Spot" dir). Could you please release the label files needed to run the evaluation, or clarify how I can generate them starting from the dataset's labels? If the new "context-aware-loss" repo will contain more details, you can ignore my issue and I will wait for it :)

SilvioGiancola commented 4 years ago

Hi @matteot11,

For the evaluation, I used the evaluation functions from ActivityNet, which consider temporally bounded activities. labels_Delta_<delta>.json is an ActivityNet-format version of each game's Labels.json that takes the spotting tolerance δ into account and turns each spot into an equivalent temporal activity of length δ. Note that the predictions_<center/argmax/NMS>.json files were created in a similar fashion, defining each temporal segment with a duration of 1 frame. I then used the ActivityNet function with tIoU > 0 between labels_Delta_<delta>.json and predictions_<center/argmax/NMS>.json to output the mAP for a given spotting tolerance δ.

That should help you reproduce these results :)
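
For concreteness, here is a minimal sketch of that conversion for one game, assuming the per-game Labels.json annotations carry a "gameTime" field like "1 - 05:32" and a "label" field (the function name and video-id scheme are illustrative):

    import json

    def game_to_activitynet(labels_json, game_id, delta, subset="test"):
        """Turn one game's spotting labels into ActivityNet-style temporal
        segments of width `delta` seconds centered on each annotated spot."""
        with open(labels_json) as f:
            labels = json.load(f)
        database = {}
        for ann in labels["annotations"]:
            half, clock = ann["gameTime"].split(" - ")   # e.g. "1 - 05:32"
            minutes, seconds = clock.split(":")
            center = int(minutes) * 60 + int(seconds)    # spot time in seconds
            entry = database.setdefault(game_id + "_" + half,
                                        {"subset": subset, "annotations": []})
            entry["annotations"].append({
                "label": ann["label"],
                "segment": [center - delta / 2.0, center + delta / 2.0],
            })
        return database

    # Merging all games under a top-level "database" key gives labels_Delta_<delta>.json;
    # the predictions follow the ActivityNet "results" format, with 1-frame-wide
    # segments and a confidence score per spot.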

matteot11 commented 4 years ago

Thanks for your help. I have successfully run the evaluation code using "get_detection_performance_spotting.py" and obtained reasonable mAP values, close to the ones reported in the paper but not equal (mine are slightly better). This may be due to the prediction file I selected for testing among those in the "Results_Spot" dir. I extracted the ground-truth event boundaries following your suggestion, using different tolerances (from 5 to 60 seconds). If I understood correctly, I have to build a .json file containing my predictions (along with their scores), compute the mAP 12 times (once per GT file, with tolerances from 5 to 60 seconds), and finally average those mAPs. Which prediction file should I use to reproduce, for instance, the green mAP curve of Figure 3a in the paper (Segment center, 40.6 mAP)? Thanks again, and I apologize for so many questions.
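
A minimal sketch of how such a prediction file could be assembled, assuming each detection is a (time, label, score) triple keyed by video id (the helper name and field values are illustrative; segments are 1 second wide, following the 1-frame convention above):

    import json

    def spots_to_activitynet_predictions(spots_per_video, out_file):
        """Write spotting results as an ActivityNet-style prediction file.
        `spots_per_video` maps a video id to a list of
        (time_in_seconds, label, score) tuples."""
        results = {}
        for video_id, spots in spots_per_video.items():
            results[video_id] = [
                {"label": label, "score": float(score), "segment": [t, t + 1]}
                for (t, label, score) in spots
            ]
        payload = {"version": "SoccerNet spotting", "results": results,
                   "external_data": {"used": False, "details": ""}}
        with open(out_file, "w") as f:
            json.dump(payload, f)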

SilvioGiancola commented 4 years ago

Exactly as you said. You may obtain slightly better results in testing, as an erroneous annotation in the testing set was pointed out to me. That annotation has been corrected and could lead to an improvement of up to 0.5% in the Average-mAP (AUC). How much higher is your result?

matteot11 commented 4 years ago

Unfortunately, the gap between my results and the mAP reported in the paper is much larger than 0.5% (e.g., I got 52.5 instead of 49.7 Average-mAP when using "predictions_20_Center_50.json", which I guess contains the predictions of a model trained on 20 s chunks with a watershed threshold of 50%). For the "labels_Delta_<delta>.json" files, I checked that the total number of annotations is correct (326 goals, 453 cards, 579 substitutions), counting "own goals" as "goals". Given a tolerance d, for each ground-truth event at timestamp m:s:

  1. I compute the center frame index as c = m*60 + s (i.e., the spot time in seconds);
  2. I compute the [start, end] of the interval as start = c - d/2 and end = c + d/2, keeping float values when the tolerance is odd (predictions are reported at 1-second resolution, and so are the GT files).

What do you think could be the problem? Thanks

SilvioGiancola commented 4 years ago

That gap seems far too large. Have you tried taking penalties into account?

SilvioGiancola commented 4 years ago

Here is where I define the labels: https://github.com/SilvioGiancola/SoccerNet-code/blob/82dd12401304d57de43e089cf2e023f2018edd63/src/Classification/Dataset.py#L79 Only the "penalty-missed" events are not taken into consideration.

matteot11 commented 4 years ago

Yes, even when taking penalties into account the result does not change much. Counting the event occurrences, they exactly match those reported in Table 6 of the paper's supplementary material for the testing set, so the GT parsing should be fine. The problem could be in the definition of the interval extremes, in the evaluation code, and/or in the JSON prediction file. Here is my evaluation code:

    import os
    import numpy as np
    # ActivityNet temporal detection evaluation (eval_detection.py from the
    # ActivityNet challenge toolkit); the import path may differ in your setup.
    from eval_detection import ANETdetection

    labels_root = "/path/to/my/GT"
    pred_root = "/path/to/my/predictions"
    results = []
    # One evaluation per spotting tolerance Delta, from 60 s down to 5 s.
    for Delta in range(60, 0, -5):
        print("------delta=" + str(Delta) + "------")
        verbose = True
        ground_truth_filename = os.path.join(labels_root, "labels_Delta_" + str(Delta) + ".json")
        prediction_filename = os.path.join(pred_root, "predictions_20_Center_50.json")
        # Single tIoU threshold of 0.0, following the tIoU > 0 idea discussed above.
        tiou_thresholds = np.linspace(0.00, 0.00, 1)
        anet_detection = ANETdetection(ground_truth_filename, prediction_filename,
                                       subset="test", tiou_thresholds=tiou_thresholds,
                                       verbose=verbose, check_status=False)
        results.append(anet_detection.evaluate())
        print(results)
    # Save the per-tolerance results.
    np.save("Spotting_NMS", results)
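
The final Average-mAP (the AUC over tolerances) would then be the mean of those per-tolerance values; a minimal sketch, assuming evaluate() returns the mAP value(s) for the single tIoU threshold used:

    # Aggregate the per-tolerance mAPs into the Average-mAP (assumption: each
    # entry of `results` holds the mAP obtained for one tolerance Delta).
    per_delta_mAP = np.asarray(results, dtype=float).reshape(len(results), -1).mean(axis=1)
    average_mAP = float(per_delta_mAP.mean())
    print("Average-mAP over tolerances 5-60 s:", average_mAP)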