SilvioGiancola / SoccerNetv2-DevKit

Development Kit for the SoccerNet Challenge
MIT License

Lots of Questions on CALF-Calibration #33

Closed Wilann closed 2 years ago

Wilann commented 2 years ago

Hello SoccerNet Dev Team,

I'm currently reading your paper on CALF-Calibration, and the entire pipeline, along with the results, is very impressive. I have quite a few questions about parts of the paper/code that I'm confused by, and would really appreciate it if any of you could help clear up my confusion. I know I have many questions written below, and completely understand if you're unable to answer them all due to the volume. Still, I would love to dig deeper into your work, and it would be amazing if you could help me do so. As always, thank you so much for your time, and I'm of course looking forward to your new discoveries!

Note: As mentioned in my previous issues, I'm trying to use action-spotting in the context of badminton.


  1. Section 3: Calibration Algorithm - Here it says "We base our calibration on the Camera Calibration for Broadcast Videos (CCBV) of Sha et al. [38], but we write our own implementation, given the absence of usable public code". I thought the public implementation was here, based on mentions in #19 and #32.

  2. Section 3: Our training process - Since there isn't a large enough public dataset of ground-truth calibrations, it seems you needed a student-teacher distillation approach. Why does the lack of data require this approach? It's also mentioned that you use "Xeebra" from EVS to obtain the pseudo-GT calibrations. In the CCBV repo, I assume I have to swap ./calibration_data/model.png out for a badminton court for my application. And in ./calibration_data/dictionary.json, the data format is:
    [
    {
        "posX": 0.03973018142882254,
        "posY": 68.63033722968056,
        "posZ": -15.718964999679423,
        "focalLength": 4576.520967734781,
        "pan": 7.544952667759858,
        "tilt": 77.55662442882397,
        "template_id": 0,
        "calibration": [
            4576.520967734781,
            0.0,
            960.0,
            0.0,
            4576.520967734781,
            540.0,
            0.0,
            0.0,
            1.0
        ],
        "homography": [
            4659.98895334099,
            -328.4171986412632,
            25605.797859625953,
            -60.243484492166004,
            454.8368668686411,
            40864.09122521112,
            0.1282196063258542,
            -0.9680549606981291,
            69.81988273325322
        ],
        "homography_resize": [
            621.3318481445312,
            -43.788963317871094,
            3414.106689453125,
            -8.032465934753418,
            60.64492416381836,
            5448.544921875,
            0.1282196044921875,
            -0.9680549502372742,
            69.81988525390625
        ],
        "image": "/home/fmg/sources/mmeu/data/meanshift2/gmm/dictionary-four/dict-0000.png"
    },
    ...
    ]

    I assume this is the format that the "Xeebra" product from EVS writes the data in? And I believe the model predictions are in the 1_field_calib_ccbv.json files?
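
    For reference, here's how I'm currently reading these entries - just my own rough sketch, assuming the file is a JSON list of such entries and that the flat 9-element "calibration" and "homography" lists are row-major 3x3 matrices (please correct me if that's wrong):

    # Rough sketch (mine, not from the devkit): parse dictionary.json,
    # assuming each flat 9-element list is a row-major 3x3 matrix.
    import json
    import numpy as np

    with open("calibration_data/dictionary.json") as f:
        entries = json.load(f)

    for entry in entries:
        K = np.array(entry["calibration"]).reshape(3, 3)  # intrinsics [[f, 0, cx], [0, f, cy], [0, 0, 1]]
        H = np.array(entry["homography"]).reshape(3, 3)   # 3x3 homography
        print(entry["template_id"], entry["pan"], entry["tilt"])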

As a follow-up to Q1: if you didn't use the CCBV code, have you open-sourced your student calibration algorithm?


  3. Section 3: Player localization - It says that for each frame, you use Mask R-CNN to obtain the bounding box, segmentation mask, and average RGB color of each detected/segmented person. When checking the 1_player_boundingbox_maskrcnn.json files, I see bbox, color, and onfield predictions. Is onfield the image segmentation mask?

  4. Section 3: Player localization - It then says "Then, we compute a field mask following [10] to filter out the bounding boxes that do not intersect the field, thus removing e.g. staff and detections in the public". Is this lines 233-237?

  5. Section 3: Player localization - Following up on Q4, it then says "We use the homography computed by CCBV-SN to estimate the player localization on the field in real-world coordinates from the middle point of the bottom of their bounding box". Is this lines 260-273?

  6. Code Blob - From #6 I gather that lines 240-257 are used to transform the current frame into a top-view representation? What's a "calibration cone"? Do you have an image of what it looks like?

  7. Section 4: Top view image representations - From issue #32 I've gathered that lines 68-82 save the top-view images, and I can just edit the save paths to keep them instead of overwriting them. You also read in the images src/config/radar.png and src/config/model-radar-mini.png. What are these "radar" images?

  8. Section 4: Feature vector representations - It seems that lines 89-130 load the model required to get the feature vector representations from the top-view representations (depending on which backbone we want to use), and that these are computed & saved in lines 303-324?

  9. Section 4: Player graph representation - Is this somewhere in the repo? I can't find the code for it at all.

  10. Other - What do lines 276-286 and lines 289-297 do?
cioppaanthony commented 2 years ago

Hi @Wilann,

Here are the answers to your questions :-)

  1. The code you mention is indeed our re-implementation of CCBV by Floriane Magera from the company EVS Broadcast Equipment. She is our collaborator on the paper and re-implemented the method herself.

  2. CCBV is the calibration student and the commercial product is the teacher. We did not want our method to rely on a private company's product, which is why we distilled it into an open-source architecture (CCBV). Otherwise, people using our method would have to buy the product to reproduce the results, which is not really fair to the scientific community. We wanted a method that is 100% open-source, so that people can try it freely on their own games for research as well. If you want a calibration model for badminton, you would have to completely retrain CCBV from scratch with your own annotated data (or use an available calibration algorithm).

  3. No, the segmentation mask of the players was only used to compute the average color, but was not saved in the final json file. onfield just says whether the bounding box intersects the field (hence, more or less, whether it is a player or referee: 1, or someone in the crowd: 0).

  4. No, this was done in separate code that I don't think is available in the devkit, as it is basically code from one of our previous works. (See https://github.com/cioppaanthony/online-distillation/blob/master/utils/field.py)

  5. No, this was also done in separate code that I think was not shared. What you mentioned is something we tested that did not really improve the results: it basically draws a trail behind the players (a bit like tracklets) in the final image representation. So it is not used in the command line we provide.

  6. This is simply the part of the field that the camera sees, basically the green channel of Figure 2 (a) in the paper.

  7. These radar images are the background used for Figure 2 (a) of the paper as well. This is just to have a reference for where the players are on the field in the representation. However, lines 68-82 do not save anything; rather, they load everything into memory in the correct format.

  8. Exactly.

  9. Yes, it is right here: https://github.com/SilvioGiancola/SoccerNetv2-DevKit/tree/main/Task1-ActionSpotting/CALF_Calibration_GCN

  10. They draw the bounding boxes as filled rectangles in the player's average color. The first part is when there is a calibration (hence on the top view) and the second part is when you have no calibration (drawn on the image plane; this was just for an ablation that did not make it into the paper, don't worry about it).
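
      To give you a rough idea of the first part (this is only a simplified sketch with placeholder names, not the actual devkit code), drawing a player on the top view boils down to filling a small square of the player's average color at their projected position:

      # Simplified sketch: draw each player as a filled square of their average
      # color on the top-view ("radar") image (a NumPy array). Placeholder names.
      def draw_players(radar, positions, colors, size_radar_point=8):
          # radar: [H, W, 3] top-view image, positions: [N, 2] pixel coordinates,
          # colors: [N, 3] average RGB colors, size_radar_point: square size (even).
          half = size_radar_point // 2
          for (x, y), color in zip(positions, colors):
              x, y = int(round(x)), int(round(y))
              radar[max(y - half, 0):y + half, max(x - half, 0):x + half] = color
          return radar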

I hope this helps you better understand our method! :-)

Wilann commented 2 years ago

Hi again @cioppaanthony,

Thank you so much for such detailed and fast responses! I have some follow up questions below:

  1. I see, so I would have to get my own annotated data. How would I begin to do so? I see the data format is:
    {
    "homography": [
        1644.5538330078125,
        -801.4969482421875,
        46028.890625,
        11.810320854187012,
        -8.94961929321289,
        22109.705078125,
        -0.011232296004891396,
        -0.788362979888916,
        45.37401580810547
    ]
    }

    How would I annotate this homography data?

Since I wouldn't use a teacher-student approach, I believe I can delete all code related to args.teacher - is this correct?

Also, it seems the CCBV repo doesn't have a training pipeline - I'm not familiar with CCBV at all, but would it be possible to somewhat easily create the pipeline from the given classes?


  2. Here's how I'm planning on computing the average color - does it make sense? Steps:
    • Use Mask R-CNN to get the segmentation mask of each player
    • Average the RGB color of all pixels within each mask

Note: For the package I'm using, when there are 2 instances, I'm getting a tensor of shape [2, 1080, 1920] with boolean values. Do I just have to get the RGB colors of the "True" positions and then average them?
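
To be concrete, here's the kind of thing I have in mind (just a sketch with placeholder names, assuming the masks come out as a boolean [N, H, W] array and the frame is an [H, W, 3] RGB image):

    # Sketch: average RGB color per detected player from boolean instance masks.
    import numpy as np

    def average_colors(frame, masks):
        # frame: [H, W, 3] RGB image, masks: [N, H, W] boolean array
        colors = []
        for mask in masks:
            pixels = frame[mask]  # all RGB values where the mask is True
            colors.append(pixels.mean(axis=0) if len(pixels) else np.zeros(3))
        return np.stack(colors)   # [N, 3] average color per instance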


  3. Following up on Q3, so field.py calculates onfield? Here are the steps I'm considering - please let me know if they make sense: Steps:
    • Use Mask R-CNN to get player bounding boxes
    • Use ___ to get field lines (not sure what I would use)
    • Use field.py to filter bounding boxes that do not intersect the field

What would I use to get the field lines?


  4. Following up on Q4 (and after I actually train the CCBV model), how would I use the homography from CCBV to estimate player localization on the field (from the middle point of the bottom of their bounding box - aka mostly their shoes)?

Also, where is this player localization being used in the code?
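
My current guess at how that projection would work is something like this (just my own understanding, with placeholder names - please correct me if I have the direction of the homography wrong):

    # Sketch: project the bottom-middle point of a bounding box to field
    # coordinates with a 3x3 homography H (assumed image -> field here).
    import numpy as np

    def project_foot_point(bbox, H):
        x1, y1, x2, y2 = bbox
        foot = np.array([(x1 + x2) / 2.0, y2, 1.0])  # bottom-middle, homogeneous coords
        field = H @ foot
        return field[:2] / field[2]                  # (X, Y) position on the field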


New Questions:

  1. What do these variables mean:

    • args.mode - I don't think it's used anywhere
    • args.feature_multiplier - Used in the model somehow?
    • args.calibration_physic
    • dim_representation_player - What is this, and why should it be an even number?
    • args.with_dense
    • args.with_dropout - Is this just to see if dropout improves performance or not?
  2. What are the "copy" functions in the model used for? For example, init_2DConv(...) vs init_2DConv_copy(...), etc.


Thank you again for taking the time to read and answer my questions! I'm still new to many things about CALF_Calibration and CCBV, but I hope my questions made sense.

DogFortune commented 2 years ago

I am also interested in this question.

I also want to run training using CALF_Calibration_GCN. I have an external video, but I'm still investigating what else I need. (JSON labels?)

cioppaanthony commented 2 years ago

Hi @Wilann,

  1. What you are showing (the camera calibration JSON) is not ground-truth data for training CCBV, but the predictions of CCBV. I'm not sure what Floriane Magera (author of the CCBV code) used to train the network. I would suggest you raise an issue directly on the CCBV GitHub so she can see your question; the same goes for the training pipeline, which I unfortunately don't have.

Note that args.teacher is not related to the teacher-student distillation (which is out of the scope of this repository). This argument was simply for us to try using the predictions of the big commercial teacher product directly.

  2. That's exactly the procedure we've used! :-)

  3. You don't need the field lines in the case of soccer; the field mask is sufficient to filter out most detections outside the field. If you have the field lines for badminton, you can use them to keep only the bounding boxes intersecting your field (see the small sketch after this list).

  4. This is basically what's done between lines 218-297 and 328-402 of https://github.com/SilvioGiancola/SoccerNetv2-DevKit/blob/main/Task1-ActionSpotting/CALF_Calibration/src/dataset.py. The player localization is stored in representation_half_1 and representation_half_2, which are later used in the model.

  5. Here is what they mean:

    • mode: indeed unused, it was for experiments.
    • features_multiplier: simply states by how much to multiply the number of features in the latent space of the original CALF (times 2, for instance). This is just to increase the size of the network for ablation purposes.
    • calibration_physic: used in Figure 2 of the paper to get the representation with one type of information per channel (without the player color information). This was also done for ablation purposes.
    • dim_representation_player: the size of the square drawn for each player in the top-view representation (see Figure 2 (a) of the paper). It should be even because we use self.size_radar_point//2 when drawing the players, just to avoid being surprised that the representation does not change between 4 and 5, for instance.
    • with_dense: uses a dense layer to get to the latent space of CALF rather than the pyramid module (for ablation purposes).
    • with_dropout: we added a dropout on the latent space for ablation purposes.

  6. This is when we tested it with a larger ResNet architecture for the input features, which required 1792 input features instead of 512. This was also only for ablation purposes.
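
      About point 3, here is roughly what the field-mask filtering looks like (a simplified sketch with placeholder names, not the exact devkit code): keep only the detections whose bottom-middle point lies on the field mask.

      # Simplified sketch: filter detections with a binary field mask.
      # field_mask: [H, W] boolean array (True = field), boxes: (x1, y1, x2, y2) tuples.
      def filter_on_field(boxes, field_mask):
          kept = []
          for (x1, y1, x2, y2) in boxes:
              foot_x = min(int((x1 + x2) / 2), field_mask.shape[1] - 1)
              foot_y = min(int(y2), field_mask.shape[0] - 1)
              if field_mask[foot_y, foot_x]:  # bottom-middle point is on the field
                  kept.append((x1, y1, x2, y2))
          return kept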

Thank you for your questions; they made me realize that the code contains too much unused stuff that was only there for ablations and that I should clean up. What would you think of a minimal working version rather than this long code? I could try to produce it when I have a bit of time, if you think it would be valuable.

@DogFortune For that, you need calibration predictions from an algorithm (for instance CCBV) and player information with bounding boxes and average colors (for instance with Mask R-CNN).

Wilann commented 2 years ago

Hi @cioppaanthony,

  1. In badminton, there are line judges all around the court, which Mask R-CNN often picks up on, for example: frame_9652

Sometimes it also identifies players on the screen in the stadium, as well as coaches when they're off the court (where there are no field lines). Would this impact the training process or results? frame_3048

To get the field lines, would I have to manually label them or use a calibration algorithm like CCBV?

  2. For calibration_physic, do you mean Color Composite Image (False) and Binary Channel Image (True)?

I've also noticed that setting self.backbone_player == "3DConv" is the same as setting self.calibration == True, since I've removed the second part from your comment above:

They draw the bounding boxes as filled rectangles in the player's average color. The first part is when there is a calibration (hence on the top view) and the second part is when you have no calibration (drawn on the image plane; this was just for an ablation that did not make it into the paper, don't worry about it).

Thank you for answering my questions! I've personally already made the changes to my own copy of the code (with my own dataset, removing the half2 variables, and so on), so it wouldn't benefit me much. Although if you'd like, I could make the same changes to this repo with a PR (without removing half2 and things like that). It would be my first PR, but let me know if I could help in this sense.

cioppaanthony commented 2 years ago

Hi @Wilann,

Judging from the images you sent, I doubt that any calibration algorithm would work, since they are mostly close-up shots. Calibration can be useful in soccer, for instance, when you have a camera that films most of the field from above, where you can clearly see the lines and thus compute a proper homography. Furthermore, since in badminton you only have 2 to 4 players at the same time, I'm not sure how much relevant information can be extracted from player localization. In soccer, the advantage is that it is easier to see team formations, like defense lines for instance. I'm not sure how that would translate to badminton.

And if you don't already have a calibration algorithm for badminton, I guess the only solution is indeed to annotate the lines and train your own model (or find one that already exists for badminton). This can be really time-consuming, for a potentially low increase in performance. Therefore, I'm not sure I would recommend taking this route in your case.

  2. Yes, exactly.

About the PR, don't worry, we will keep the code up to date internally for now; thanks for offering. :-)

Wilann commented 2 years ago

Hi @cioppaanthony

The 1st frame occurs in ~20-30% of a match, and the 2nd frame doesn't occur often (as it's just a quick coaching phase, maybe ~4%). I've noticed that rallies take up ~30% of a match, and the starts of rallies often look like the frame below, where players are always in this formation: frame_6096

I understand that classes are split into patterned and fuzzy groups, and was thinking that "Rally Start" would benefit since I could classify it as patterned (because of this starting formation), but "Rally End" would just use the vanilla CALF pipeline.

Thank you for your insights! I'm also currently looking into NetVLAD++ (while I run experiments with the vanilla CALF pipeline) to see if my use case can benefit from it.

cioppaanthony commented 2 years ago

Hi @Wilann,

Oh, I see! Then yes, you're right: the separation you propose between patterned and fuzzy makes sense in your case. Good luck with your NetVLAD++ research as well!

Wilann commented 2 years ago

Hi @cioppaanthony,

That's great to hear! Thank you so much for confirming my idea - really appreciate you taking the time to help me out!