Closed Wilann closed 2 years ago
Hi @Wilann,
Here are the answers to your questions :-)
The code you mention is indeed our re-implementation of CCBV by Floriane Magera from the company EVS Broadcast Equipment. She is our collaborator on the paper and re-implemented the method herself.
CCBV is the calibration student and the commercial product is the teacher. We did not want our method to rely on a private company's product, which is why we distilled it into an open-source architecture (CCBV). Otherwise, people using our method would have to buy the product to reproduce the results, which would not be fair to the scientific community. We wanted a method that is 100% open-source, so that people can try it freely on their own games for research as well. If you want a calibration model for badminton, you would have to retrain CCBV completely from scratch with your own annotated data (or use an available calibration algorithm).
No, the segmentation mask of the players was only used to compute the average color; it was not saved in the final JSON file. onfield just indicates whether the bounding box intersects the field (hence, more or less, whether it is a player or referee: 1, or someone in the crowd: 0).
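As a minimal sketch of the average-color step described above (the function name and shapes are hypothetical, not the devkit's actual code): the boolean instance mask simply indexes the frame, and the selected pixels are averaged.

```python
import numpy as np

def average_player_color(frame, mask):
    """Mean RGB color of the pixels covered by a boolean instance mask.

    frame: (H, W, 3) uint8 image; mask: (H, W) boolean array.
    """
    pixels = frame[mask]        # (N, 3) array of the masked RGB values
    return pixels.mean(axis=0)  # average over the N masked pixels

# Tiny example: a red 2x2 patch selected by the mask averages to pure red.
frame = np.zeros((4, 4, 3), dtype=np.uint8)
frame[:2, :2] = [255, 0, 0]
mask = np.zeros((4, 4), dtype=bool)
mask[:2, :2] = True
print(average_player_color(frame, mask))  # [255.   0.   0.]
```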
No, this was done in separate code that I don't think is available in the devkit, as it is basically code from one of our previous works (see https://github.com/cioppaanthony/online-distillation/blob/master/utils/field.py).
No, this was also done in separate code that I think was not shared. What you mentioned is something we tested but that did not really improve the results: it's basically drawing a trail behind the players (a bit like tracklets) in the final image representation. So it is not used in the command line we provide.
This is simply the part of the field that the camera sees, basically the green channel of Figure 2 (a) in the paper.
These radar images are the background used for Figure 2 (a) of the paper as well. This is just to have a reference of where the players are on the field in the representation. However, lines 68-82 do not save anything; they rather load everything into memory in the correct format.
Exactly.
Yes, it is right here: https://github.com/SilvioGiancola/SoccerNetv2-DevKit/tree/main/Task1-ActionSpotting/CALF_Calibration_GCN
They draw the bounding boxes as filled rectangles in the player's average color. The first part is for when there is a calibration (hence drawn on the top view), and the second part is for when there is no calibration (drawn on the image plane; this was just for an ablation that did not make it into the paper, so don't worry about it).
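To make the "filled rectangles in the player's average color" concrete, here is a hedged sketch (function name, canvas size, and coordinates are all hypothetical; the devkit's actual drawing code differs): each player becomes a filled square at their projected top-view position.

```python
import numpy as np

def draw_player(representation, center_xy, color, size=8):
    """Draw a filled square in the player's average color on the
    top-view representation. `size` is assumed even."""
    h, w, _ = representation.shape
    x, y = center_xy
    half = size // 2
    x0, x1 = max(x - half, 0), min(x + half, w)  # clip to the canvas
    y0, y1 = max(y - half, 0), min(y + half, h)
    representation[y0:y1, x0:x1] = color
    return representation

# Hypothetical 105x68 top-view canvas with one red player near midfield.
radar = np.zeros((68, 105, 3), dtype=np.uint8)
draw_player(radar, (52, 34), color=(255, 0, 0))
```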
I hope this helps you understand better our method! :-)
Hi again @cioppaanthony,
Thank you so much for such detailed and fast responses! I have some follow up questions below:
{
"homography": [
1644.5538330078125,
-801.4969482421875,
46028.890625,
11.810320854187012,
-8.94961929321289,
22109.705078125,
-0.011232296004891396,
-0.788362979888916,
45.37401580810547
]
}
How would I annotate this homography data?
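For reference, the nine values above can be read row by row into a 3x3 matrix and applied to a point in homogeneous coordinates. This is a generic sketch; whether this particular homography maps image pixels to field-model coordinates or the reverse should be checked against the devkit.

```python
import numpy as np

# The nine JSON values, read row by row into a 3x3 homography matrix.
H = np.array([
    [1644.5538330078125, -801.4969482421875, 46028.890625],
    [11.810320854187012, -8.94961929321289, 22109.705078125],
    [-0.011232296004891396, -0.788362979888916, 45.37401580810547],
])

def project(H, x, y):
    """Apply a homography to a 2D point via homogeneous coordinates."""
    p = H @ np.array([x, y, 1.0])
    return p[:2] / p[2]  # divide out the homogeneous scale

field_xy = project(H, 960, 540)  # e.g. the image center
```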
Since I wouldn't use a teacher-student approach, I believe I can delete all code related to args.teacher - is this correct?
Also, it seems the CCBV repo doesn't have a training pipeline - I'm not familiar with CCBV at all, but would it be possible to somewhat-easily create the pipeline from the given classes?
Note: For the package I'm using, when there are 2 instances, I get a tensor of shape [2, 1080, 1920] with boolean values. Do I just have to get the RGB colors of the True values and then average them?
Does field.py calculate onfield? Here are the steps I'm considering - please let me know if it makes sense or not:
Steps: Use field.py to filter out bounding boxes that do not intersect the field.
What would I use to get the field lines?
Also, where is this player localization being used in the code?
New Questions:
What do these variables mean:
- args.mode - I don't think it's used anywhere
- args.feature_multiplier - Used in the model somehow?
- args.calibration_physic
- dim_representation_player - What is this, and why should it be an even number?
- args.with_dense
- args.with_dropout - Is this just to see if dropout improves performance or not?
What are the "copy" functions in the model used for? For example, init_2DConv(...) vs init_2DConv_copy(...), etc.
Thank you again for taking the time to read and answer my questions! I'm still new to many things about CALF_Calibration and CCBV, but I hope my questions made sense.
I am also interested in this question.
I also want to run training using CALF_Calibration_GCN. I have an external video, but I'm still investigating what else I need. (JSON labels?)
Hi @Wilann,
Note that args.teacher is not related to the teacher-student distillation (which is out of the scope of this repository). This argument was simply for us to try using directly the predictions of the big commercial product teacher.
That's exactly the procedure we've used! :-)
You don't need the field lines in the case of soccer; the field mask is sufficient to filter out most detections outside the field. If you have the field lines for badminton, you can use them to keep only the bounding boxes intersecting your field.
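The field-mask filter described above can be sketched as follows (function name and mask layout are hypothetical, not the devkit's actual code): a box is kept if any of its pixels fall on the field mask.

```python
import numpy as np

def on_field(bbox, field_mask):
    """Return 1 if the bounding box overlaps the field mask, else 0.

    bbox: (x0, y0, x1, y1) in pixels; field_mask: (H, W) boolean array.
    """
    x0, y0, x1, y1 = bbox
    return int(field_mask[y0:y1, x0:x1].any())

# Toy mask where the bottom of the image is field, the top is stands.
mask = np.zeros((100, 100), dtype=bool)
mask[40:100, :] = True
print(on_field((10, 50, 20, 60), mask))  # 1: box lies on the field
print(on_field((10, 0, 20, 10), mask))   # 0: box is in the stands
```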
This is basically what's done between lines 218-297 and 328-402 of https://github.com/SilvioGiancola/SoccerNetv2-DevKit/blob/main/Task1-ActionSpotting/CALF_Calibration/src/dataset.py. The player localization is stored in representation_half_1 and representation_half_2, which is later used in the model.
mode: unused indeed, it was for experiments
features_multiplier: simply states by how much to multiply the number of features in the latent space of the original CALF (times 2, for instance). This is just to increase the size of the network for ablation purposes.
calibration_physic: used to get the representation in Figure 2 of the paper with one type of information per channel (i.e., without the player color information). This was also done for ablation purposes.
representation_player: it is the size of the player square in the top-view representation (see Figure 2 (a) of the paper). It should be even because we use self.size_radar_point//2 when drawing the players; this just avoids being surprised that the representation does not change between 4 and 5, for instance.
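To illustrate the integer-division point above with plain arithmetic:

```python
# With a half-width of size // 2, sizes 4 and 5 give the same half-width,
# so both would draw an identical square - hence the even-size requirement.
halves = {size: size // 2 for size in (4, 5)}
print(halves)  # {4: 2, 5: 2}
```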
with_dense: uses a dense layer to get to the latent space of CALF rather than the pyramid module (for ablation purposes)
with_dropout: we added a dropout on the latent space for ablation purposes.
This is when we tested it with a larger resnet architecture for the input features, which required 1792 input features instead of 512. This was also only for ablation purposes.
Thank you for your questions; they made me realize that the code contains too much unused material left over from ablations, and that I should clean it up. What would you think of a minimal working version rather than this long code? I could try to produce it when I have a bit of time if you think it would be valuable.
@DogFortune For that, you need calibration predictions from an algorithm (for instance CCBV) and player information with bounding boxes and average colors (for instance with Mask R-CNN).
Hi @cioppaanthony,
Sometimes it also identifies players on the screen in the stadium, as well as coaches when they're off the court (where there are no field lines). Would this impact the training process or results?
To get the field lines would I have to manually label them or use a calibration algorithm like CCBV?
For calibration_physic, do you mean Color Composite Image (False) and Binary Channel Image (True)?
I've also noticed that setting self.backbone_player == "3DConv" is the same as setting self.calibration == True, since I've removed the second part from your comment above:
They draw the bounding boxes as filled rectangles in the player's average color. The first part is when there is a calibration (hence on the top-view) and the second part is when you have no calibration (drawn on the image plane, this was just for ablation that did not make it in the paper, don't worry about it).
Thank you for answering my questions! I've personally already made the changes to my own copy of the code (with my own dataset, removing the half2 variables, and so on), so it wouldn't benefit me much. Although if you'd like, I could make the same changes to this repo with a PR (without removing half2 and things like that). It would be my first PR, but let me know if I can help in this sense.
Hi @Wilann,
Judging from the images you sent, I doubt that any calibration algorithm would work, as these are mostly close-up shots. Calibration can be useful in soccer, for instance, when you have a camera that films most of the soccer field from above, where you can clearly see the lines and thus compute a proper homography. Furthermore, since in badminton you only have 2 to 4 players at the same time, I'm not sure how much relevant information can be extracted from player localization. In soccer, the advantage is that it is easier to see team formations, like defense lines for instance. I'm not sure how that would translate to badminton.
And if you don't already have a calibration algorithm for badminton, I guess the only solution is indeed to annotate the lines and train your own model (or find one that already exists for badminton). This can be really time consuming, for a potentially low increase in performance. Therefore, I'm not sure I would recommend taking this route in your case.
About the PR, don't worry, we will keep the code up to date internally at this time, thanks for proposing. :-)
Hi @cioppaanthony
The 1st frame occurs in ~20-30% of a match, and the 2nd frame doesn't occur often (as it's just a quick coaching phase, maybe ~4%). I've noticed that rallies take ~30% of a match, and start of rallies often look like this frame (below), where players are always in this formation:
I understand that classes are split into patterned and fuzzy groups, and was thinking that "Rally Start" would benefit since I could classify it as patterned (because of this starting formation), but "Rally End" would just use the vanilla CALF pipeline.
Thank you for your insights! I'm also currently looking into NetVLAD++ (while I run experiments with the vanilla CALF pipeline) to see if my use case can benefit from it.
Hi @Wilann,
Oh I see! Therefore yes, you're right. The separation you propose between patterned and fuzzy makes sense in your case. Good luck in your research with NetVLAD++ as well!
Hi @cioppaanthony,
That's great to hear! Thank you so much for confirming my idea - really appreciate you taking the time to help me out!
Hello SoccerNet Dev Team,
I'm currently in the process of reading your paper on CALF-Calibration, and the entire pipeline along with the results is very impressive. I have a few (okay, many) questions about parts of the paper/code I'm confused by, and would really appreciate it if any of you could help clear up my confusion. I know I have many questions written below, and completely understand if you're unable to answer them all due to the volume. Still, I would love to dig deeper into your work, and it would be amazing if you could help me do so. As always, thank you so much for your time, and I'm of course looking forward to your new discoveries!
Note: As mentioned in my previous issues, I'm trying to use action-spotting in the context of badminton.
1. ./calibration_data/model.png - I assume I have to swap this out for a badminton court for my application. And in ./calibration_data/dictionary.json, the data format is:
I assume this is the format that the "Xeebra Evs" product will write the data as? And I believe the model predictions are in the 1_field_calib_ccbv.json files?
2. As a follow-up to Q1, if you didn't use the CCBV code, have you open-sourced your student calibration algorithm?
3. In the 1_player_boundingbox_maskrcnn.json files, I see bbox, color and onfield predictions. Is onfield the image segmentation mask?
4. I see src/config/radar.png and src/config/model-radar-mini.png. What are these "radar" images?