SilvioGiancola / SoccerNetv2-DevKit

Development Kit for the SoccerNet Challenge
MIT License

Labeling "kind-of" visible actions for Task1-ActionSpotting #31

Closed Wilann closed 2 years ago

Wilann commented 2 years ago

Hi again,

In certain scenarios, I'm sure there are cases where, at the moment an action occurs, it's not visible, but then after, let's say, 2s, while the action is still in progress, it becomes visible. In these cases, would you label the action as "visible" or "not shown"? An idea I had was to label it "not shown" first, then after 2s, when the action becomes visible, label the same action again, but this time as "visible".

Does the visibility label impact the training at all? Or just the metric calculation?

Looking forward to your response. Thank you for reading!

cioppaanthony commented 2 years ago

Hi @Wilann,

Great question!

In our CALF method, the visibility only affects the metric, not the training. But that's information people could use for training their method (for instance in the context-aware loss, just giving out ideas).

I would rather say that if you only want the visible actions, you should only annotate them, but definitely not annotate the same action twice. The reason is that it will be hard for the network to say, for each action that it predicts, whether the action already started off-camera a few seconds before and hence whether it needs to put a second timestamp a few seconds earlier. I would rather say that you should have only one single timestamp per action. What you could do is keep only the visible timestamp out of the two, if you prefer.

In our case, we defined the actions at exact moments (for instance, for the goal, the moment the ball crosses the line). Therefore, we could not annotate it later, even a few seconds after. In the case of unshown actions, we simply tried to guess the best moment to put the timestamp.

But again, you could define the annotation process another way than what we did for your own project. :-)

Wilann commented 2 years ago

Hey @cioppaanthony,

Alright, so the important thing is to only annotate once per action.

I'd say for my use case, I'd like my model to mainly (let's say +99% of the time) get the visible actions. I've also been interested in Task2-CameraShotSegmentation. My original idea was to use Action Spotting + Camera Shot Segmentation to extract the visible actions in a video, with the assumption that the visibility of an action is highly correlated with which camera is being used at that moment. Now that you've recommended keeping only the visible timestamp (since that's what I'm interested in), do you think I would even need to tackle Camera Shot Segmentation for my use case?

Another approach I can think of is to label both visible and unshown actions, just in case I want the unshown actions in the future, and because it doesn't take that much more time (in my use case, there don't seem to be many unshown actions - maybe ~1% in the entire video). Then, during training, I can filter out the unshown actions. In this approach, I don't think I'd need Camera Shot Segmentation either - is this correct?
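
For reference, the filtering step I have in mind would be something along these lines (a minimal sketch assuming SoccerNet-style Labels JSON entries with a visibility field; the exact key names are my assumption):

import json

def load_visible_annotations(label_path):
    # Keep only the annotations tagged as visible; the key names are assumed
    # to follow a SoccerNet-style Labels JSON ("annotations", "visibility").
    with open(label_path) as f:
        labels = json.load(f)
    return [a for a in labels.get("annotations", [])
            if a.get("visibility", "visible") == "visible"]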

Thank you for the feedback! What do you think of my 2 approaches?

cioppaanthony commented 2 years ago

Hi @Wilann,

It actually depends on whether you want to extract video clips containing a single camera, or only the frame of the action (I'm not sure you mentioned which one you want, could you please specify it?). In the first case, you will need to know where the video cuts are since you need to properly cut your clip. In the second case, you don't need the camera information since the frame will be extracted no matter what. Now, if you only want, let's say, the free-kicks seen from the main camera, then you will obviously need the camera information.

Indeed, the visibility attribute might be linked to the type of camera. For instance, it is expected that most actions are visible when the main camera is used, but this is not guaranteed.

If it does not take that much more time to label the unshown actions, I would go for it! It's always better to have extra information, even if you don't use it afterwards. And if you don't annotate them and you realize that you need them afterwards, it will take much more time as you will have to go back through the entire video once more.

Wilann commented 2 years ago

Hi @cioppaanthony,

To clarify, I would like to extract the frame of the action, regardless of which camera it comes from. I believe this is the second case you mentioned.

What I worry about, is that for example in Tennis (I'm using Tennis as an example since I'm not so familiar with Soccer), let's say I want to extract the frames where a rally begins. The thing is, sometimes a rally can begin and the video still shows a replay of the previous rally. Here, I would like to extract the frame of the start of the rally, but when it's actually visible (not during the replay shot), regardless of which camera is showing it (main camera, side camera, player close-up, etc.).

Building off my previous comment, I believe I have 2 options:

  1. Use Action Spotting + Camera Shot Segmentation as they're intended in your repo, then filter frames where an action is detected + the camera used is a main/side/player close-up/etc.
  2. Only use Action Spotting, but I "double-label": first when the action is unshown, then again when it becomes visible. I would then only train with the labels that are visible, which would have camera information embedded due to my labeling decisions. (The unshown labels are just in case I need them in the future, since as you pointed out, it would be very time-consuming to go back and re-label the entire video).
Wilann commented 2 years ago

As a follow-up question, in the paper, section 3.1 Encoding, it says

Time-shift encoding (TSE) for temporal segmentation. ... (2) Just before an action, its occurrence is uncertain. Therefore, we do not influence the score towards any particular direction.

This corresponds to the segmentation scores that are 0 everywhere except the moment an action occurs, where it's then a 1. This means that if I forget to label a goal, this is okay, since the network won't learn "no goal" - it just won't learn anything, and the only downside is one less label in my dataset. Is this correct?

cioppaanthony commented 2 years ago

Hi @Wilann,

If you're only interested in the frames, I would then start with the visible action. This will give you a first large subset of action frames.

If you really want them all, including the frames when the action is visible after being unshown, then as you say, you might do it automatically with the camera information (reminder that it is only available for 200 out of the 500 games).

In the second option you propose, if you use our annotator, you can directly navigate through the unshown actions and add a timestamp when the action is visible again (or when the main camera comes back, for instance).

Be careful: in your follow-up question, you say that the segmentation score is 0 before a goal. This is not correct; we actually set the segmentation LOSS to 0, which is very different. Setting the loss to 0 says that you don't care what the network will predict there (whether it is a 1, a 0, or anything in-between). If you forget to label a goal, it depends on what you mean. If you mean in a general context (not near an unshown action), then no, it is not okay, since the segmentation score will be FORCED towards 0 (not the loss, which won't be 0 here, but really the final score). But if you mean forgetting to re-annotate a visible goal right after an unshown one, then indeed the loss won't change much in that context if both timestamps are close, and hence it won't be crucial in training.

Wilann commented 2 years ago

Hi @cioppaanthony,

So in summary, if I double label (either unshown-visible or visible-unshown),

In my follow-up question, I was thinking of a general context. So because the segmentation score (?) will be forced towards 0, it's actually quite bad if a goal label is forgotten. I'm actually a little confused about why the segmentation score is forced towards 0. Is the reason very technical and explained in the paper? I actually didn't understand the parts that were heavy in math. Although in my "double-label" context, I understand why it's not critical in training, since the network will still learn the action (as time-wise the two labels would normally be quite close to each other), it just won't have visibility embedded in its prediction (I believe?).

cioppaanthony commented 2 years ago

Hi @Wilann,

Yes, indeed, there is a clear explanation, don't worry. If you completely forget to label one instance (neither the shown nor the unshown tag is present), then you will be in the "Definitely no action" (red) part of Figure 2 in the paper. This means that the loss will force the segmentation score towards 0 even though there is an action occurring. This can be disruptive for the network.

This is in contrast with the case where you have, for instance, an unshown action and then forget to label the shown action coming right after, as you suggest. There, the "missed" visible action timestamp will be in the "action occurred" (green) part, and hence it won't really matter that you missed it, as the segmentation score will be pushed correctly towards 1 anyway. The same reasoning applies if you miss the unshown action right before the visible action: the missed unshown action will be within the "Possible Action" (grey) part, where the LOSS is 0. This means that in this zone, you do not force the network to predict 0 or 1 as a segmentation score. The network can predict whatever it wants without being penalized. This is what I mean by not forced towards a particular value.
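
To make the zones concrete, here is a toy sketch of how such a per-frame target/weight could be built. This is NOT CALF's actual encoding (which uses more zones and per-class K parameters), just an illustration of the reasoning above:

import numpy as np

def toy_time_shift_targets(num_frames, action_frames, K_before=120, K_after=120):
    # Toy per-frame target/weight in the spirit of the zones described above.
    target = np.zeros(num_frames, dtype=np.float32)  # default zone: score pushed towards 0
    weight = np.ones(num_frames, dtype=np.float32)   # default: the loss is applied
    for t in action_frames:
        weight[max(0, t - K_before):t] = 0.0         # "possible action": loss ignored
        target[t:t + K_after] = 1.0                  # "action occurred": score pushed towards 1
        weight[t:t + K_after] = 1.0
    return target, weight

# A completely forgotten annotation leaves its frames in the default zone, where the
# score is actively pushed towards 0 (harmful). A forgotten second label next to a
# kept one falls in a zone that already behaves correctly.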

Then, as you say, the visibility tag in CALF is not used during training or inference, only for the evaluation of the metric. But that's something you could integrate in your own network if you want to.

I hope I answered your question, don't hesitate to get back to me if there is still something unclear.

Wilann commented 2 years ago

Hi @cioppaanthony,

First of all, apologies for the (very) late reply. I've been struggling a lot with my dataset and with setting up my hyperparameter tuning pipeline, and it's taken up all of my headspace.

Regarding your explanation, I think I understand now why completely missing a label is really bad, and just missing either the "visible" or "not shown" label in my double-label pair is alright.

So basically (marking which label of the pair we keep and which we forget to label),

  1. unshown (kept) - shown (forgotten)
    • Shown label falls in the green "Action Occurred" section - segmentation score pushed towards 1
  2. unshown (forgotten) - shown (kept)
    • Unshown label falls in the grey "Possible Action" section - loss is 0, segmentation score not pushed at all

I'm still confused as to which part is responsible for all this. I believe the TSE uses K to decide these 6 "zones", which are then used by the Temporal Segmentation Loss (?) to decide all of this "segmentation score pushed towards 1", "not pushed at all", "loss is 0", etc.

As a side note, I think I'll have to take a deeper dive into your work (comparing paper to code) to really understand it inside and out.

Wilann commented 2 years ago

As a follow-up question regarding hyperparameter optimization (section 4.1), it seems there are 2 sets of optimizations:

  1. f, r, alpha_2, beta, and lambda^seg
  2. K_i^c

A few questions:

What does "resp" mean?

Also, for the Average-mAP, I'm guessing its range is [0, 1], 0 being good and 1 being bad (of course, with respect to the delta_step and number of intervals, as discussed in my other issue #30).

Also, how many GPUs were required, and which models were they?

cioppaanthony commented 2 years ago

Hi @Wilann,

Sorry for the delay, I was out of office last week.

Regarding your questions:

  1. Yes, the K parameters are just used to define the segments, and then we define the loss differently in each segment (Equations 1-6 in the paper).
  2. f corresponds to dim_capsule, which is simply a predefined number of features per class, nothing fancy.
  3. alpha_2 is only used in the spotting loss for the coordinate prediction; it corresponds to lambda_coord in the code. beta is lambda_noobj in the code and corresponds to the weight of the loss for detection when there are no predictions to make.
  4. lambda_seg really is the weight of the whole segmentation loss compared to the detection loss (see the small symbol-to-argument mapping after this list).
  5. resp. means respectively, and here it is the value of K1 for the goal class (respectively the card and substitution classes). So here K1_goal = -40, K1_card = -40 and K1_substitution = -80.
  6. The average_mAP indeed lies in [0,1], but it is the other way around: 0 is bad and 1 is perfect ;-)
  7. Only one GPU for the training and inference. I was able to train the network on a 1080TI, on an RTX 2070 Super Max-Q (laptop) and on a Tesla V100.
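
To keep the correspondence in one place, here is a small mapping from the paper symbols to the code argument names mentioned above (purely a reading aid, not a config file from the repository):

# Paper symbol -> code argument, as explained in the list above.
paper_to_code = {
    "f":          "dim_capsule",               # predefined number of features per class
    "alpha_2":    "lambda_coord",              # coordinate term of the spotting loss
    "beta":       "lambda_noobj",              # loss weight when there are no predictions to make
    "lambda_seg": "loss_weight_segmentation",  # whole segmentation loss vs. detection loss
    "K_i^c":      "K parameters",              # per-class segment boundaries, e.g. K1 = -40 (goal, card), -80 (substitution)
}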

I hope I answered your questions clearly, have a nice day :-)

Wilann commented 2 years ago

Hi @cioppaanthony,

Thank you so much again for your response. I'm really learning a lot here. I really appreciate you taking the time to answer my questions!

Training CALF on my dataset (with 5000 labels for 2 actions/classes), I'm getting around a_mAP=0.6-0.7 with delta_step=2 and num_intervals=12 (after tuning hyperparameters similar to your paper's method). Even with a decent a_mAP, most predictions have a low confidence <0.4 and the ones above 0.4 are incorrect. Do you have any idea what the issue is? Maybe I need more data? I checked your paper for SoccerNet (v1) and saw your team had ~6,600 labels for 3 classes, so I was under the impression that I had enough.

After figuring this out, I would like to improve the results by quite a bit. I understand there are many models in this dev kit for action spotting and camera shot segmentation, which I would love to dig into. I'm not concerned about how many different models are necessary to achieve a good result, but how would you combine them to improve the Average-mAP? I'm thinking of tackling CALF for camera shot segmentation next, then combining it with this CALF for action spotting with some sort of FC-layer combination at the end, but I'm not sure. Do you have any recommendations for my next steps?

Thanks again, and take care yourself :)

cioppaanthony commented 2 years ago

Hi @Wilann,

In our case, it was 6600 annotations for 3 classes, and around 150,000 for 17 classes. So indeed, it seems that 5000 labels should be sufficient for 2 classes. The next question is: how many games do you have for these 5000 labels?

Regarding the metric, if you changed the steps, then you can no longer compare the final a_mAP with our performance. For the computation of our mAP, we have deltas of [5,10,15,...,60].

I believe that you should start with the basic CALF network and try to make it work before going any further, just to make sure everything works properly before adding new layers and such. Could you please share a plot with the ground truth and your predictions so I can better see the direction you would need to take to improve your predictions? :-)

Wilann commented 2 years ago

Hi @cioppaanthony,

I have 42 matches/videos for my ~5000 annotations. My classes are "Rally Start" and "Rally End". Since badminton is made up of rallies, I don't have many videos compared to SoccerNet. I was considering splitting the videos into games as you have done for SoccerNet (then I could have a greater variety of matches for the same number of labels), but it would make checking the dataset awkward since badminton matches are best 2 out of 3, and I like keeping things tidy and even. Maybe it was a poor decision on my end?

I see. It makes sense to start simple and make sure it works first. I was just worried that maybe CALF wouldn't work so well for badminton (even though you mentioned it should generalize well to other sports), so I wasn't sure if I should've continued with labeling & hyperparameter tuning. (I'm currently still labeling and tuning more.)

Here are 2 graphs for a single video, comparing GT labels and the model's predictions - one for "Rally Start" and another for "Rally End" (you'll need to zoom in a lot to see the times):

gt_labels_vs_preds, class=Rally Start
gt_labels_vs_preds, class=Rally End

Note: The a_mAP for both classes is around 0.5 in this run.

Thank you for suggesting to checkout my model's predictions :)


Edit: If it helps, here are some graphs of the training process:

Screenshot from 2021-10-28 09-59-04

cioppaanthony commented 2 years ago

Hi @Wilann,

Alright, I just have a few questions to go on:

  1. Can you please remind me of the deltas that you use for your metric? From what I see, your predictions can be spaced as close as 15 seconds apart. Therefore, I would use a tight average_mAP metric with small deltas to evaluate your performance. For instance, you could use deltas=[1,2,3,4,5] or deltas=[1,2,3,4,5,6,7,8,9,10] (a toy sketch of this tolerance-based matching follows after this list).
  2. Also, what do the red lines (and especially their lengths) represent in your first graph?
  3. From what I see, you still manage to have good predictions around the ground-truth spots, but you also have false positives, and that's what bothers you, right? :-) If so, you may also try to increase the value of lambda_noobj, which will restrict the prediction of false positives (but maybe also some true positives; there is a trade-off).
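
Regarding point 1, here is a single-class toy version of the tolerance-based metric. It is only a simplified stand-in to build intuition and sanity-check numbers; the official evaluation code in the DevKit remains the reference:

import numpy as np

def toy_average_map(gt_times, predictions, deltas=(1, 2, 3, 4, 5)):
    # A prediction (time, confidence) counts as a TP if it lies within
    # +/- delta/2 seconds of a not-yet-matched ground-truth spot.
    aps = []
    for delta in deltas:
        preds = sorted(predictions, key=lambda p: -p[1])      # highest confidence first
        matched, tp = set(), np.zeros(len(preds))
        for i, (t, _) in enumerate(preds):
            candidates = [j for j, g in enumerate(gt_times)
                          if j not in matched and abs(t - g) <= delta / 2]
            if candidates:
                matched.add(min(candidates, key=lambda j: abs(t - gt_times[j])))
                tp[i] = 1
        precision = np.cumsum(tp) / (np.arange(len(preds)) + 1)
        recall = np.cumsum(tp) / max(len(gt_times), 1)
        ap, prev_r = 0.0, 0.0                                  # area under the P-R curve
        for p, r in zip(precision, recall):
            ap, prev_r = ap + (r - prev_r) * p, r
        aps.append(ap)
    return float(np.mean(aps))
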
Wilann commented 2 years ago

Hi @cioppaanthony,

  1. I used delta_step=2 and num_intervals=12 in those graphs. I see. I'll try delta_step=1 and num_intervals=5 next (I believe these are the values corresponding to deltas=[1,2,3,4,5]).

  2. Actually, I just realized my graphing function may be incorrect (I'm not very proficient in matplotlib, and had to ask a question on StackOverflow to get my function). I should've noticed this before sending you the image. Let me make sure it's correct first. The red lines are at different heights only for visibility (so the graph isn't too cluttered).

  3. Yes, the false positives (mostly in the middle of the video, in between many labels) are bothering me! Based on these graphs, it seems my model is also missing a lot of predictions at the end of the video. But again, let me make sure my graph is correct first. Apologies for potentially sending a wrong graph. For this run specifically, I believe frame_rate=10, lambda_coord=2.4, lambda_noobj=1.609, loss_weight_segmentation=0.06971, and receptive_field=66. I will keep in mind that I should try increasing lambda_noobj a bit.

Wilann commented 2 years ago

Alright, so my graphing function was indeed incorrect, and I have fixed it (I think). Following the hyperparameter tuning procedure in the paper (I'm trying to, at least), I have 2 runs with quite different predictions (when checking visually in the graphs), but their a_mAP values were both quite good and similar. In these runs, I'm trying to tune the context slicing parameters, and the other hyperparameters (like chunk_size) are the same as in the paper. Again, the red lines have different lengths just so the graph isn't too cluttered. Also, I've only plotted the predictions with confidence > 0.4.
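
For reference, a minimal sketch of the kind of plotting function I'm using (the variable names are placeholders, not my exact code; gt_times, pred_times and pred_confs are lists of timestamps and confidences for one class):

import matplotlib.pyplot as plt

def plot_spots(gt_times, pred_times, pred_confs, conf_thresh=0.4, title=""):
    # Ground truth as tall green lines; kept predictions as shorter red lines whose
    # heights are staggered only for readability (as in the screenshots).
    fig, ax = plt.subplots(figsize=(15, 2))
    ax.vlines(gt_times, 0, 1.0, color="green", label="ground truth")
    kept = [(t, 0.5 + 0.2 * (i % 3))
            for i, (t, c) in enumerate(zip(pred_times, pred_confs)) if c >= conf_thresh]
    if kept:
        xs, heights = zip(*kept)
        ax.vlines(xs, 0, heights, color="red", label="predictions (conf >= %.1f)" % conf_thresh)
    ax.set_xlabel("time (s)")
    ax.set_yticks([])
    ax.set_title(title)
    ax.legend(loc="upper right")
    plt.tight_layout()
    plt.show()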

I suppose my main issue is how the a_mAP can be similarly good for both runs while the graphs look vastly different.


Run 1: Decent

frame_rate = 5
lambda_coord = 4.053
lambda_noobj = 0.1119
loss_weight_segmentation = 0.04271
receptive_field = 84

gt_labels_vs_preds, class=Rally Start

Screenshot from 2021-10-28 14-45-25

Screenshot from 2021-10-28 14-45-39


Run 2: Bad

frame_rate = 10
lambda_coord = 2.4
lambda_noobj = 1.609
loss_weight_segmentation = 0.06971
receptive_field = 66

gt_labels_vs_preds, class=Rally Start

Screenshot from 2021-10-28 14-46-59

Screenshot from 2021-10-28 14-47-18

cioppaanthony commented 2 years ago

Hi @Wilann,

One possibility is that the a_mAP is not really suited for your application and data (though it should be, given its nature). Otherwise, you can maybe try to find another metric that is more in line with your final objective. However, what really surprises me is that you get around 65% a_mAP in the second case. Looking at your predictions, you have a lot of false negatives, and I'm really surprised that the metric is so high with so many false negatives. You should probably check that the metric computation is done correctly. For instance, try putting some TP, FP and FN by hand and check whether the numbers computed in the code for these values make sense.
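
For instance, something as small as this is enough for a sanity check (the exact AP value will depend on the interpolation your metric code uses, so look at the TP/FP/FN counts and the precision/recall rather than at one precise number):

# Hand-crafted case: 3 ground-truth spots (seconds) and 3 hand-placed predictions.
gt_times = [10.0, 50.0, 90.0]
predictions = [(10.5, 0.9),    # should match gt_times[0] -> TP
               (49.0, 0.8),    # should match gt_times[1] -> TP
               (200.0, 0.7)]   # far from everything      -> FP

# With a tolerance of delta = 5 s you expect:
#   TP = 2, FP = 1, FN = 1  ->  precision = 2/3, recall = 2/3
# If your metric code reports numbers far from these, the matching or
# the delta handling is the first place to look.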

It may also depend on how many games you have in your test set. If you have too few games, then the metric will not really reflect the statistical performance and might be biased by the few games you test it on.

I also see that you tried different framerates. Which feature extractor did you use for your videos? Make sure that when you change the framerate, the labels are still correctly placed in the batches for segmentation and detection (it should be the case with the base code of CALF).

Wilann commented 2 years ago

Hi @cioppaanthony,

I'm using ResNet152, but with PyTorch. I inspected the extracted features with numpy and they seem to be good. For example, these are outputs for a video extracted at fps=10 with length 45:47:

>>> feat.shape
(27473, 2048)
>>> feat[:50, 100]
array([0.25912586, 0.10281312, 0.17623998, 0.31068757, 0.36932987,
       0.6386598 , 0.30357587, 0.5376102 , 0.12990904, 0.17219543,
       0.30925706, 0.23336571, 0.16362879, 0.3703359 , 0.2105888 ,
       0.10780776, 0.08646448, 0.10057698, 0.09228317, 0.10970358,
       0.09294576, 0.19812953, 0.26739508, 0.33140934, 0.17614576,
       0.34459007, 1.329996  , 0.73833776, 0.28698385, 0.21361865,
       0.34945264, 0.4840516 , 0.5557907 , 0.43425715, 0.4496161 ,
       0.37916058, 0.58650434, 0.7809327 , 0.6177029 , 0.45498002,
       0.61750317, 0.73414   , 0.25752613, 0.17276669, 0.40049818,
       0.24729109, 0.32669747, 0.40315184, 0.3875398 , 0.3730135 ],
      dtype=float32)
>>> feat_pca.shape
(27473, 512)
>>> feat_pca[:50, 100]
array([ 0.67958355, -0.01081022, -0.41926676, -1.1407776 , -0.11782354,
       -0.17993724,  0.02105877,  1.0736198 ,  0.5956228 , -0.5200749 ,
        0.03562607, -0.1283502 , -0.15107583, -0.5896329 , -0.36711633,
       -0.13544214, -0.12002121,  0.14902562,  0.24453345,  0.7745506 ,
        1.0626502 ,  0.2881042 , -0.9140971 , -1.017564  , -1.2318994 ,
       -1.1789461 , -1.8663653 , -1.1376855 , -1.1062744 , -0.7513017 ,
       -0.830442  , -0.16196132,  0.11632629, -0.28599405, -0.5110555 ,
        0.31296253,  0.10932852, -0.57404655, -1.0925784 , -0.7976085 ,
       -0.22735938, -0.91415447,  0.03423226, -1.3002524 , -0.02556402,
       -0.10176659,  0.0097625 ,  0.47112215, -0.08023313, -0.09160286],
      dtype=float32)
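
For completeness, the extraction pipeline is roughly the following (a sketch assuming torchvision's ResNet-152 and scikit-learn's PCA; in practice the PCA should be fit on features pooled from many videos rather than a single one):

import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as T
from sklearn.decomposition import PCA

# ResNet-152 backbone without the classification head: pooled features are 2048-D.
backbone = models.resnet152(pretrained=True)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize(256), T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_features(frames):
    # frames: list of HxWx3 uint8 RGB arrays sampled at the chosen fps.
    batch = torch.stack([preprocess(f) for f in frames])
    return backbone(batch).cpu().numpy()                      # (num_frames, 2048)

def reduce_with_pca(feat, n_components=512):
    # Reduce to 512 dimensions, matching the shape of feat_pca above.
    pca = PCA(n_components=n_components)
    return pca.fit_transform(feat).astype(np.float32)         # (num_frames, 512)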

When checking the stem graphs from all (4) videos in the test set for "Run 2", they're all similar - does this mean the metric is not biased? If so, I suppose I may have made a mistake when editing the a_mAP calculation (when adding variables for delta_step and num_intervals) - I will check on this, as well as with manual values. I will also check that the labels are correctly placed in batches for different frame rates. For the second issue, could you please let me know which part of the code I should check? Is it SoccerNetClips and SoccerNetClipsTesting?

Edit: I can confirm that the averaging is correct for the a_mAP, so I'll check the calculation from mAP instead. From Run 2 (Bad) (Rally Start: 0.6442, Rally End: 0.6305): Screenshot from 2021-10-29 09-16-08

Wilann commented 2 years ago

Hi again @cioppaanthony,

After a lot of debugging, I think I came across the reason why my metric wasn't accurate. I think it's because sometimes I was using K, chunk_size and receptive_field instead of K*frame_rate, chunk_size*frame_rate and receptive_field*frame_rate respectively. I've changed the code so that every time it needs one of these variables, I pass it the latter version. It seems to have worked, since the metric now appears to be accurate.

(Alternatively, did the code intend to use K*frame_rate but keep chunk_size and receptive_field as they are?)
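
In other words, what I changed boils down to a conversion like this (a toy illustration of what I mean, not code from the repository):

def to_frames(value_in_seconds, frame_rate):
    # Convert second-based hyperparameters (K, chunk_size, receptive_field)
    # into the frame units the feature tensors are indexed with.
    return int(value_in_seconds * frame_rate)

# e.g. receptive_field = 66 s at frame_rate = 10 fps -> 660 feature frames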


Surprisingly, this made another error come up. In the training dataset class, I found that class_selection = random.randint(0, self.num_classes) was getting classes 0-2 inclusive (while I believe it should've been from 0-1). I've therefore changed it to class_selection = random.randint(0, self.num_classes-1).

I then found another issue when accumulating the match anchors for each class in this block:

for anchor in anchors:
    self.match_anchors[anchor[2]].append(anchor)

Since the anchor was sampled as anchor = self.match_anchors[class_selection][event_selection][1] in __getitem__, it was interfering with start = anchor + shift, since anchor[1] was usually an int but could sometimes be a list. I fixed this by making sure anchor[1] was an int before appending the anchor.

Finally, the initial part of my __getitem__ now looks like this - I changed the indexes slightly with -1s so that indexing errors don't occur (I'm not sure where they've even come from):

class_selection = random.randint(0, self.num_classes-1)
event_selection = random.randint(0, len(self.match_anchors[class_selection]) - 1)
game_index = self.match_anchors[class_selection][event_selection][0] - 1
anchor = self.match_anchors[class_selection][event_selection][1]

Although the metric seems to be accurate now, in my hyperparameter tuning, I'm consistently getting results like this, no matter which hyperparameters I try:

(Notice how the losses drop very quickly)

Screenshot from 2021-11-18 07-41-43

gt_labels_vs_preds for Rally End (20190519 TOTAL BWF Sudirman Cup 2019, G1 XD, HOKI/NAGAHARA (JPN) vs ALIMOV/DAVLETOVA (RUS))

I'm not sure what the cause of this is, but I'm currently looking into the Context-Aware Model, general pyramid models, etc. to learn more, and I hope to eventually change the model myself to better fit my task.

I know this has been a long thread, but I hope you can provide a few more insights to why the training may be acting this way. Thank you for reading!

cioppaanthony commented 2 years ago

Hi @Wilann! Sorry about my late reply, I did not see your previous comment.

Regarding the multiplication by the framerate, everything should already be done in the original code (see for instance https://github.com/SilvioGiancola/SoccerNetv2-DevKit/blob/main/Task1-ActionSpotting/CALF/src/dataset.py#L39 and https://github.com/SilvioGiancola/SoccerNetv2-DevKit/blob/main/Task1-ActionSpotting/CALF/src/main.py#L28). I'm not sure where you needed to add it again.

Regarding the number of classes, it is better to change the variable directly here: https://github.com/SilvioGiancola/SoccerNetv2-DevKit/blob/main/Task1-ActionSpotting/CALF/src/dataset.py#L37 and here: https://github.com/SilvioGiancola/SoccerNetv2-DevKit/blob/main/Task1-ActionSpotting/CALF/src/dataset.py#L170, since the variable is also used in the evaluation, the creation of the model, etc. This might be the issue causing the out-of-range value. You should have 2 classes in your case.

Finally, regarding self.match_anchors[anchor[2]].append(anchor), I'm not sure you should modify anything there. The issue might come only from the previous change of the number of classes. This line simply specifies that we sample the anchors per class (for class balancing). In __getitem__, I don't think any modifications are needed either.

Could you please try the suggested modifications and tell me if you manage to get better results?

Wilann commented 2 years ago

Hi @cioppaanthony,

No worries. I'm just thankful you're replying at all.

After reverting my changes to dataset.py and writing up an explanation of the list of errors I was encountering, I realized that it all stemmed from enumerating my list of matches starting from 1 instead of 0.

The training actually seems to vary now with each run. Sorry for all the trouble. The issue was never with the original code to begin with, but with a simple counting mistake on my end.


Regarding the training now, my validation loss/metric are different with each run, but my training loss is still decreasing very rapidly in the beginning and quickly leveling out. What do you think is the cause of this? Of course, more data is always better, but do you think that would be a good solution?

For my dataset, I now have 76 matches, with ~9500 labels between my 2 classes. I've also horizontally flipped all the videos (while using the same labels, and re-computing ResNet features) to try and augment my dataset. This gives me 152 matches with ~19000 labels. I've added this option to my hyperparameter tuning code. I'm currently using a 70-20-10 split for training-validation-testing.
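
The flipping itself is straightforward (a sketch with OpenCV; the event timestamps stay valid because only the spatial content changes, and the ResNet features are then re-extracted from the mirrored video):

import cv2

def flip_video(in_path, out_path):
    # Write a horizontally mirrored copy of a video; event timestamps are unchanged.
    cap = cv2.VideoCapture(in_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    out = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        out.write(cv2.flip(frame, 1))   # flipCode=1 -> horizontal mirror
    cap.release()
    out.release()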

An example of a run now - the training loss always follows this trend: Screenshot from 2021-11-19 09-09-38

I see this as the model fitting well to the training set, but failing to generalize. Would you also say so?

Wilann commented 2 years ago

@cioppaanthony I realized that my questions here are starting to diverge from SoccerNet and CALF in general. Apologies for posting them here. Thank you so much for your help so far in getting the pipeline to work for my dataset - I really appreciate it! Looking forward to your new research discoveries in the future! :)

cioppaanthony commented 2 years ago

Hi @Wilann,

Sorry about the late replies lately, it's been a few very busy weeks. :-)

Regarding the training, I believe that you should be able to train your model with the data you have. At least get a first set of reasonable performances before annotating more data.

I guess that right now it's really a matter of fine-tuning the training parameters, and checking for errors that might have slipped into the code.

I wish you the best of luck in your work!