ML-KULeuven / socceraction

Convert soccer event stream data to SPADL and value player actions using VAEP or xT
MIT License

Question Regarding VAEP Scoring and Conceding Labeling #767

Closed GunHeeJoe closed 2 months ago

GunHeeJoe commented 2 months ago

Hello,

I am a researcher deeply engaged in studying the VAEP framework. I have been exploring various ways to enhance VAEP, such as using an ANN instead of CatBoost and addressing the class imbalance in the scoring and conceding labels. My interest in VAEP is profound, and I have a question regarding its implementation. This is the link to my repository: https://github.com/GunHeeJoe/VAEP

The data I am using for my research includes all seasons of LaLiga. However, I have encountered an issue when labeling the scoring and conceding events. There are instances where both labels are marked as True. For example, in the following case:

As shown in the attached image, A team scores due to event 78200 (shot), and then B team scores at event 78206 (shot), causing the conceding labels for events 78197, 78198, 78199, and 78200 (shot) to also be True.

Isn't this problematic? VAEP predicts the probabilities of scoring and conceding within the next 10 actions following a particular event. If a shot results in a goal at event 78200, it should be considered a new possession from the next event onwards. Therefore, the conceding labels for events 78197, 78198, 78199, and 78200 (shot) should be False, and the labeling should proceed from subsequent events.
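A minimal sketch of the labeling behavior described above, in plain Python rather than the library's own code. The event IDs mirror the example; the team of each in-between event, the `label_actions` helper, and its `stop_at_goal` flag are assumptions made for illustration, and own goals are ignored. The window covers the action itself plus the following k - 1 actions.

```python
# Toy action stream: (event_id, team, is_goal).
# Teams of the events between the two goals are assumed.
ACTIONS = [
    (78197, "A", False),
    (78198, "A", False),
    (78199, "A", False),
    (78200, "A", True),   # team A scores
    (78201, "B", False),  # restart by team B
    (78202, "B", False),
    (78203, "B", False),
    (78204, "B", False),
    (78205, "B", False),
    (78206, "B", True),   # team B scores
]


def label_actions(actions, k=10, stop_at_goal=False):
    """Return {event_id: (scores, concedes)} for a window of k actions.

    With stop_at_goal=True (hypothetical variant), the window is cut at
    the first goal it contains, so later goals no longer affect the label.
    """
    labels = {}
    for i, (event_id, team, _) in enumerate(actions):
        scores = concedes = False
        for _, acting_team, is_goal in actions[i:i + k]:
            if is_goal:
                if acting_team == team:
                    scores = True
                else:
                    concedes = True
                if stop_at_goal:
                    break
        labels[event_id] = (scores, concedes)
    return labels


plain = label_actions(ACTIONS)
cut = label_actions(ACTIONS, stop_at_goal=True)
print(plain[78197], cut[78197])
```

With the plain window, events 78197 through 78200 receive both labels; with the window cut at the first goal, only the scoring label remains for them.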

To address this, I have written some code and would appreciate it if you could review it for reference. labels.py code: https://github.com/GunHeeJoe/VAEP/blob/main/socceraction/vaep/labels.py

Thank you very much for your time and consideration.

madestro commented 2 months ago

Hello @GunHeeJoe, I'm also a researcher interested in VAEP, and I also have a background in football management (and in Football Manager until 2006 ;-) ). If I understand VAEP and the probabilities correctly, the events of scoring and conceding are treated as independent, so you can have P(score) + P(concede) > 1.0.

probberechts commented 2 months ago

This is patched in the implementation of the VAEP formula:

https://github.com/ML-KULeuven/socceraction/blob/b6943d20f7c512b981d06e0ebdad5aa1bc26c400/socceraction/vaep/formula.py#L54-L58

GunHeeJoe commented 2 months ago

Hello @madestro,

Thank you for your insightful response.

I understand that scoring and conceding probabilities are treated independently, which means P(score) + P(concede) > 1.0 is possible. However, I am concerned about some cases where this might lead to incorrect evaluations. For example, if a team scores and then quickly concedes, it seems the actions leading to the goal might also be considered as leading to the conceding event. Similarly, if a team scores right after the second half begins, wouldn't the actions from the end of the first half also contribute to the scoring event? Based on the VAEP framework, each action's influence on scoring and conceding probabilities is calculated independently. Therefore, actions contributing to a scoring event should not simultaneously be labeled as contributing to a conceding event. I believe this is to ensure the accuracy and independence of each action's evaluation.

Could you please confirm if this understanding is correct? Your expertise would be greatly appreciated. Thank you.

GunHeeJoe commented 2 months ago

Hello @probberechts

Thank you for your response. I did not consider the post-processing in the VAEP formula. Your explanation is much appreciated. This approach resolves the issue of not considering the goal probability of the previous shot after a kickoff.

However, I am concerned that this does not address the problem during the training phase, as incorrect labels might still be used for learning. Do you think my concern is valid? Or, as @madestro mentioned, since scoring and conceding are treated independently, is it possible for labels to be assigned simultaneously?

While post-processing ensures the accuracy of VAEP metrics, it seems that during training, the labels might still be incorrect. I would appreciate your insights on this matter.

Thank you.

madestro commented 2 months ago

Hello @GunHeeJoe, from a football point of view, this can happen more often than you might think. If a team is not mentally well prepared (or is very young), the players can lose concentration after scoring, so it is quite plausible that a goal is followed by conceding one. Regards! Javier

probberechts commented 2 months ago

The task that is solved during the training phase is to estimate the probability of scoring and conceding in the near future for the team in possession. Note that the "near future" is not a possession but rather a window that is defined by the next k actions. That is because the probability of scoring or conceding does not suddenly drop to zero after a turnover (a team can counterpress). k is a user-defined parameter that represents how far ahead in the future we look to determine the effect of an action. In the original paper, we chose k = 10.

As @madestro pointed out, the assumption is indeed made that the events of scoring and conceding are independent. The task therefore simplifies to two separate binary probabilistic classification problems with identical inputs but different labels. So, with the current setup, I really do not see how this could be problematic.

Even if you would like to estimate these probabilities in a dependent way, I don't see why it would be a problem. A team can both score and concede a goal in the next 10 actions, as your example illustrates. So, you are modeling the reality.

Obviously, you don't want to penalize a player for scoring a goal because the opponent might score from the subsequent kick-off (even though the penalty would be very small, because a team almost never concedes within 10 actions of scoring). Therefore, the probabilities are reset in the VAEP formula after a goal.
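The reset can be sketched roughly as follows. This is a simplified illustration, not the library's code: it assumes the reset means the previous-state probabilities are treated as zero for the action that follows a goal, and it values an action as the change in scoring probability minus the change in conceding probability. The `vaep_values` helper, its `reset_after_goal` flag, and the probabilities in the example are hypothetical; the linked lines in `formula.py` remain the authoritative version.

```python
def vaep_values(p_scores, p_concedes, teams, is_goal, reset_after_goal=True):
    """Value each action as delta P(score) - delta P(concede).

    p_scores[i] and p_concedes[i] are the estimated probabilities after
    action i, from the point of view of the team performing action i.
    """
    values = []
    for i in range(len(teams)):
        if i == 0 or (reset_after_goal and is_goal[i - 1]):
            # No usable previous state: start from zero.
            prev_score = prev_concede = 0.0
        elif teams[i - 1] == teams[i]:
            prev_score, prev_concede = p_scores[i - 1], p_concedes[i - 1]
        else:
            # Possession changed: the other team's scoring chance was
            # this team's conceding chance, and vice versa.
            prev_score, prev_concede = p_concedes[i - 1], p_scores[i - 1]
        offensive = p_scores[i] - prev_score
        defensive = -(p_concedes[i] - prev_concede)
        values.append(offensive + defensive)
    return values


# Made-up probabilities: a pass, a goal by team A, then team B's kick-off.
teams = ["A", "A", "B"]
is_goal = [False, True, False]
p_scores = [0.10, 0.90, 0.02]
p_concedes = [0.01, 0.03, 0.01]

with_reset = vaep_values(p_scores, p_concedes, teams, is_goal)
without_reset = vaep_values(p_scores, p_concedes, teams, is_goal,
                            reset_after_goal=False)
print(with_reset[2], without_reset[2])
```

In this toy case the kick-off is valued at roughly 0.01 with the reset and roughly 0.88 without it, because without the reset the goal's high scoring probability carries over as a conceding probability that the kick-off appears to remove.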

GunHeeJoe commented 2 months ago

Hello, @probberechts @madestro Thank you both for your responses. Although the term "independent" was a bit unclear to me, as @madestro mentioned, it makes sense to consider situations where a team concedes a goal due to a lack of mental preparedness after scoring. Additionally, I now understand better given that the concept of possession is not introduced.

The reason for my concern was that I approached VAEP from a multi-class classification perspective. Specifically, I considered three categories: scoring, conceding, and situations where neither scoring nor conceding occurred. Therefore, I needed to address scenarios where both scoring and conceding are labeled simultaneously. Thank you for your thoughtful responses.
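To make the multi-class difficulty concrete, here is a small sketch under assumed names (`to_multiclass` and `CLASS_NAMES` are hypothetical). A three-class scheme has no slot for an action whose window contains goals for both teams, so one option is to add a fourth "both" class; the other is to keep two separate binary labels, as the current setup does.

```python
# Hypothetical encoding of the two binary labels as a single class index.
CLASS_NAMES = {0: "neither", 1: "score", 2: "concede", 3: "both"}


def to_multiclass(scores: bool, concedes: bool) -> int:
    """Map a (scores, concedes) label pair to one of four class indices."""
    return int(scores) + 2 * int(concedes)


for pair in [(False, False), (True, False), (False, True), (True, True)]:
    print(pair, "->", CLASS_NAMES[to_multiclass(*pair)])
```

The fourth class will be extremely rare, which is one practical argument for the two-binary-classifier formulation.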