PySport / kloppy

kloppy: standardizing soccer tracking- and event data
https://kloppy.pysport.org
BSD 3-Clause "New" or "Revised" License

Creating support for expected goals and game state values in our kloppy data model #280

Open DriesDeprest opened 6 months ago

DriesDeprest commented 6 months ago

Expected goals

By adding an optional xg attribute to our ShotEvent class, we can support the widely used expected goals property in kloppy. This property could be populated from the raw input data during deserialization (e.g. StatsBomb) or calculated by the user at a later stage with an xG model of choice.

Proposed implementation:

@dataclass(repr=False)
@docstring_inherit_attributes(Event)
class ShotEvent(Event):
    """
    ShotEvent
    Attributes:
        event_type (EventType): `EventType.SHOT` (See [`EventType`][kloppy.domain.models.event.EventType])
        event_name (str): `"shot"`,
        result_coordinates (Point): See [`Point`][kloppy.domain.models.pitch.Point]
        result (ShotResult): See [`ShotResult`][kloppy.domain.models.event.ShotResult]
        xg (ExpectedGoal): See [`ExpectedGoal`][kloppy.domain.models.event.ExpectedGoal]
    """
    result: ShotResult
    result_coordinates: Point = None
    event_type: EventType = EventType.SHOT
    event_name: str = "shot"
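    # Needs a default because it follows defaulted fields; quoted because
    # ExpectedGoal is defined further down.
    xg: Optional["ExpectedGoal"] = None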

@dataclass
class ExpectedGoal:
    """
    Expected goal metrics of an event

    Attributes:
         xg: The probability of scoring from the shot situation, not considering shot execution characteristics
         execution_xg: The probability of scoring following the execution of the shot
         gk_difficulty_xg: The probability of a goalkeeper conceding a goal
    """

    xg: Optional[float] = field(default=None)
    execution_xg: Optional[float] = field(default=None)
    gk_difficulty_xg: Optional[float] = field(default=None)

    @property
    def net_shot_execution(self) -> Optional[float]:
        return None if None in (self.xg, self.execution_xg) else self.execution_xg - self.xg
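
As a rough usage sketch (the numbers are made up, and the class is repeated here only so the snippet runs on its own), the net shot execution is the post-shot value minus the pre-shot value, and it stays None whenever one of the two is missing:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExpectedGoal:
    # Same fields as the proposal above, repeated so the snippet is standalone.
    xg: Optional[float] = None
    execution_xg: Optional[float] = None
    gk_difficulty_xg: Optional[float] = None

    @property
    def net_shot_execution(self) -> Optional[float]:
        if self.xg is None or self.execution_xg is None:
            return None
        return self.execution_xg - self.xg


# Illustrative values only, not taken from any provider.
well_struck = ExpectedGoal(xg=0.10, execution_xg=0.35)
pre_shot_only = ExpectedGoal(xg=0.10)

print(well_struck.net_shot_execution)    # roughly 0.25 (floating point)
print(pre_shot_only.net_shot_execution)  # None, because execution_xg is missing
```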

Game state values

By adding an optional gsv attribute to our Event class, holding gsv_scoring_before, gsv_scoring_after, gsv_conceding_before and gsv_conceding_after values, we can support the widely used game-state-based value models in kloppy. These values could be populated from the raw input data during deserialization (e.g. StatsBomb's on-the-ball value models) or calculated by the user at a later stage with a game state value model of choice (e.g. VAEP).

Proposed implementation:

@dataclass
@docstring_inherit_attributes(DataRecord)
class Event(DataRecord, ABC):
    """
    Abstract event baseclass. All other event classes inherit from this class.

    Attributes:
        event_id: identifier given by provider
        team: See [`Team`][kloppy.domain.models.common.Team]
        player: See [`Player`][kloppy.domain.models.common.Player]
        coordinates: Coordinates where event happened. See [`Point`][kloppy.domain.models.pitch.Point]
        raw_event: Dict
        state: Dict[str, Any]
        qualifiers: See [`Qualifier`][kloppy.domain.models.event.Qualifier]
        gsv: See [`GameStateValue`][kloppy.domain.models.event.GameStateValue]
    """

    event_id: str
    team: Team
    player: Player
    coordinates: Point

    result: Optional[ResultType]
    gsv: Optional["GameStateValue"]  # quoted: GameStateValue is defined below

    raw_event: Dict
    state: Dict[str, Any]
    related_event_ids: List[str]

    qualifiers: List[Qualifier]

    freeze_frame: Optional["Frame"]

@dataclass
class GameStateValue:
    """
    Game state value metrics of an event.

    Attributes:
        gsv_scoring_before (Optional[float]): The probability that the team will score within the next X actions, estimated from the game state before the event.
        gsv_scoring_after (Optional[float]): The probability that the team will score within the next X actions, estimated from the game state after the event.
        gsv_conceding_before (Optional[float]): The probability that the team will concede a goal within the next X actions, estimated from the game state before the event.
        gsv_conceding_after (Optional[float]): The probability that the team will concede a goal within the next X actions, estimated from the game state after the event.
    """

    gsv_scoring_before: Optional[float] = field(default=None)
    gsv_scoring_after: Optional[float] = field(default=None)
    gsv_conceding_before: Optional[float] = field(default=None)
    gsv_conceding_after: Optional[float] = field(default=None)

    @property
    def gsv_scoring_net(self) -> Optional[float]:
        return None if None in (self.gsv_scoring_before, self.gsv_scoring_after) else self.gsv_scoring_after - self.gsv_scoring_before

    @property
    def gsv_conceding_net(self) -> Optional[float]:
        return None if None in (self.gsv_conceding_before, self.gsv_conceding_after) else self.gsv_conceding_after - self.gsv_conceding_before

    @property
    def gsv_total_net(self) -> Optional[float]:
        if None in (self.gsv_scoring_before, self.gsv_scoring_after, self.gsv_conceding_before, self.gsv_conceding_after):
            return None
        return (self.gsv_scoring_after - self.gsv_scoring_before) - (self.gsv_conceding_after - self.gsv_conceding_before)
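
To make the arithmetic concrete, here is a small worked sketch with invented probabilities for a single progressive pass. The class is a trimmed copy of the proposal above so the snippet runs on its own; only the total net value is shown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GameStateValue:
    # Trimmed copy of the proposed class, kept standalone for illustration.
    gsv_scoring_before: Optional[float] = None
    gsv_scoring_after: Optional[float] = None
    gsv_conceding_before: Optional[float] = None
    gsv_conceding_after: Optional[float] = None

    @property
    def gsv_total_net(self) -> Optional[float]:
        values = (
            self.gsv_scoring_before,
            self.gsv_scoring_after,
            self.gsv_conceding_before,
            self.gsv_conceding_after,
        )
        if any(v is None for v in values):
            return None
        scoring_net = self.gsv_scoring_after - self.gsv_scoring_before
        conceding_net = self.gsv_conceding_after - self.gsv_conceding_before
        # A rise in scoring chance counts for the action, a rise in
        # conceding chance counts against it.
        return scoring_net - conceding_net


# Invented values: scoring chance rises from 2% to 5%,
# conceding chance rises from 1% to 1.5%.
gsv = GameStateValue(
    gsv_scoring_before=0.02,
    gsv_scoring_after=0.05,
    gsv_conceding_before=0.01,
    gsv_conceding_after=0.015,
)
print(gsv.gsv_total_net)  # roughly 0.025, i.e. (0.05 - 0.02) - (0.015 - 0.01)
```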

Any feedback is highly welcome!

probberechts commented 6 months ago

A few thoughts:

  1. I would also add xA, xT, execution ratings, decision ratings, win probability, pitch control, pitch influence, and pressing intensity. 😄 But jokes aside, my main point is that if we create a separate field for each metric, things could get pretty complex and it might quickly explode. I think it's a better idea to use a single list, dict or custom container to store all metrics.
  2. I would attach this container for metrics to the DataRecord class since it is also possible to compute metrics for tracking data frames (e.g., pitch control).
  3. I suppose we want a base class for metrics and a few subclasses.
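
from abc import ABC
from dataclasses import dataclass, field
from typing import Dict, Optional

import numpy as np

# 'Provider', 'Player' and 'Point' refer to kloppy's domain classes and are
# quoted so that this sketch does not depend on a particular import path.


# kw_only=True (Python 3.10+) lets subclasses add required fields after the
# defaulted `provider` field without breaking dataclass field ordering.
@dataclass(kw_only=True)
class Metric(ABC):
    name: str
    provider: Optional["Provider"] = None


@dataclass(kw_only=True)
class ScalarMetric(Metric):
    value: float


@dataclass(kw_only=True)
class PlayerMetric(Metric):
    value: Dict["Player", float]


@dataclass(kw_only=True)
class SurfaceMetric(Metric):
    value: np.ndarray

    def value_at(self, loc: "Point") -> float:
        # Assumes `loc` is already expressed in grid-cell units (row = y, column = x).
        return self.value[int(loc.y), int(loc.x)]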

Then, you can define classes for the most common metrics as

@dataclass(kw_only=True)
class ExpectedGoals(ScalarMetric):
    """Expected goals"""

    name: str = "xG"


@dataclass(kw_only=True)
class PostShotExpectedGoals(ScalarMetric):
    """Post-shot expected goals"""

    name: str = "PsXG"


# Derives from Metric rather than ScalarMetric: `value` is computed here, and a
# read-only property cannot replace the `value` field that ScalarMetric declares.
@dataclass(kw_only=True)
class GameStateValue(Metric):
    """Game state value"""

    gsv_scoring_before: Optional[float] = field(default=None)
    gsv_scoring_after: Optional[float] = field(default=None)
    gsv_conceding_before: Optional[float] = field(default=None)
    gsv_conceding_after: Optional[float] = field(default=None)

    @property
    def gsv_scoring_net(self) -> Optional[float]:
        return None if None in (self.gsv_scoring_before, self.gsv_scoring_after) else self.gsv_scoring_after - self.gsv_scoring_before

    @property
    def gsv_conceding_net(self) -> Optional[float]:
        return None if None in (self.gsv_conceding_before, self.gsv_conceding_after) else self.gsv_conceding_after - self.gsv_conceding_before

    @property
    def value(self) -> Optional[float]:
        if None in (self.gsv_scoring_before, self.gsv_scoring_after, self.gsv_conceding_before, self.gsv_conceding_after):
            return None
        return (self.gsv_scoring_after - self.gsv_scoring_before) - (self.gsv_conceding_after - self.gsv_conceding_before)

  4. I am not sure whether "metric" is the right terminology here. In the context of soccer analysis, a metric typically involves the aggregation or analysis of multiple data points. For example, if you are tracking the number of goals scored by a soccer player in each game, each individual game's goal count would be a data point. If you calculate the average number of goals scored per game over a season, that average becomes a metric. To make this distinction, I prefer to use "statistic" in the context of a single data point.
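
A tiny illustration of that distinction, using invented goal counts: each per-game count is a data point (a "statistic" in the sense above), while the average over those games is a metric.

```python
from statistics import mean

# Hypothetical goal counts for one player over five games (data points).
goals_per_game = [1, 0, 2, 0, 1]

# Aggregating the data points yields a metric in the sense used above.
average_goals = mean(goals_per_game)
print(average_goals)  # 0.8
```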

DriesDeprest commented 6 months ago

  1. Good point 😅. I agree that adding a list of Statistics is probably a better way to keep it clean and still have a lot of flexibility in adding statistics.

  2. Agree.

  3. Makes sense!

  4. Fine to use "statistic" as the terminology here.

Below is an updated version of the DataRecord class, based on your input:

@dataclass
class DataRecord(ABC):
    """
    DataRecord

    Attributes:
        dataset: Reference to the dataset this record belongs to.
        prev_record: Reference to the previous DataRecord.
        next_record: Reference to the next DataRecord.
        period: See [`Period`][kloppy.domain.models.common.Period]
        timestamp: Timestamp of occurrence.
        ball_owning_team: See [`Team`][kloppy.domain.models.common.Team]
        ball_state: See [`BallState`][kloppy.domain.models.common.BallState]
        statistics: List of Statistics associated with this record.
    """

    dataset: "Dataset" = field(init=False)
    prev_record: Optional["DataRecord"] = field(init=False)
    next_record: Optional["DataRecord"] = field(init=False)
    period: "Period"
    timestamp: float
    ball_owning_team: Optional["Team"]
    ball_state: Optional["BallState"]
    statistics: List[Statistic] = field(default_factory=list)
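
For illustration, here is one way such a list could be queried by type. This is only a sketch: `Statistic`, its subclasses, the `Record` stand-in for DataRecord and the `get_statistic` helper are all hypothetical names, not existing kloppy API.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Type, TypeVar


@dataclass
class Statistic:
    # Hypothetical base class; the real design is still open.
    name: str
    value: float


class ExpectedGoals(Statistic):
    """Hypothetical xG statistic."""


class PostShotExpectedGoals(Statistic):
    """Hypothetical post-shot xG statistic."""


S = TypeVar("S", bound=Statistic)


@dataclass
class Record:
    # Stand-in for DataRecord, reduced to the fields needed here.
    timestamp: float
    statistics: List[Statistic] = field(default_factory=list)

    def get_statistic(self, statistic_type: Type[S]) -> Optional[S]:
        # Return the first statistic of the requested type, or None.
        for statistic in self.statistics:
            if isinstance(statistic, statistic_type):
                return statistic
        return None


# Invented value, as if filled in by a provider-specific deserializer.
shot = Record(timestamp=12.3, statistics=[ExpectedGoals(name="xG", value=0.12)])

xg = shot.get_statistic(ExpectedGoals)
psxg = shot.get_statistic(PostShotExpectedGoals)
print(xg.value if xg else None)  # 0.12
print(psxg)                      # None, no post-shot xG was attached
```

Using `default_factory=list` keeps records without any statistics valid, so existing deserializers would not need to change.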

I'll probably be working on implementing this in the near future and adding a parser for StatsBomb.