PySport / kloppy

kloppy: standardizing soccer tracking- and event data
https://kloppy.pysport.org
BSD 3-Clause "New" or "Revised" License

[Wyscout V3] Adding substitution events #367

Open DriesDeprest opened 4 days ago

DriesDeprest commented 4 days ago

In Wyscout V3 data, substitutions are not listed in the event stream but are defined separately in the raw event file. What is the best way to handle them in the deserializer?

I can create them after we have iterated through all the raw events, but how should I best go about adding them to the records? Do we already have a method that orders a list of events by period and timestamp, or should I create one?
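
For illustration, ordering by period and timestamp can be a plain sort on a tuple key. This is a minimal sketch: the `Event` dataclass and `sort_events` are hypothetical stand-ins, not kloppy's event model or an existing kloppy helper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Event:
    # Hypothetical minimal event; kloppy's real event classes carry more fields.
    period_id: int
    timestamp: float  # seconds since the start of the period
    event_type: str


def sort_events(events: List[Event]) -> List[Event]:
    """Return events ordered by period, then by timestamp within the period."""
    # sorted() is stable, so events with identical keys keep their input order.
    return sorted(events, key=lambda e: (e.period_id, e.timestamp))
```

Because the sort is stable, appending the substitutions after the regular events and then sorting would place a substitution after any regular event that shares its timestamp.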

probberechts commented 3 days ago

Inserting substitutions into the event stream is more complex than simple sorting because Wyscout only provides the minute in which a substitution occurred, not the precise timestamp.

You would need to:

  1. Identify game interruptions within the substitution minute's window.
  2. Use the interruption duration as a tiebreaker if multiple interruptions occur in that window.
  3. Insert the substitution event just before the corresponding game restart.
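
The three steps above could be sketched roughly as follows. Everything here is an assumption made for illustration: the `RawEvent` type, the `RESTART_TYPES` set, the reading of "minute N" as the window from `N * 60` up to `(N + 1) * 60` seconds, and the choice that the longest interruption wins the tiebreak.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical restart labels; not Wyscout's actual event taxonomy.
RESTART_TYPES = {"throw_in", "free_kick", "corner", "goal_kick", "kick_off"}


@dataclass
class RawEvent:
    second: float  # seconds since the start of the period
    type: str


def find_insert_index(events: List[RawEvent], minute: int) -> Optional[int]:
    """Return the index of the restart before which the substitution belongs."""
    window_start, window_end = minute * 60, (minute + 1) * 60
    best_index: Optional[int] = None
    best_gap = -1.0
    for i in range(1, len(events)):
        prev, curr = events[i - 1], events[i]
        # Step 1: an interruption ends with a restart inside the minute window.
        if curr.type in RESTART_TYPES and window_start <= curr.second < window_end:
            # Step 2: break ties by interruption duration (assumed: longest wins).
            gap = curr.second - prev.second
            if gap > best_gap:
                best_index, best_gap = i, gap
    # Step 3: the caller inserts the substitution at this index,
    # which places it just before the restart event.
    return best_index
```

A return value of `None` means no restart was found in the window, and the caller would need a fallback.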

DriesDeprest commented 3 days ago

Thanks for the input, Pieter!

What would happen if we mark the substitutions as happening exactly at the minute provided by Wyscout? I know we could then be off by up to 60 seconds. Would this cause issues other than the substitution time being slightly off?

The reason I'm asking is that I don't really need second-level detail for substitutions, so I would suggest first creating a PR that introduces substitutions using the naive approach. If someone later needs second-level detail, they could enhance the logic by identifying game interruptions to place substitutions more precisely.

I just don't want the desire for a perfect solution to stand in the way of already implementing the main goal, getting the substitution information (albeit with a lower level of detail) in the dataset.

What do you think?

probberechts commented 3 days ago

You will get substitution events when the ball is in play and logical event sequences will get interrupted (e.g., you could get a substitution between a player's carry and pass). It just doesn't make sense at all.

Also, it will break code that derives things from subsequent events. For example, I have some logic that determines whether the ball is in play. This will break.

It's really not that hard to implement it correctly.

DriesDeprest commented 3 days ago

Okay, if it results in interrupted event sequences and can break downstream logic, it should indeed be done correctly from the start. I'll share a PR soon. Thanks for thinking this through together!

DriesDeprest commented 6 minutes ago

@probberechts when reviewing the substitutions in the Wyscout events v3 file, it looks like they are expressed not at minute granularity but in exact seconds.

I assume we thus do not need to identify game interruptions and just insert this into the records between the events happening before and after the substitution?

"substitutions": {
        "3159": {
            "2H": {
                "1278": {
                    "in": [
                        {
                            "playerId": 20395
                        }
                    ],
                    "out": [
                        {
                            "playerId": 489124
                        }
                    ]
                },
                "1951": {
                    "in": [
                        {
                            "playerId": 361807
                        }
                    ],
                    "out": [
                        {
                            "playerId": 20751
                        }
                    ]
                },
                "2192": {
                    "in": [
                        {
                            "playerId": 105334
                        },
                        {
                            "playerId": 345695
                        }
                    ],
                    "out": [
                        {
                            "playerId": 472363
                        },
                        {
                            "playerId": 20461
                        }
                    ]
                }
            }
        },
        "3164": {
            "2H": {
                "4": {
                    "in": [
                        {
                            "playerId": 703
                        },
                        {
                            "playerId": 20479
                        },
                        {
                            "playerId": 20689
                        }
                    ],
                    "out": [
                        {
                            "playerId": 415809
                        },
                        {
                            "playerId": 449978
                        },
                        {
                            "playerId": 21006
                        }
                    ]
                },
                "1461": {
                    "in": [
                        {
                            "playerId": 449472
                        },
                        {
                            "playerId": 20446
                        }
                    ],
                    "out": [
                        {
                            "playerId": 239298
                        },
                        {
                            "playerId": 237057
                        }
                    ]
                }
            }
        }
    }
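
To make the structure above concrete, here is a minimal sketch that flattens it into sortable rows. It assumes the nesting is team ID, then period, then seconds, and that the "in" and "out" lists are paired by position; neither is confirmed here. `flatten_substitutions`, `PERIOD_ORDER` and the "1H" label are hypothetical names introduced for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical ordering of period labels; only "2H" appears in the excerpt above.
PERIOD_ORDER = {"1H": 1, "2H": 2}

# (period, second, team_id, player_out_id, player_in_id)
SubRow = Tuple[str, int, str, int, int]


def flatten_substitutions(substitutions: Dict) -> List[SubRow]:
    """Flatten the nested mapping into rows ordered by period and second."""
    rows: List[SubRow] = []
    for team_id, periods in substitutions.items():
        for period, by_second in periods.items():
            for second, change in by_second.items():
                # Assumption: the n-th player in "in" replaces the n-th in "out".
                for p_in, p_out in zip(change["in"], change["out"]):
                    rows.append(
                        (period, int(second), team_id, p_out["playerId"], p_in["playerId"])
                    )
    # Unknown period labels sort last instead of raising a KeyError.
    rows.sort(key=lambda r: (PERIOD_ORDER.get(r[0], 99), r[1]))
    return rows
```

Each row could then be turned into one substitution event and placed between the neighbouring events by period and second.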