PySport / kloppy

kloppy: standardizing soccer tracking- and event data
https://kloppy.pysport.org
BSD 3-Clause "New" or "Revised" License

[Opta] Ordering of events #267

Open probberechts opened 6 months ago

probberechts commented 6 months ago

I noticed that Opta events can sometimes be slightly out of order. The F24 docs specify that the following attributes (in the given order) should be used to order each team's match events chronologically:

[image: F24 documentation excerpt listing the ordering attributes]

Sorting by timestamp alone does not always give the same result. For example:

<Event id="1889768843" event_id="358" type_id="1" period_id="1" min="32" sec="3" player_id="59062" team_id="174" outcome="0" x="21.6" y="39.2" timestamp="2018-08-20T21:32:27.98" last_modified="2018-08-20T21:32:28" version="1534797148460"></Event>         
<Event id="1592827425" event_id="228" type_id="1" period_id="1" min="32" sec="4" player_id="80908" team_id="957" outcome="0" x="60.4" y="52.0" timestamp="2018-08-20T21:32:27.635" last_modified="2018-08-21T16:43:18" version="1534866198424"></Event>

Since the Opta deserializer currently only parses the "timestamp" field, it does not seem possible to order events chronologically.

koenvo commented 6 months ago

Are there any details on how to sort correctly while maintaining millisecond precision?

A solution could be to derive the timestamp from the “min” and “sec” attributes, but then we lose the precision.

probberechts commented 6 months ago

My documentation doesn't mention the precision of the "timestamp" field. However, my version of the documentation is extremely outdated. Maybe @JanVanHaaren has something more up-to-date.

I find it strange that the "timestamp" field does not align with the "min" and "sec" fields. If the precision of the "timestamp" field were lower than that of the "min" and "sec" fields, I don't see why we would infer an (incorrect) millisecond precision from it.

probberechts commented 6 months ago

Looking at a few more timestamps, I now realize that Opta does not add leading zeros to the milliseconds. So, "2018-08-20T21:32:27.98" is actually "2018-08-20T21:32:27.098000".

Python's %f directive pads with zeros on the right, whereas the Opta timestamp needs zeros padded on the left. Adapting the timestamp parser accordingly should fix this.

> %f is an extension to the set of format characters in the C standard (but implemented separately in datetime objects, and therefore always available). When used with the strptime() method, the %f directive accepts from one to six digits and zero pads on the right.
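
The left-padding fix could be sketched as follows. This is a minimal sketch, not kloppy's actual parser: the function name `parse_opta_timestamp` is hypothetical, and it assumes the fractional part is a millisecond count of at most three digits.

```python
from datetime import datetime


def parse_opta_timestamp(raw: str) -> datetime:
    """Parse an Opta timestamp whose fraction is unpadded milliseconds."""
    # Split "2018-08-20T21:32:27.98" into the whole-second part and "98".
    whole, _, fraction = raw.partition(".")
    parsed = datetime.strptime(whole, "%Y-%m-%dT%H:%M:%S")
    # Assumption: the fraction is a millisecond count (0-999), so "98" is 98 ms.
    milliseconds = int(fraction) if fraction else 0
    return parsed.replace(microsecond=milliseconds * 1000)


first = parse_opta_timestamp("2018-08-20T21:32:27.98")
second = parse_opta_timestamp("2018-08-20T21:32:27.635")

# For comparison: strptime's %f pads on the right and reads ".98" as 980 ms.
naive = datetime.strptime("2018-08-20T21:32:27.98", "%Y-%m-%dT%H:%M:%S.%f")
```

With left padding, `first` (98 ms) sorts before `second` (635 ms), which matches the order implied by sec="3" and sec="4". With `%f`, the first event lands at 980 ms and sorts after the second.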

JanVanHaaren commented 6 months ago

The min and sec fields on one hand and the timestamp field on the other hand provide different pieces of information about an event. The min and sec fields provide the game time in minutes and seconds when the event occurred, whereas the timestamp field provides the date and time when the event was logged in UK time. Hence, the timestamp field can be used as a tie-breaker to order events but not to derive the time when the event occurred in the match.

Documentation Opta F24

Documentation Stats Perform MA3

probberechts commented 6 months ago

So, to conclude, would it be okay to fill the "timestamp" field in Kloppy with min + sec and order events based on min + sec + timestamp?
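
A rough sketch of that ordering, using the two events from the example earlier in this thread (attributes trimmed). The exact attribute list prescribed by the F24 docs is not reproduced in this thread, so the key below (period, min, sec, then timestamp as tie-breaker) is an assumption based on this proposal. The helper names `logged_at` and `sort_key` are hypothetical, and the timestamp fraction is assumed to be unpadded milliseconds.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# The two events from the example, deliberately listed in the wrong order.
XML = """<Game>
<Event id="1592827425" period_id="1" min="32" sec="4" team_id="957" timestamp="2018-08-20T21:32:27.635" />
<Event id="1889768843" period_id="1" min="32" sec="3" team_id="174" timestamp="2018-08-20T21:32:27.98" />
</Game>"""


def logged_at(raw: str) -> datetime:
    """Parse the timestamp, treating the fraction as unpadded milliseconds."""
    whole, _, fraction = raw.partition(".")
    milliseconds = int(fraction) if fraction else 0
    parsed = datetime.strptime(whole, "%Y-%m-%dT%H:%M:%S")
    return parsed.replace(microsecond=milliseconds * 1000)


def sort_key(event: ET.Element):
    # Game clock first; the logging timestamp only breaks ties.
    return (
        int(event.get("period_id")),
        int(event.get("min")),
        int(event.get("sec")),
        logged_at(event.get("timestamp")),
    )


events = sorted(ET.fromstring(XML).iter("Event"), key=sort_key)
ordered_ids = [event.get("id") for event in events]
```

Here the event at 32:03 comes out before the event at 32:04, regardless of how the timestamps compare.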

JanVanHaaren commented 6 months ago

That suggestion sounds good to me. The Wyscout V3 deserializer also fills the timestamp field based on the minute and second fields, although it would probably be better to use the provided matchTimestamp field. The StatsBomb deserializer uses the provided timestamp.

Should we explicitly store a sequence number for each event as well? StatsBomb and Wyscout explicitly provide a sequence number in the index and eventIndex fields, respectively.

probberechts commented 6 months ago

> Should we explicitly store a sequence number for each event as well? StatsBomb and Wyscout explicitly provide a sequence number in the index and eventIndex fields, respectively.

I would just make sure that the records in a dataset are chronologically ordered. Storing a sequence number then does not provide any added value since you would be able to infer it from the position in the list of records.
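
To illustrate the point, a small sketch with a hypothetical `Record` type (not kloppy's event model), assuming the list is already in chronological order:

```python
from typing import NamedTuple


class Record(NamedTuple):
    # Hypothetical stand-in for an event record.
    event_id: str
    period_id: int
    minute: int
    second: int


# Assumed to be chronologically ordered already.
records = [
    Record("1889768843", 1, 32, 3),
    Record("1592827425", 1, 32, 4),
]

# The position in the list doubles as the sequence number.
sequence_numbers = {
    record.event_id: index for index, record in enumerate(records)
}
```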

koenvo commented 6 months ago

Small question about the timestamp vs min/sec: when the record is not altered afterwards, does the timestamp match the min/sec? So only when the record is altered does the timestamp lose value, correct?

JanVanHaaren commented 6 months ago

> Small question about the timestamp vs min/sec: when the record is not altered afterwards, does the timestamp match the min/sec? so only when the record is altered the timestamp loses value, correct?

My understanding is that the timestamp field is never updated. The timestamp field reflects the time when the event was initially entered in the database and the last_modified field reflects the time when the event was last updated in the database.

I suspect that the timestamp field is reasonably accurate for events that are recorded live. However, not all event data is recorded live and events can occasionally be inserted at a later time during the match or even after the match.

probberechts commented 6 months ago

However, according to my old documentation, the timestamp field reflects the time at which the event occurred within the match. 😕

[image: excerpt from the older documentation describing the timestamp field]

JanVanHaaren commented 6 months ago

I will contact the Stats Perform support desk. The official documentation is confusing.

Documentation website

JanVanHaaren commented 6 months ago

I haven't heard back yet from Stats Perform, but I think I finally understand how the timestamps work. I suspect the meaning of the timestamp field depends on the coverage level. The event timestamps are detailed to the millisecond for some but not all coverage levels.

For example, the event data for this friendly match between Salzburg and Ried has coverage level 14. The game took place on 12 October 2023, but the timestamp for the kick-off event is 2023-10-15T08:49:39.373Z.

{
    "id": "9130ocq9mdrosrd4mv7a666tw",
    "coverageLevel": "14",
    "date": "2023-10-12Z",
    "time": "12:00:00Z",
    "localDate": "2023-10-12",
    "localTime": "14:00:00",
    "numberOfPeriods": 2,
    "periodLength": 45,
    "overtimeLength": 15,
    "lastUpdated": "2023-11-25T12:46:38Z",
    "description": "Salzburg vs Ried",
    ...
},
{
    "id": 2604454267,
    "eventId": 3,
    "typeId": 1,
    "periodId": 1,
    "timeMin": 0,
    "timeSec": 0,
    "contestantId": "do3l4dhs0ooog6se728jxc06z",
    "playerId": "3rmiekqhf431q783nhdc2m12h",
    "playerName": "W. Eza",
    "outcome": 1,
    "x": 49.8,
    "y": 50.0,
    "timeStamp": "2023-10-15T08:49:39.373Z",
    "lastModified": "2023-10-16T00:39:15Z",
    "qualifier": [
        ...
    ]
},
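
One way to flag such events is to compare the logging time against the scheduled kick-off. This is only a sketch: the helper names are hypothetical, the four-hour window is an arbitrary threshold chosen for illustration, and the fraction of timeStamp is assumed to be a millisecond count.

```python
from datetime import datetime, timedelta, timezone


def parse_utc(raw: str) -> datetime:
    """Parse 'YYYY-MM-DDTHH:MM:SS[.mmm]Z' into an aware UTC datetime."""
    whole, _, fraction = raw.rstrip("Z").partition(".")
    parsed = datetime.strptime(whole, "%Y-%m-%dT%H:%M:%S")
    # Assumption: the fraction, if present, is a millisecond count.
    milliseconds = int(fraction) if fraction else 0
    return parsed.replace(microsecond=milliseconds * 1000, tzinfo=timezone.utc)


def logged_live(
    kick_off: datetime,
    logged_at: datetime,
    window: timedelta = timedelta(hours=4),  # hypothetical threshold
) -> bool:
    """Return True if the event was logged within `window` after kick-off."""
    return timedelta(0) <= logged_at - kick_off <= window


# "date" and "time" from the match record, "timeStamp" from the event record.
kick_off = parse_utc("2023-10-12Z".rstrip("Z") + "T" + "12:00:00Z")
logged_at = parse_utc("2023-10-15T08:49:39.373Z")

delay = logged_at - kick_off
is_live = logged_live(kick_off, logged_at)
```

For this match, `delay` is almost three days, so the kick-off event is flagged as not logged live.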

probberechts commented 6 months ago

The question is rather whether the timestamps can be used as a reliable way to measure the time that has elapsed since the "period start" event.
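
A possible way to probe this is to compare, for pairs of events in the same period, the gap on the game clock with the gap between the logged timestamps. The sketch below does so for the two events quoted at the top of this thread. The function names are hypothetical, and the timestamp fraction is assumed to be unpadded milliseconds.

```python
from datetime import datetime, timedelta


def parse_logged(raw: str) -> datetime:
    """Parse the timestamp, treating the fraction as unpadded milliseconds."""
    whole, _, fraction = raw.partition(".")
    milliseconds = int(fraction) if fraction else 0
    parsed = datetime.strptime(whole, "%Y-%m-%dT%H:%M:%S")
    return parsed.replace(microsecond=milliseconds * 1000)


def clock_gap(a: dict, b: dict) -> timedelta:
    """Game-clock time from event a to event b (one-second resolution)."""
    return timedelta(minutes=b["min"] - a["min"], seconds=b["sec"] - a["sec"])


def logged_gap(a: dict, b: dict) -> timedelta:
    """Time between the moments at which a and b were logged."""
    return parse_logged(b["timestamp"]) - parse_logged(a["timestamp"])


a = {"min": 32, "sec": 3, "timestamp": "2018-08-20T21:32:27.98"}
b = {"min": 32, "sec": 4, "timestamp": "2018-08-20T21:32:27.635"}

# Difference between the two gaps; small values suggest the timestamps
# track the match clock, large values suggest events logged after the fact.
drift = logged_gap(a, b) - clock_gap(a, b)
```

Because min and sec only have one-second resolution, a drift of up to about a second per pair is to be expected even when the timestamps are accurate.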

JanVanHaaren commented 6 months ago

I don't know yet, but my feeling is that it should be possible for the highest coverage levels. I'll investigate a few more matches. Unfortunately, I don't have access to much event data that was collected at lower coverage levels.