PySport / kloppy

kloppy: standardizing soccer tracking- and event data
https://kloppy.pysport.org
BSD 3-Clause "New" or "Revised" License

ADD option to convert all coordinates to same dimensions #42

Closed koenvo closed 3 years ago

koenvo commented 4 years ago

To make it easier for people to get started without having to think about pitch dimensions, we should add a global option to convert all coordinates to the same dimensions (meters, for example).

Related twitter thread

https://mobile.twitter.com/lemonwatcher/status/1285854097233633281

bdagnino commented 3 years ago

@koenvo what is the best option to adopt as the default one in your opinion?

I like the TRACAB one because units are in meters, but it makes it difficult to know, for example, when a player is in the last third. On the other hand, a system like Metrica's or StatsBomb's has a fixed size, which makes it easy to know positions on the field, but if you want to compute speeds and distances you have to look up the specific dimensions of the field.

I think I favor the Metrica (what a surprise!) / Stats approach. But curious what your take is on this.


koenvo commented 3 years ago

Two approaches:

  1. Use meters
  2. Use 100x100

Option 1 makes it easier to calculate real distances and angles. Option 2 makes it easier to calculate and compare things like distance to goal: you can use `if 40 < point.y < 60:`. It is also easier to run queries over multiple datasets, because you don't need metadata (such as where the box is).

Option 1: when you want distance to goal / "is in box", you need the coordinates of the box (in meters).

Option 2: when you want a precise notion of whether the ball is in the box, you need the coordinates of the box (in 100x100 coordinates).

So.

Option 1: Pro: more precise. Con: you always need metadata to calculate "is in box" / distance to goal.

Option 2: Pro: easier to work with. Con: less precise.

I would prefer option 2: we can always attach metadata to go from 100x100 dimensions to real ones to calculate precise data, and it allows people to get started without hassle. Most people don't care about the extra precision, and those who do can still get it.

So:

Prefer ease over precision
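The trade-off between the two options can be sketched in a few lines. This is only an illustration: `Point`, `PitchSize`, `in_central_channel` and `distance_m` are hypothetical names, not part of kloppy, and the 100x100 grid is assumed to span the full pitch in both directions.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    """A position on a 100x100 grid (hypothetical helper type)."""
    x: float
    y: float


@dataclass(frozen=True)
class PitchSize:
    """Real pitch dimensions in meters (the metadata option 2 needs for distances)."""
    length_m: float
    width_m: float


def in_central_channel(p: Point) -> bool:
    # Zone checks need no pitch metadata on a 100x100 grid.
    return 40 < p.y < 60


def distance_m(a: Point, b: Point, pitch: PitchSize) -> float:
    # Real distances require rescaling each axis by the actual pitch size,
    # because one grid unit is not the same length along x and y.
    dx = (a.x - b.x) / 100 * pitch.length_m
    dy = (a.y - b.y) / 100 * pitch.width_m
    return math.hypot(dx, dy)
```

With a 105 m by 68 m pitch, goal line to goal line along the middle (`Point(0, 50)` to `Point(100, 50)`) comes out at the pitch length, while the zone check works the same for any pitch size.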

bdagnino commented 3 years ago

@koenvo great, as discussed offline, we will go with a [0, 1] coordinate system. This favors ease over precision, and uses [0, 1] rather than [0, 100] so that it can't lead users to believe the values are in meters.

About the specific implementation: there are two ways (that I can see) in which we can do it.

  1. One way would be to do it on each frame at the moment the data is loaded (inside each serializer). We could use a version of transform_frame that takes in the Metadata and a Frame and returns a normalized Frame, depending on the option given to the serializer.
  2. The other option could be to do it via a transform_dataset after the data has been loaded. If we want to make it the default it would have to be done inside the serializer as well, but it can be done once the data has been loaded.
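The two approaches above can be sketched roughly as follows. This is a minimal sketch under stated assumptions: `SimpleMetadata`, `SimpleFrame`, `normalize_frame`, `load_frames` and `normalize_dataset` are hypothetical stand-ins, not kloppy's actual classes or function signatures.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class SimpleMetadata:
    """Hypothetical stand-in: the coordinate range used by the provider."""
    x_min: float
    x_max: float
    y_min: float
    y_max: float


@dataclass(frozen=True)
class SimpleFrame:
    """Hypothetical stand-in: one tracking frame with (x, y) per player id."""
    timestamp: float
    positions: Dict[str, Tuple[float, float]]


def normalize_frame(metadata: SimpleMetadata, frame: SimpleFrame) -> SimpleFrame:
    # Map every coordinate from the provider range to [0, 1].
    x_range = metadata.x_max - metadata.x_min
    y_range = metadata.y_max - metadata.y_min
    return SimpleFrame(
        timestamp=frame.timestamp,
        positions={
            player_id: ((x - metadata.x_min) / x_range, (y - metadata.y_min) / y_range)
            for player_id, (x, y) in frame.positions.items()
        },
    )


def load_frames(
    raw_frames: Iterable[SimpleFrame],
    metadata: SimpleMetadata,
    normalize: bool = False,
) -> List[SimpleFrame]:
    # Option 1: normalize each frame while it is being read (single pass).
    return [normalize_frame(metadata, f) if normalize else f for f in raw_frames]


def normalize_dataset(
    metadata: SimpleMetadata, frames: Iterable[SimpleFrame]
) -> List[SimpleFrame]:
    # Option 2: a second pass over an already loaded dataset.
    return [normalize_frame(metadata, f) for f in frames]
```

Both paths reuse the same per-frame function, so `load_frames(raw, meta, normalize=True)` and `normalize_dataset(meta, load_frames(raw, meta))` should give the same result; the difference is only whether the extra loop over all records happens.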

The second option is clearer IMO, and also allows the same method to be used after the data was loaded if you didn't do it by default. However, it involves looping over all records at the end of the serialization process.

The first option is more efficient, but may be a bit less transparent about what is going on.

I'm more inclined to do the first one, with the possibility to also apply it to the whole dataset after it's loaded.

What do you think? Also, is there a better way to do it other than these two?

koenvo commented 3 years ago

[draft version]

I think it’s related to https://github.com/PySport/kloppy/issues/10

The idea is that you define a pipeline (only mapping, no filtering, in the case of this issue) that can be applied during loading and afterwards. Applying it afterwards would require a ‘transform_dataset’ that runs all records through the pipeline. Especially for tracking data, it saves quite some time to do it during loading (I have to profile this: is it copying objects that makes it slow, rather than looping over all frames?).
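As a rough illustration of such a mapping-only pipeline, the sketch below composes per-record steps into a single callable that could be handed to a loader or applied afterwards. All names (`make_pipeline`, `scale_to_unit`, `flip_y`) are hypothetical and records are reduced to plain `(x, y)` tuples.

```python
from typing import Callable, Tuple

Record = Tuple[float, float]          # simplified record: just (x, y)
Step = Callable[[Record], Record]     # a mapping step never drops records


def make_pipeline(*steps: Step) -> Step:
    """Compose mapping steps into one callable, applied left to right."""
    def run(record: Record) -> Record:
        for step in steps:
            record = step(record)
        return record
    return run


def scale_to_unit(length: float, width: float) -> Step:
    # Map coordinates given in pitch units (e.g. meters) to [0, 1].
    return lambda record: (record[0] / length, record[1] / width)


def flip_y(record: Record) -> Record:
    # Mirror the vertical axis of a [0, 1] coordinate system.
    return (record[0], 1 - record[1])


pipeline = make_pipeline(scale_to_unit(105.0, 68.0), flip_y)
```

Because the pipeline is just a function from record to record, the same object can be called inside a serializer's read loop or in a list comprehension over an already loaded dataset.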

Another relevant question I received on Twitter: “I have noticed that all the dimensions don't scale identically. For example, StatsBomb has 120x80 and Opta has 100x100 for the overall field dimensions, so I can use simple scalings to change one to another. However, on doing so, some of the other lengths don't match. For example, applying the above scaling, the penalty areas become different sizes and the penalty spot ends up in a different place.”

We can also implement a function that scales to [0, 1] but with a fixed penalty box size, like StatsBomb. But that can be an addition later on.
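One way to handle fixed pitch markings is a piecewise-linear mapping between landmarks instead of a single scale factor. The sketch below is an illustration, not a proposed kloppy API: the function names are hypothetical, and the landmark values (penalty-box edge at x = 18 on a 120-long StatsBomb pitch, x = 17 on a 100-long Opta pitch) are assumptions that should be checked against each provider's documentation.

```python
from bisect import bisect_right
from typing import Sequence

# Assumed x-landmarks: goal line, penalty-box edge, halfway line,
# opposite penalty-box edge, opposite goal line.
STATSBOMB_X = [0.0, 18.0, 60.0, 102.0, 120.0]
OPTA_X = [0.0, 17.0, 50.0, 83.0, 100.0]


def linear_map(value: float, src_length: float, dst_length: float) -> float:
    # Simple scaling: correct for the pitch outline, wrong for fixed markings.
    return value / src_length * dst_length


def piecewise_map(value: float, src: Sequence[float], dst: Sequence[float]) -> float:
    """Interpolate linearly within the segment between two landmarks."""
    if not src[0] <= value <= src[-1]:
        raise ValueError(f"{value} is outside [{src[0]}, {src[-1]}]")
    # Index of the segment containing value; clamp so the last landmark
    # falls into the final segment.
    i = min(bisect_right(src, value) - 1, len(src) - 2)
    t = (value - src[i]) / (src[i + 1] - src[i])
    return dst[i] + t * (dst[i + 1] - dst[i])
```

Under these assumed landmarks, `linear_map(18.0, 120.0, 100.0)` places the StatsBomb penalty-box edge at roughly 15, short of Opta's 17, whereas `piecewise_map(18.0, STATSBOMB_X, OPTA_X)` lands exactly on the Opta box edge. The same approach would apply to the y-axis with its own landmark lists.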