Closed koenvo closed 3 years ago
@koenvo what is the best option to adopt as the default one in your opinion?
I like the TRACAB one because units are in meters, but it's difficult to know, for example, when a player is in the final third. On the other hand, a system like Metrica's or StatsBomb's has a fixed size, so knowing positions on the field is easy, but if you want to compute speeds and distances you have to look up the specific dimensions of the field.
I think I favor the Metrica (what a surprise!) / Stats approach. But curious what your take is on this.
Two approaches:

- Option 1 (meters): easier to calculate real distances and angles. But when you want distance to goal or "is in box" you need the coordinates of the box (in meters).
- Option 2 (fixed dimensions): easier to calculate and compare distance to goal etc. You can use `if 40 < point.y < 60:`. Easier to run queries over multiple datasets because you don't need metadata (of where the box is). When you need a precise notion of "ball is in box" you need the coordinates of the box in 100x100 coordinates.
So.
- Option 1: Pro: more precise. Con: always need metadata to calculate "is in box" / distance to goal.
- Option 2: Pro: easier to work with. Con: less precise.
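To make the trade-off concrete, here is a small sketch of both options. All names (`Point`, the helper functions) are hypothetical and for illustration only, not kloppy's API; Option 2 uses a 100x100 system as in the examples above.

```python
import math
from dataclasses import dataclass


@dataclass
class Point:
    x: float
    y: float


# Option 1: coordinates in meters. Precise, but landmark checks and
# distances always need the pitch dimensions (metadata).
def distance_to_goal_meters(p: Point, pitch_length: float, pitch_width: float) -> float:
    goal = Point(pitch_length, pitch_width / 2)
    return math.hypot(goal.x - p.x, goal.y - p.y)


# Option 2: normalized 100x100 coordinates. Landmark checks need no
# metadata at all...
def in_final_third(p: Point) -> bool:
    return p.x > 100 * 2 / 3


# ...but converting back to real distances still requires the pitch size.
def distance_to_goal_normalized(p: Point, pitch_length: float, pitch_width: float) -> float:
    goal = Point(100.0, 50.0)
    return math.hypot((goal.x - p.x) / 100 * pitch_length,
                      (goal.y - p.y) / 100 * pitch_width)
```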
I would prefer option 2: we can always attach metadata to go from the 100x100 dimensions back to the real ones and calculate precise values. But this lets people get started without hassle. Most people don't need the extra precision, and those who do can still get it.
So:
Prefer ease over precision
@koenvo great, as discussed off line, we will go with [0, 1] coordinate system. Favoring ease over precision, and with [0, 1] rather than [0, 100] so that it can't lead users to believe they are in meters.
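The conversion to and from the chosen [0, 1] system is a simple division by the pitch size. A minimal sketch, assuming metric input; the function names are illustrative, not kloppy's API:

```python
def to_unit_coordinates(x_m: float, y_m: float,
                        pitch_length: float, pitch_width: float) -> tuple:
    """Map metric pitch coordinates to the [0, 1] x [0, 1] system.

    Values in [0, 1] can't be mistaken for meters, which is exactly
    why [0, 1] was preferred over [0, 100].
    """
    return x_m / pitch_length, y_m / pitch_width


def to_metric_coordinates(x: float, y: float,
                          pitch_length: float, pitch_width: float) -> tuple:
    """Inverse mapping: attach pitch metadata to recover precise meters."""
    return x * pitch_length, y * pitch_width
```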
About the specific implementation: there are two ways (that I can see) in which we can do it.

- Option 1: a `transform_frame` function that takes in the Metadata and a Frame and returns a normalized Frame, applied inside the serializer depending on the option given to it.
- Option 2: a `transform_dataset` function applied after the data was loaded. If we want to make it the default it would have to be done inside the serializer as well, but it can also be run once the data is loaded.

The second option is clearer IMO, and also allows the same method to be used after the data was loaded if you didn't do it by default. However, it involves looping over all records at the end of the serialization process.
The first option is more efficient, but may be a bit less transparent about what is going on.
I'm more inclined to do the first one, with also the possibility to apply it to the whole dataset after it's loaded.
What do you think? Also, is there a better way to do it other than these two?
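The two implementation options could look roughly like this. All class and function names here are hypothetical sketches of the idea, not kloppy's actual classes:

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Metadata:
    pitch_length: float
    pitch_width: float


@dataclass(frozen=True)
class Frame:
    x: float  # meters in the provider's coordinate system
    y: float


def transform_frame(metadata: Metadata, frame: Frame) -> Frame:
    """Option 1: normalize a single frame inside the serializer loop."""
    return replace(frame,
                   x=frame.x / metadata.pitch_length,
                   y=frame.y / metadata.pitch_width)


def transform_dataset(metadata: Metadata, frames: List[Frame]) -> List[Frame]:
    """Option 2: one extra pass over all records after loading."""
    return [transform_frame(metadata, f) for f in frames]
```

Option 1 avoids the second pass (and the extra object copies) because normalization happens while each frame is being built; Option 2 keeps the serializer simple and reusable on already-loaded data.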
[draft version]
I think it’s related to https://github.com/PySport/kloppy/issues/10
Like you want to define a pipeline (only mapping, no filtering, in the case of this issue) that can be applied during loading and afterwards. Afterwards would require a `transform_dataset` that runs all records through the pipeline. Especially for tracking data it saves quite some time to do it during loading (I still have to profile this: is it the copying of objects that makes it slow, rather than the looping over all frames?)
Another relevant question I received on Twitter: "I have noticed that all the dimensions don't scale identically. For example, StatsBomb has 120x80 and Opta has 100x100 for the overall field dimensions, so I can use simple scalings to change one to another. However, on doing so, some of the other lengths don't match. For example, applying the above scaling, the penalty areas become of different sizes and the penalty spot becomes different."
We can also implement a function that scales to [0, 1] but with a fixed penalty box size, like StatsBomb. But that can be an addition later on.
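One way to keep landmarks aligned is a piecewise-linear rescaling that pins the penalty-box edges instead of applying one global scale. The anchor coordinates below (StatsBomb 120x80 with box edges at x=18 and x=102, Opta 100x100 with box edges at x=17 and x=83) are commonly cited provider values, used here as assumptions for illustration:

```python
def rescale_x(x_sb: float) -> float:
    """Map a StatsBomb x-coordinate onto Opta's x-axis, keeping the
    goal lines, penalty-box edges, and halfway line aligned.

    A single global scale (x * 100 / 120) would put StatsBomb's box edge
    at x=15 instead of Opta's x=17; interpolating between anchor points
    avoids that mismatch.
    """
    sb_anchors = [0.0, 18.0, 60.0, 102.0, 120.0]    # goal, box, halfway, box, goal
    opta_anchors = [0.0, 17.0, 50.0, 83.0, 100.0]
    for (a0, a1), (b0, b1) in zip(zip(sb_anchors, sb_anchors[1:]),
                                  zip(opta_anchors, opta_anchors[1:])):
        if x_sb <= a1:
            # Linear interpolation within this segment.
            return b0 + (x_sb - a0) / (a1 - a0) * (b1 - b0)
    return opta_anchors[-1]
```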
To make it easier for people to get started without having to think about pitch dimensions, we should add a global option to convert all coordinates to the same dimensions (meters, for example).
Related twitter thread
https://mobile.twitter.com/lemonwatcher/status/1285854097233633281