Allow users to pass a function to add extra columns to `to_pandas` output

PySport / kloppy

kloppy: standardizing soccer tracking- and event data

https://kloppy.pysport.org

BSD 3-Clause "New" or "Revised" License

356 stars 56 forks source link

Allow users to pass a function to add extra columns to `to_pandas` output #29

Closed rjtavares closed 4 years ago

rjtavares commented 4 years ago

My use case: using Statsbomb data, I wanted to know which player was pressured in each pressure events.

To solve that I used a custom _record_converter when I called to_pandas, similar to _event_to_pandas_row_converter but with an extra line in the row dict:

first_related_event = event.raw_event['related_events'][0] if 'related_events' in event.raw_event else None,

It worked perfectly (I then joined by the event_id to get the player id from the related event).

The ideal solution would be to allow custom functions to build extra columns to the DataFrame. Something like this:

get_related_event = lambda x: x.raw_event['related_events'][0] if 'related_events' in x.raw_event else None
to_pandas(dataset, custom_fields={'first_related_event': get_related_event})

Would this make sense?

koenvo commented 4 years ago

Yes that makes sense! I was thinking the record to pandas function could be exposes so it’s easier to do composition, but that makes it harder to use.

Better to do that “magic” hidden for the end user.

Could you create a pull request for this?

rjtavares commented 4 years ago

Yes, I'll try.

What you want the argument to be called (I don't love custom_fields)?

koenvo commented 4 years ago

I would probably go for “additional_columns”, or “custom_columns”. And it might be useful to let it also accept scalar values in case someone would like to add static data - like match id.

rjtavares commented 4 years ago

Alright, so I think I got it working, but I can't run the tests successfully (although I can't run the tests on the master either, I don't think I broke anything).

Are there any problems with the tests?

You can check my changes here

This runs as expected:

dataset = datasets.load("statsbomb")
get_first_related_event = lambda x: x.raw_event['related_events'][0] if 'related_events' in x.raw_event else None
to_pandas(dataset, additional_columns={'related_event': get_first_related_event, 'match': 'test1'})

dataset = datasets.load("metrica_tracking", options={"sample_rate": 1.0 / 12, "limit": 10})
to_pandas(dataset, additional_columns={'attacking_direction': lambda x: dataset.frames[0].period.attacking_direction.value, 'match': 'game1'})

koenvo commented 4 years ago

Thanks for the Pull request!

Do you get any error message when you try to run the tests? When you run "pytest" in the root it should be able to run all tests.

Some feedback on the code: [edit]i did a double check[/edit] What do you think of something like? This keeps 1) the interface of both _frame_to_pandas_row_converter and _event_to_pandas_row_converter untouched 2) makes it possible to pass _record_converter and additional_columns.


    def generic_record_converter(record: Union[Frame, Event]):
        row = _record_converter(record)
        if additional_columns:
            for k, v in additional_columns.items():
            if callable(v):
                value = v(record)
            else:
                value = v
            row.update({k: value})  

        return row

    return pd.DataFrame.from_records(map(generic_record_converter, dataset.records))

rjtavares commented 4 years ago

The error I get when running the tests:

====================================================== FAILURES =======================================================
________________________________________ TestOpta.test_correct_deserialization ________________________________________

self = <kloppy.tests.test_opta.TestOpta object at 0x000001B6FFA6DF48>

    def test_correct_deserialization(self):
        base_dir = os.path.dirname(__file__)

        serializer = OptaSerializer()

        with open(f"{base_dir}/files/opta_f24.xml", "rb") as f24_data, open(
            f"{base_dir}/files/opta_f7.xml", "rb"
        ) as f7_data:

            dataset = serializer.deserialize(
                inputs={"f24_data": f24_data, "f7_data": f7_data}
            )

        assert len(dataset.events) == 17
        assert len(dataset.periods) == 2
        assert dataset.orientation == Orientation.ACTION_EXECUTING_TEAM
>       assert dataset.periods[0] == Period(
            id=1,
            start_timestamp=1537707733.608,
            end_timestamp=1537710501.222,
            attacking_direction=AttackingDirection.NOT_SET,
        )
E       AssertionError: assert Period(id=1, ...T: 'not-set'>) == Period(id=1, ...T: 'not-set'>)
E         Omitting 2 identical items, use -vv to show
E         Differing attributes:
E         start_timestamp: 1537711333.608 != 1537707733.608
E         end_timestamp: 1537714101.222 != 1537710501.222

kloppy\tests\test_opta.py:24: AssertionError
============================================ 1 failed, 12 passed in 3.62s =============================================

koenvo commented 4 years ago

Right. That looks like an issue with timezones. You ignore it for now and I’ll fix it.

rjtavares commented 4 years ago

OK. I'll address your suggestions and make a pull request.

koenvo commented 4 years ago

Great! If you need any help please let me know

rjtavares commented 4 years ago

Submitted: https://github.com/PySport/kloppy/pull/31