Yes, that makes sense! I was thinking the record-to-pandas function could be exposed so it's easier to do composition, but that makes it harder to use.
Better to keep that "magic" hidden from the end user.
Could you create a pull request for this?
Yes, I'll try.
What do you want the argument to be called? (I don't love custom_fields.)
I would probably go for "additional_columns" or "custom_columns". And it might be useful to let it also accept scalar values, in case someone would like to add static data, like a match id.
Alright, so I think I got it working, but I can't run the tests successfully (although I can't run the tests on master either, so I don't think I broke anything).
Are there any problems with the tests?
You can check my changes here
This runs as expected:
```python
from kloppy import datasets, to_pandas

dataset = datasets.load("statsbomb")
get_first_related_event = lambda x: x.raw_event['related_events'][0] if 'related_events' in x.raw_event else None
to_pandas(dataset, additional_columns={'related_event': get_first_related_event, 'match': 'test1'})

dataset = datasets.load("metrica_tracking", options={"sample_rate": 1.0 / 12, "limit": 10})
to_pandas(dataset, additional_columns={'attacking_direction': lambda x: dataset.frames[0].period.attacking_direction.value, 'match': 'game1'})
```
Thanks for the pull request!
Do you get any error message when you try to run the tests? Running "pytest" in the root directory should run all tests.
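For example, `pytest kloppy/tests/test_opta.py -vv` (standard pytest options) runs a single test file and shows full assertion diffs, which can help pin down exactly which values differ.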
Some feedback on the code:
Edit: I did a double check.
What do you think of something like this? It keeps 1) the interface of both `_frame_to_pandas_row_converter` and `_event_to_pandas_row_converter` untouched, and 2) makes it possible to pass both `_record_converter` and `additional_columns`.
```python
# (inside to_pandas, after the default _record_converter has been resolved)
def generic_record_converter(record: Union[Frame, Event]):
    row = _record_converter(record)
    if additional_columns:
        for k, v in additional_columns.items():
            # Callables receive the record; any other value is used as-is.
            if callable(v):
                value = v(record)
            else:
                value = v
            row.update({k: value})
    return row

return pd.DataFrame.from_records(map(generic_record_converter, dataset.records))
```
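To illustrate point 2, here is a rough usage sketch combining both arguments; the custom converter and the `event_id` attribute are assumptions for illustration, not code from the library:
```python
# Hypothetical usage: a custom base converter plus static and derived columns.
def my_converter(event):
    return {"event_id": event.event_id}  # attribute name assumed

df = to_pandas(
    dataset,
    _record_converter=my_converter,
    additional_columns={
        "match": "game1",                 # scalar: broadcast to every row
        "period": lambda e: e.period.id,  # callable: evaluated per record
    },
)
```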
The error I get when running the tests:
```
====================================================== FAILURES =======================================================
________________________________________ TestOpta.test_correct_deserialization ________________________________________

self = <kloppy.tests.test_opta.TestOpta object at 0x000001B6FFA6DF48>

    def test_correct_deserialization(self):
        base_dir = os.path.dirname(__file__)
        serializer = OptaSerializer()
        with open(f"{base_dir}/files/opta_f24.xml", "rb") as f24_data, open(
            f"{base_dir}/files/opta_f7.xml", "rb"
        ) as f7_data:
            dataset = serializer.deserialize(
                inputs={"f24_data": f24_data, "f7_data": f7_data}
            )
        assert len(dataset.events) == 17
        assert len(dataset.periods) == 2
        assert dataset.orientation == Orientation.ACTION_EXECUTING_TEAM
>       assert dataset.periods[0] == Period(
            id=1,
            start_timestamp=1537707733.608,
            end_timestamp=1537710501.222,
            attacking_direction=AttackingDirection.NOT_SET,
        )
E       AssertionError: assert Period(id=1, ...T: 'not-set'>) == Period(id=1, ...T: 'not-set'>)
E         Omitting 2 identical items, use -vv to show
E         Differing attributes:
E         start_timestamp: 1537711333.608 != 1537707733.608
E         end_timestamp: 1537714101.222 != 1537710501.222

kloppy\tests\test_opta.py:24: AssertionError
============================================ 1 failed, 12 passed in 3.62s =============================================
```
Right, that looks like an issue with timezones: both reported values are off by exactly 3600 seconds, i.e. one hour. You can ignore it for now and I'll fix it.
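A minimal sketch of that failure mode, assuming the usual cause (parsing a naive timestamp string and converting it to an epoch using the machine's local timezone); this is not kloppy's actual parsing code:
```python
from datetime import datetime, timezone

# A naive datetime carries no tzinfo, so .timestamp() interprets it in the
# machine's *local* timezone: the resulting epoch shifts per machine.
naive = datetime.strptime("2018-09-23T12:22:13.608", "%Y-%m-%dT%H:%M:%S.%f")
local_epoch = naive.timestamp()  # differs between machines

# Attaching an explicit timezone makes the epoch machine-independent.
utc_epoch = naive.replace(tzinfo=timezone.utc).timestamp()

# On a UTC+1 machine, local_epoch == utc_epoch - 3600.
```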
OK. I'll address your suggestions and make a pull request.
Great! If you need any help please let me know
Submitted: https://github.com/PySport/kloppy/pull/31
My use case: using StatsBomb data, I wanted to know which player was pressured in each pressure event.
To solve that I used a custom `_record_converter` when I called `to_pandas`, similar to `_event_to_pandas_row_converter` but with an extra line in the row dict (see the sketch below). It worked perfectly (I then joined on the event_id to get the player id from the related event).
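A rough reconstruction of that converter (the function name is made up, and `_event_to_pandas_row_converter` is kloppy's internal helper, assumed to be in scope):
```python
# Hypothetical sketch: wrap the built-in event row converter and add one
# extra column holding the first related event from the raw StatsBomb JSON.
def pressure_row_converter(event):
    row = _event_to_pandas_row_converter(event)
    row['related_event'] = (
        event.raw_event['related_events'][0]
        if 'related_events' in event.raw_event
        else None
    )
    return row

df = to_pandas(dataset, _record_converter=pressure_row_converter)
```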
The ideal solution would be to allow custom functions that add extra columns to the DataFrame. Something like this:
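A sketch of the interface I have in mind (the argument name here is just a placeholder):
```python
# Proposed: a mapping of new column names to per-record callables.
to_pandas(
    dataset,
    additional_columns={
        'related_event': lambda event: (
            event.raw_event['related_events'][0]
            if 'related_events' in event.raw_event
            else None
        ),
    },
)
```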
Would this make sense?