PySport / kloppy

kloppy: standardizing soccer tracking- and event data
https://kloppy.pysport.org
BSD 3-Clause "New" or "Revised" License

[IO] Improved IO with support for reading data from compressed files #308

Closed: probberechts closed this 2 weeks ago

probberechts commented 2 months ago

It is common practice to store data as compressed files to reduce storage requirements. With this PR, you no longer need to decompress a file before loading its data with kloppy.

from kloppy import statsperform

dataset = statsperform.load(
    raw_data="ma25_tracking.txt.gz",
    meta_data="ma1_metadata.xml.gz",
)

Whether a file is compressed is derived from the file's extension. The extensions ".gz", ".xz" and ".bz2" are currently supported.
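
To illustrate the extension-based detection, here is a minimal sketch using only the standard library. It is not the code in this PR: the names `_OPENERS` and `open_maybe_compressed` are hypothetical, and the sketch assumes the last suffix of the path alone decides which codec is used.

```python
import bz2
import gzip
import lzma
from pathlib import Path

# Hypothetical mapping from file extension to an opener (not kloppy's actual code).
_OPENERS = {
    ".gz": gzip.open,
    ".xz": lzma.open,
    ".bz2": bz2.open,
}


def open_maybe_compressed(path, mode="rb"):
    """Open `path`, decompressing on the fly if its extension is recognized."""
    # Path.suffix returns only the last extension, e.g. ".gz" for "data.xml.gz".
    opener = _OPENERS.get(Path(path).suffix.lower(), open)
    return opener(path, mode)
```

Files with any other extension fall through to the built-in `open`, so uncompressed inputs keep working unchanged.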

koenvo commented 2 months ago

This should also work for non-local files, right? Like https://some-url.com/file.xml.gz

koenvo commented 1 month ago

Can you merge master in, please, to make sure the tests run again?

probberechts commented 1 month ago

I couldn't get boto (to mock an S3 bucket) to work on GitHub Actions. In the most recent version, there is this bug, and for older versions I can't find a set of version constraints between s3fs and boto that works on every Python version. Hence, I propose disabling these tests until the bug is fixed.

I also recently found the xopen library for opening compressed files. We could use it as a more efficient and robust replacement for the _open method that I implemented. Do you think it is worth adding another dependency? It could also be an optional one.
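
To make the optional-dependency idea concrete, here is a rough sketch of how xopen could be used when installed, with a standard-library fallback otherwise. The helper name `open_gz` is hypothetical and only gzip is handled, to keep the example short. This is not how the PR is implemented.

```python
import gzip

try:
    from xopen import xopen  # optional third-party dependency
except ImportError:
    xopen = None  # fall back to the standard library below


def open_gz(path):
    """Open a gzip file for binary reading, preferring xopen when available."""
    if xopen is not None:
        return xopen(path, mode="rb")
    return gzip.open(path, "rb")
```

Callers get a readable binary file object either way, so the dependency stays optional.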

koenvo commented 2 weeks ago

Thanks Pieter, great work!