GoldenCheetah / OpenData

A project to collect, collate and share an open data set with contributions from users of the GoldenCheetah application

Python library for working with OpenData #2

Open AartGoossens opened 6 years ago

AartGoossens commented 6 years ago

Continuing this discussion here.

I am working on some Python code to make working with OpenData easier. It's far from finished (it only sort of works for my use case right now), but I would like to share it, and putting it in this repository makes sense. Before I spend more time on polishing it, I'd like some input on what the library should look like.

Features I would like to have in the library:

  1. View metadata of all athletes: currently the metadata lives in the blob for each athlete so you need to download all the data to view it. I propose to create a metadata file in the root of this repo that is updated every once in a while to reflect new/changed files in the OSF directory.
  2. Tool to selectively download data: Only download a specific athlete, or only athletes with specific data types, date ranges, amounts of data, etc. based on the metadata.
  3. Should return the activities in a general purpose data format. I propose to use a pandas.DataFrame for this (see the sketch after this list).
  4. Tool to make running computations on large amounts of activities easier: Not sure how to do this yet, but with the amount of data that's already in OpenData it's impossible to have it all in memory, so some clever batch processing is needed. I think some tooling might help there and has its place in this library.
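
To give a very rough idea of what points 1-3 could look like in practice (purely illustrative; the metadata file name and its columns are assumptions, not a real format):

```python
# Hypothetical usage sketch; "metadata.csv" and its columns are assumptions.
import pandas as pd

# Point 1: metadata for all athletes lives in one small file in this repo.
metadata = pd.read_csv("metadata.csv")

# Point 2: select athletes by available data types and amount of data.
with_power = metadata[
    metadata["data_types"].str.contains("power")
    & (metadata["n_activities"] >= 50)
]

# Point 3: an individual activity would then be returned as a
# pandas.DataFrame, e.g. one row per sample with columns such as
# power, heart rate and cadence.
```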

Any input is welcome!

liversedge commented 6 years ago

I think we should treat augmenting the raw data files with an index as a separate task, since we will likely publish the index rather than ask users to generate it.

Then we have separate tooling for retrieving and formatting data that uses the index to support filtering and so on.

As soon as the tooling starts to post-process data, I think it belongs in scikit-sports?

AartGoossens commented 6 years ago

I'm not sure if I understand what you mean, but assuming you're talking about my point (1): I'm not proposing to change the data in OSF, but I want to create a file in this repository with a 'summary' of the data that is available. This library should not do anything with the data; it should make accessing the data as easy as possible while returning it as raw as possible. Everything else should indeed live in scikit-sports.

glemaitre commented 6 years ago

Actually, it was the IO lib that I wanted to keep separate from scikit-sports. Regarding the issue with the amount of data to be loaded, I see two solutions: memmap, and dask dataframes and arrays. In the second case, dask-ml would be a good option for ML/AI as well.
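
To illustrate the dask option (just a sketch; the glob pattern and the "power" column are placeholders):

```python
# dask reads the CSVs lazily, so computations run chunk by chunk instead of
# loading everything into memory at once.
import dask.dataframe as dd

activities = dd.read_csv("opendata/*/activities/*.csv")

# Looks like a pandas operation, but only executes when .compute() is called.
mean_power = activities["power"].mean().compute()
```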

glemaitre commented 6 years ago

For the IO, I really think that the design of imageio could help: basically a single wrapper function with fixed requirements on the returned data type, where each plugin implements those parts for its own format.
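
Something along these lines (illustrative only; the function and plugin names are made up):

```python
# One public reader function dispatches to per-format plugins, all of which
# must return the same data type (here a pandas.DataFrame).
from pathlib import Path
import pandas as pd

_READERS = {}

def register_reader(extension):
    def decorator(func):
        _READERS[extension] = func
        return func
    return decorator

@register_reader(".csv")
def _read_csv(path):
    return pd.read_csv(path)

def read_activity(path):
    # The single wrapper: pick the plugin based on the file extension.
    return _READERS[Path(path).suffix](path)
```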

AartGoossens commented 6 years ago

I don't agree with you there, @glemaitre. In my opinion this library should be a light wrapper around osfclient which makes it possible for anyone with a little coding experience to start loading some activities from OpenData. From this point of view, solutions like Dask and numpy.memmap are overkill for this purpose.

This indeed means that there will probably be a separate, more imageio-like library targeted at ML/AI and at usage from within scikit-sports. That library could use the code in this repo, but not necessarily. Another argument for this is that many of the lots-of-activities-in-memory challenges are more generic (also applicable to loading e.g. FIT files) and therefore should also live outside this library.
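
To sketch what I mean by a light wrapper around osfclient (the project id, directory layout and helper name below are placeholders, not the real OpenData values):

```python
from pathlib import Path
from osfclient import OSF

def download_athlete(project_id: str, athlete_id: str, destination: Path) -> None:
    # Connect to the OSF project and walk its default storage.
    storage = OSF().project(project_id).storage("osfstorage")
    for remote_file in storage.files:
        # Only download files belonging to the requested athlete.
        if athlete_id in remote_file.path:
            local_path = destination / remote_file.path.lstrip("/")
            local_path.parent.mkdir(parents=True, exist_ok=True)
            with open(local_path, "wb") as fp:
                remote_file.write_to(fp)
```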

glemaitre commented 6 years ago

I might have gotten confused, actually.

A wrapper around osfclient would be a sort of dataset fetcher, wouldn't it? In that case, I agree that having a wrapper which allows fetching specific data (user, sensors, ...) would be super useful.

Where I am getting confused is on the reading of those data. That is where I would expect to use an IO layer which can return a specific format. Basically, once the data are downloaded, I would expect to use the biking IO library.

Regarding memmap or dask dataframes, it would be transparent to the user. A numpy array read in memmap mode looks exactly like a numpy array. A dask.dataframe or dask.array follows the same API as pandas and numpy (apart from the constructor, where you give the number of chunks). However, I agree that it is a bit pointless to use those when the data fit in memory. So it could be an option offered when reading the data, allowing those types to be returned on demand.
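
For example, something like this (the function and flag names are made up for illustration):

```python
# The same loader can hand back either an in-memory pandas object or a lazy
# dask one, depending on a flag.
import glob
import dask.dataframe as dd
import pandas as pd

def load_activities(pattern, lazy=False):
    if lazy:
        return dd.read_csv(pattern)  # out-of-core, chunked
    return pd.concat(pd.read_csv(path) for path in glob.glob(pattern))
```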

mpuchowicz commented 6 years ago

I don't really have the background to know what features would be best for the library. Instead I will share a couple of projects that I would like to attempt with this data and what I would need to know about the data set in order to include or exclude it.

Potential projects:

  1. Effect of the MMP time range (min, max, mean, median) used for model fitting on the CP model and 3-parameter model parameter estimates.

For this project, I would want to be able to pull data sets by season or year. The data set requirement would be at least 50 (or some other arbitrarily high number) power files that are at least 1 hour long. Demographic information such as age, sex, height, weight, competitive category, etc. would also be helpful but not a requirement.

  2. Effect of data inclusion window (30 days, 60 days, 90 days, 120 days, etc.) on MMP and model parameter estimates.

Again, I would want to be able to pull data sets by season or year. The data set requirement would be 180 days of power files with a rolling 14-day average of at least 3 power files (i.e. a week off wouldn't be an exclusion but several weeks off would).

  3. Effect of prior work (stress score, heart rate, W'bal, etc.) on MMP and model parameter estimates.

Here the data set would need to be pulled in 60-day blocks. The data set requirement would be power files with a 7-day rolling average of at least 3 power files (i.e. a couple of days off wouldn't be an exclusion but a week off would). Obviously, to do the heart rate analysis, each power file would have to have a matching heart rate file.

So in general, what would be helpful would be some way to filter based on time blocks or seasons, and by the length, consistency, and density of the power files over the block or season.
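
As an illustration of the kind of filtering being asked for here, assuming a per-activity metadata table with columns like athlete_id, date, duration_s and has_power (all assumptions, not an agreed format):

```python
import pandas as pd

metadata = pd.read_csv("activity_metadata.csv", parse_dates=["date"])

# A season/year slice.
season = metadata[(metadata["date"] >= "2017-01-01") & (metadata["date"] < "2018-01-01")]

# Length requirement: power files of at least one hour.
long_power = season[season["has_power"] & (season["duration_s"] >= 3600)]

# Density requirement: rolling 14-day count of qualifying power files per athlete.
rolling_counts = (
    long_power.sort_values("date")
              .set_index("date")
              .groupby("athlete_id")
              .rolling("14D")["duration_s"]
              .count()
)
```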

Thanks in advance for all the work that is going into the open data project, it is very appreciated. I anticipate that it will be a great resource.

mp twitter(@dpveloclinic)

AartGoossens commented 6 years ago

@glemaitre Ah, now I get your point. I think the discussion then is whether this library is specifically meant to be used from/in combination with scikit-sports, or whether its use case is more generic. I was thinking of making it more generic.

AartGoossens commented 6 years ago

@mpuchowicz Thanks a lot for your input. This helps a lot in thinking about how the interface should be and which features are needed.

AartGoossens commented 6 years ago

I created a WIP PR here. It's far from polished but it shows the direction I'm heading.

Some of the features:

Any feedback would be appreciated.

To figure out:

  * Where should the data be stored? Cwd? Home directory? Ask the user to specify the location?
  * Store the data as original csv or as e.g. parquet file (smaller size and faster loading)?

liversedge commented 6 years ago

Just push it and we can play and update?

I have some views on what should be put into the one-big-metadatafile:

I think I need to look at this stuff now, as we already have over 250k workouts and nearly 400 athletes' data!

AartGoossens commented 6 years ago

I'm fine with merging my PR now but I suspect some rewriting will happen so do not rely on the stability of the interface for now...

I think in the end there will be 2 metadata files: one with general data about athletes and a more extensive one with summary statistics for all activities. The metadata csv in the PR is of the second kind. This file contains all metadata from all activities, but for just 3 athletes it is already 1.6 MB, so to limit the file size we probably need to prune most of the columns (which is fine, I think). For local usage the generate_metadata() method might already be useful and sufficiently good as is.
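
For example, pruning could be as simple as the following (the column names are assumptions, not the actual output of generate_metadata()):

```python
import pandas as pd

metadata = pd.read_csv("metadata.csv")

# Keep only a handful of summary columns in the published file to limit its size.
published_columns = ["athlete_id", "activity_id", "date", "duration_s",
                     "has_power", "has_heartrate"]
metadata[published_columns].to_csv("metadata_pruned.csv", index=False)
```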

liversedge commented 6 years ago

That's a big metadata file :)

I'm cool with things changing rapidly, anything is better than nothing!

glemaitre commented 6 years ago

> Where should the data be stored? Cwd? Home directory? Ask the user to specify the location?

You could make something similar to this. That way the user can set it, and you have a default location. I think that our default is fine, but I would probably not hide it (i.e. .open....).
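
Presumably something like a data-home helper, e.g. (the function name, environment variable and default directory are illustrative only):

```python
import os
from pathlib import Path

def get_data_home(data_home=None):
    # Environment variable overrides a non-hidden default in the home directory.
    if data_home is None:
        data_home = os.environ.get("OPENDATA_HOME",
                                   Path.home() / "goldencheetah_opendata")
    data_home = Path(data_home)
    data_home.mkdir(parents=True, exist_ok=True)
    return data_home
```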

glemaitre commented 6 years ago

> Store the data as original csv or as e.g. parquet file (smaller size and faster loading)?

Parquet is nice. I would go for it if we are not going to do anything with the metadata file directly (e.g. IO and visualization with Excel).

liversedge commented 6 years ago

I also vote for something like parquet -- the data is likely to grow to millions of workout files over the next 2-3 years.

AartGoossens commented 6 years ago

> You could make something similar to this. That way the user can set it, and you have a default location. I think that our default is fine, but I would probably not hide it (i.e. .open....).

Good idea, I like that approach. I'm also fine with not hiding the directory. I'll tackle this in another PR.

The change to parquet is quite easy and can be done in a later PR.
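
Roughly like this (assuming pyarrow or fastparquet is installed; file names are placeholders):

```python
import pandas as pd

metadata = pd.read_csv("metadata.csv")
metadata.to_parquet("metadata.parquet")          # smaller on disk

metadata = pd.read_parquet("metadata.parquet")   # faster to load back
```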

AartGoossens commented 5 years ago

Since I did a complete rewrite of the Python library, I am tempted to close this issue, even though some of the discussion points here (e.g. about Parquet files) have not been resolved (although I do not think they are completely relevant anymore).

@liversedge @glemaitre are you ok with closing?