AartGoossens opened this issue 6 years ago
I think we should separate augmenting the raw data files with an index since we will likely publish the index rather than ask users to generate it.
Then we have separate tooling for retrieving and formatting data that uses the index to support filtering and so on.
As soon as the tooling starts to post-process the data, then I think it belongs in scikit-sports?
I'm not sure if I understand what you mean but assuming you're talking about my point (1): I'm not proposing to change the data in OSF but want to create a file in this repository with a 'summary' of the data that is available. This library should not do anything with the data but should make accessing the data as easy as possible while returning the data as raw as possible. Everything else should indeed live in scikit-sports.
Actually, it was the IO lib that I wanted to keep separate from scikit-sports. Regarding the issue with the amount of data to be loaded, I see 2 solutions: memmap, and dask dataframe/array. In the second case, dask-ml would be a good ML/AI solution as well.
For the IO I really think that the design of imageio could help: basically a single wrapper function which should follow those requirements for the data type, and then each plugin should implement the format-specific parts.
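To illustrate that imageio-style design, here is a minimal, hypothetical sketch (none of these names exist in this repo): one public read() function plus a plugin registry, where each plugin handles one file format.

```python
# Hypothetical sketch of an imageio-style IO design: a single public read()
# entry point that dispatches to per-format plugins. All names are illustrative.
_READERS = {}

def register_reader(extension):
    """Decorator that registers a reader function for a file extension."""
    def decorator(func):
        _READERS[extension] = func
        return func
    return decorator

@register_reader(".csv")
def _read_csv(path):
    import pandas as pd
    return pd.read_csv(path)

def read(path):
    """Single wrapper function; the plugins implement the format-specific parts."""
    for extension, reader in _READERS.items():
        if path.endswith(extension):
            return reader(path)
    raise ValueError(f"No reader registered for {path}")
```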
I don't agree with you there @glemaitre . In my opinion this library should be a light wrapper around osfclient which makes it possible for anyone with a little coding experience to start loading some activities from OpenData. From this point of view, solutions like Dask and numpy.memmap are overly complicated and overkill for this purpose.
This indeed means that there will probably be a separate, more imageio-like library that is targeted at ML/AI and usage from within scikit-sports. That library could use the code in this repo, but not necessarily. Another argument for this is that many of the lots-of-activities-in-memory challenges are more generic (also applicable to loading e.g. FIT files) and therefore should also live outside this library.
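As a rough illustration of what such a light wrapper could look like, here is a hedged sketch built directly on osfclient; the project id and the download_athlete() helper are placeholders, not part of this repository or the real OpenData OSF project.

```python
# Hypothetical sketch of a thin osfclient wrapper; "abcde" and download_athlete()
# are placeholders, not the real OpenData project id or an existing helper.
from osfclient import OSF

def download_athlete(project_id, athlete_id, target_dir="."):
    """Download all files whose path contains athlete_id from an OSF project."""
    osf = OSF()  # anonymous access works for public projects
    project = osf.project(project_id)
    storage = project.storage("osfstorage")
    for remote_file in storage.files:
        if athlete_id in remote_file.path:
            filename = remote_file.path.split("/")[-1]
            with open(f"{target_dir}/{filename}", "wb") as fp:
                remote_file.write_to(fp)

# Example (placeholder ids):
# download_athlete("abcde", "athlete-123", target_dir="opendata")
```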
I might have gotten confused, actually.
A wrapper around osfclient would be a sort of dataset fetcher, wouldn't it? In that case, I agree that having a wrapper which allows fetching specific data (user, sensors, ...) would be super useful.
Where I am getting confused is on the reading of those data. That is where I would expect an IO layer which can return a specific format. Basically, once the data is downloaded, I would expect to use the cycling IO library.
Regarding memmap or dask dataframe, it will be transparent to the user. A numpy array read in memmap mode looks exactly like a regular numpy array. A dask.dataframe or dask.array follows the same API as pandas and numpy, respectively (apart from the constructor, where you give the number of chunks). However, I agree that it is a bit silly to use those when the data fits in memory. So it might be an option when reading the data, allowing to return those types on demand.
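For reference, a minimal sketch of the two options mentioned above; the file names are placeholders and this is not code from this repository.

```python
# Minimal illustration of lazy loading; "power.npy" and "activities.csv" are
# placeholder file names.
import numpy as np
import dask.dataframe as dd

# A memory-mapped numpy array behaves like a regular ndarray, but data is only
# read from disk when the corresponding slices are accessed.
power = np.load("power.npy", mmap_mode="r")
print(power[:10].mean())

# A dask DataFrame mirrors the pandas API; work is split into chunks and only
# executed when .compute() is called.
activities = dd.read_csv("activities.csv")
print(activities["power"].mean().compute())
```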
I don't really have the background to know what features would be best for the library. Instead I will share a couple of projects that I would like to attempt with this data and what I would need to know about the data set in order to include or exclude it.
Potential projects:
For this project, I would want to be able to pull data sets by season or year. The data set requirement would be at least 50 (or some other arbitrarily high number) power files that are at least 1 hour long. Demographic information such as age, sex, height, weight, competitive category etc. would also be helpful but not a requirement.
Again, I would want to be able to pull data sets by season or year. The data set requirement would be 180 days of power files with a rolling 14-day average of at least 3 power files (i.e. a week off wouldn't be an exclusion but several weeks off would).
Here the data set would need to be pulled in 60-day blocks. The data set requirement would be power files with a 7-day rolling average of at least 3 power files (i.e. a couple of days off wouldn't be an exclusion but a week off would). Obviously, to include heart rate, each power file would have to have a matching heart rate file.
So in general, what would be helpful would be some way to filter based on time blocks or seasons, and by the length, consistency, and density of the power files over the block or season.
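To make that filtering concrete, here is a hedged sketch assuming a pandas metadata DataFrame with hypothetical columns athlete_id, date (already parsed as datetime) and duration_seconds; this is not an actual schema from the project.

```python
# Hypothetical filtering example; the column names are illustrative only.
import pandas as pd

def filter_season(metadata: pd.DataFrame, year: int,
                  min_files: int = 50, min_duration: int = 3600) -> pd.DataFrame:
    """Keep athletes with at least `min_files` activities of at least
    `min_duration` seconds within the given year."""
    season = metadata[metadata["date"].dt.year == year]
    long_enough = season[season["duration_seconds"] >= min_duration]
    counts = long_enough.groupby("athlete_id").size()
    keep = counts[counts >= min_files].index
    return long_enough[long_enough["athlete_id"].isin(keep)]
```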
Thanks in advance for all the work that is going into the open data project, it is very appreciated. I anticipate that it will be a great resource.
mp (Twitter: @dpveloclinic)
@glemaitre Ah, now I get your point. I think the discussion then is whether this library is specifically meant to be used from, or in combination with, scikit-sports or whether its use case is more generic. I was thinking of making it more generic.
@mpuchowicz Thanks a lot for your input. This helps a lot in thinking about how the interface should be and which features are needed.
I created a WIP PR here. It's far from polished but it shows the direction I'm heading.
Some of the features: downloaded data is stored in a local .opendata directory. Any feedback would be appreciated.
To figure out:
Where should the data be stored? Cwd? Home directory? Ask the user to specify the location?
Store the data as original csv or as e.g. parquet file (smaller size and faster loading)?
Just push it and we can play and update?
I have some views on what should be put into the one big metadata file:
I think I need to look at this stuff now, as we now have over 250k workouts and nearly 400 athletes' data!
I'm fine with merging my PR now but I suspect some rewriting will happen so do not rely on the stability of the interface for now...
I think in the end there will be 2 metadata files: one with general data about athletes and a more extensive one with summary statistics for all activities.
The metadata csv in the PR is of the second kind. This file contains all metadata from all activities, but for 3 athletes this file is already 1.6MB, so to limit the file size we probably need to prune most of the columns (which is fine I think). For local usage the generate_metadata() method might already be useful and sufficiently good as is.
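As a sketch of what such a pruned summary file could contain (the generate_metadata() method in the PR may work differently, and all column names below are illustrative):

```python
# Hypothetical per-activity summary; not the actual output of generate_metadata().
import pandas as pd

def summarize_activity(activity: pd.DataFrame, athlete_id: str, activity_id: str) -> dict:
    """Reduce one raw activity DataFrame to a small row of summary statistics."""
    return {
        "athlete_id": athlete_id,
        "activity_id": activity_id,
        "n_samples": len(activity),
        "mean_power": activity["power"].mean() if "power" in activity.columns else None,
        "has_heartrate": "heartrate" in activity.columns,
    }

# metadata = pd.DataFrame([summarize_activity(a, aid, act_id) for a, aid, act_id in activities])
```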
That's a big metadata file :)
I'm cool with things changing rapidly, anything is better than nothing!
Where should the data be stored? Cwd? Home directory? Ask the user to specify the location?
You could make something similar to this. In this way, the user can set it and you have a default location. I think that our default is fine but I will probably not hide it (i.e. .open....).
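A small sketch of that settable-with-a-default pattern, assuming a hypothetical OPENDATA_HOME environment variable and a non-hidden opendata directory in the user's home as the default (neither name is an existing convention here):

```python
# Hypothetical data-home resolution; OPENDATA_HOME and ~/opendata are placeholders.
import os
from pathlib import Path

def get_data_home(data_home=None) -> Path:
    """Return the directory where downloaded data is stored, creating it if needed."""
    if data_home is None:
        data_home = os.environ.get("OPENDATA_HOME", Path.home() / "opendata")
    path = Path(data_home).expanduser()
    path.mkdir(parents=True, exist_ok=True)
    return path
```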
Store the data as original csv or as e.g. parquet file (smaller size and faster loading)?
Parquet is nice. I would go for it if we are not going to do anything with the metadata file manually (e.g. IO and visualization with Excel).
I also vote for something like parquet -- the data is likely to grow to millions of workout files over the next 2-3 years.
You could make something similar to this. In this way, the user can set it and you have a default location. I think that our default is fine but I will probably not hide it (i.e. .open....).
Good idea, I like that approach. I'm also fine with not hiding the directory. I'll tackle this in another PR.
The change to parquet is quite easy and can be done in a later PR.
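For what it's worth, the switch is essentially a one-liner with pandas (requires pyarrow or fastparquet; the file names below are placeholders):

```python
# Illustrative csv-to-parquet round trip; file names are placeholders.
import pandas as pd

metadata = pd.read_csv("metadata.csv")
metadata.to_parquet("metadata.parquet")         # smaller on disk
metadata = pd.read_parquet("metadata.parquet")  # and faster to load back
```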
Since I did a complete rewrite of the Python library, I am tempted to close this issue even though some points of the discussion here (e.g. about Parquet files) have not been resolved (although I do not think they are completely relevant anymore).
@liversedge @glemaitre are you ok with closing?
Continuing this discussion here.
I am working on some Python code to make working with OpenData easier. It's far from finished (it only sort of works for my use case now) but I would like to share it, and putting it in this repository makes sense. Before I spend more time on polishing it I'd like some input on what the library should look like.
Features I would like to have in the library include loading activity data into memory; I am thinking of using a pandas.DataFrame for this. Any input is welcome!
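For example, a single downloaded activity could be handed back roughly like this (the file path is purely hypothetical and not how OpenData is actually organized):

```python
# Hypothetical example of returning one activity as a pandas.DataFrame.
import pandas as pd

activity = pd.read_csv("opendata/athlete-123/2018-01-01-07-00-00.csv")
print(activity.head())
```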