GoldenCheetah / sweatpy

Endurance sports analysis library for Python
http://sweatpy.gssns.io
MIT License
76 stars 22 forks source link

Add from_strava(activity_id) method to WDF #8

Open sladkovm opened 6 years ago

sladkovm commented 6 years ago

It would be nice to be able to populate WDF from multiple sources. Strava is one very obvious pick.

To make this work efficiently few things must be agreed upon:

  1. Access Strava API directly using requests and json (currently implemented in strava.py and very easy to reason about) or rely on stravalib with all benefits of errors handling, but with an overhead of response-object mapping.

  2. Decide on what to do with the @requires decorator. Strava calls power - watts for example. One option is to get rid of the decorator and allow methods to specify the column name as an input argument (what seaborn or pandas.plot do for example). Another option is to rename the columns on the fly to canonical names so @requires doesn't get confused. I do prefer the first option.

AartGoossens commented 6 years ago

A. Good idea to make working with strava data as easy as possible. I'm not sure if I would want to add this to the wdf class because I like that class to be as data source agnostic as possible. I think it could work to have a method sweat.io.strava.get_activity that returns a wdf.

B. To be honest I do not really have a preference for stravalib or our own code. Maybe 1 argument for stravalib is not having to fix things ourselves when strava deprecates/updates their api.

C. I want to get rid of the requires decorator. It doesn't offer much benefit and is counter intuitive. But I do want to enforce storing power always in the same column ('power') in the wdf and this should be handled in io.{strava/goldencheetah/etc}. Example: If strava returns power as "watts", the logic in ios.strava should handle storing it in the power column while creating the wdf. I think this ensures that, no matter the source of the workout, the wdf always works the same way and end users don't have to worry about the underlying data structure that Strava returns. Does this make sense to you?

sladkovm commented 6 years ago

A. I'm might be biased by knowing that WDF subclassing DF, but IO methods of DF is one of the cornerstones of this library. The legend says that it all started with making read_csv for humans. I naively assume that subclassing would favor enriching the native IO methods with sweat domain specific ones: Strava, Golden Cheetah, FIT etc.

B. I've checked it yesterday - still don't like it. Their to_dict() methods does not guarantee the same data structure that was ingested from Strava API. It is a no go for me. If I ever decide to pull strava api using JS I want to get the same data structure back as I'm getting it in my python code.

C. I probably still don't understand the actual purpose of WDF than. If it is enriched DF, then I would expect all new methods to follow the same design philosophy as native ones. In this particular case, it would mean being able to call any method on any column irrespectively of its name.

Said that, if I will abstract away from the fact that it has DF under the bonnet and will simply see it as a Class designed for its own purpose, then few factory methods indeed can take care of massaging various data sources into canonical form. But at the moment I personally don't envision how such Class would enter the data flows I'm working with.

AartGoossens commented 6 years ago

A. I don't think I agree with your point. An end user that wants to import data from strava would just use sweat.io.strava and get a wdf in return, he would never have to deal with import from sweat.models.dataframes.

B. stravalib is a no-go then. :)

C. I think I get your point: since I moved all business logic to pure functions (the algorithms) I am not using pandas.Dataframe builtin methods for that either. But I still see the benefit of being able to do wdf.compute_mean_max_power() (without specifying where the power is stored) or wdf.power.mean(). If I have time next week I'll create a POC for retrieving data from GoldenCheetah so I can show the workflow I have in mind.

sladkovm commented 6 years ago

A.

What I wanted to say is if we keep the DF nature intact, then we should not shy away from using IO methods (https://pandas.pydata.org/pandas-docs/stable/io.html). If we treat WDF as a completely separate entity with it's own design decision, then data source specific factory method is OK for me.

AartGoossens commented 6 years ago

With your analogy my approach is identical to pandas: IO methods are not on the dataframe class but are separate methods in the pandas/sweat library: pandas.read_csv vs sweat.io.strava.get_activity. As far as I know there are no IO methods on the pandas.Dataframe class.

Two extra things I want to mention:

  1. Would it make a difference for you if the wdf class lives in io.models? That would make it clear that the wdf is a class that is typically used by the IO functionality.
  2. Another argument for the wdf to be data source agnostic is that adding a new source, for example runkeeper, will be as straightforward as adding a sweat.io.runkeeper module that contains all the necessary logic without changing (or "polluting" if there will be many sources) any of the other modules.
sladkovm commented 6 years ago

You are right, it is indeed a factory method... If that is how pandas does it, then I absolutely don't want to invent anything new.

  1. I don't have a strong opinion on it. io as a go-to place for all data ingestion methods sounds reasonable. I'm not very excited about models as a name. As a new user, I wouldn't feel like to browse in such place for data ingestion methods. What do you think about calling it what it is - athletic_pandas or workout_pandas or sweat_pandas? Such module would contain WDF class as well as various factory methods to make WDF objects from different sources. The actual API calls to the sources are handled in e.g. io.strava so users can also make use of these methods to populate traditional DFs if it fit's their flow better.

  2. I'm ok with WDF having backed in nomenclature. As in the previous bullet, I propose:

    • keep sweat.io.resouce modules for low-level handling of APIs with returning a dict that can be injected directly into traditional DF
    • factory methods to generate WDF from different sources would reside in the, for example, sweat.io.sweat_pandas module and will take care of data normalization.
AartGoossens commented 6 years ago
  1. I think end users shouldn't have to deal with importing from sweat.io.models, for example sweat.io.goldencheetah should handle that. The end users just imports sweat.io.goldencheetah.get_latest_activity(), calls it and gets a wdf in return.

  2. Ah, this is interesting: Do you think there is a use-case where it's a drawback if getting data from Strava always returns a wdf (instead of a dict)? If that's the case then I understand your hesitance with the wdf. I will create a POC of how I envision working with data sources. I think this should make more clear what our differences are and we can continue discussing from there. Deal? :)

sladkovm commented 6 years ago
  1. Ok, because my alternative proposal would be to allow importing WDF and WDF returning read_strava method directly from the sweat.io namespace (basically what pandas does with their IO methods)

  2. It is a complex question:

Following your proposal, the structure of io could be:

sweat.io.name_of_the_resource - returns WDF and is easily discoverable and used by new to Python users sweat.io.name_of_the_resource.api - contains low-level API calls and returns serializable outputs (typically a dict or a list) and is used by seasoned developers, who want to incorporate sweat into their DF based workflow without locking into WDF design choices.

AartGoossens commented 6 years ago

Good idea, in your proposal both work flows are supported without adding much nesting to the modules. So then the structure for sweat.io would be?:

+-- io
|     +-- strava
|           +-- __init__.py (contains methods that return wdf; uses code from api to fill wdf)
|           +-- api.py (contains low-lever api calls)
|     +-- goldencheetah
|           +-- __init__.py
|           +-- api.py
|     +-- ...etc

Or if the api is more complex even:

+-- io
|     +-- strava
|           +-- __init__.py (contains methods that return wdf)
|           +-- api (contains low-lever api calls)
|                 +-- __init__.py
|                 +-- complex_code.py
|     +-- ...etc

If you think the low-level api is not too much nested down (sweat.io.strava.api) then I'm in.

sladkovm commented 6 years ago

I think it is ok, the sweat.io.strava.api is meant for brave developers and they should be able to discover it.

btw, is it a normal practice to define functions in the __init__.py modules? We can also define everything in api.py and make WDF constructors be available in the top namespace through import in __init__.py

AartGoossens commented 6 years ago

Good point about code in __init__.py: It's good practice not to have code there. Importing from init() could work or if we want to be really strict about we could go for something like this:

+-- io
|     +-- strava.py
|     +-- goldencheetah.py
|     +-- api
|           +-- strava.py
|           +-- goldencheetah.py