sladkovm opened this issue 6 years ago
A. Good idea to make working with Strava data as easy as possible. I'm not sure I would want to add this to the wdf class, because I like that class to be as data-source-agnostic as possible. I think it could work to have a method sweat.io.strava.get_activity that returns a wdf.
B. To be honest I do not really have a preference for stravalib or our own code. Maybe one argument for stravalib is not having to fix things ourselves when Strava deprecates/updates their API.
C. I want to get rid of the requires decorator. It doesn't offer much benefit and is counterintuitive. But I do want to enforce always storing power in the same column ('power') in the wdf, and this should be handled in io.{strava/goldencheetah/etc}. Example: if Strava returns power as "watts", the logic in io.strava should handle storing it in the power column while creating the wdf. I think this ensures that, no matter the source of the workout, the wdf always works the same way and end users don't have to worry about the underlying data structure that Strava returns. Does this make sense to you?
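To make C a bit more concrete, this is roughly what I have in mind (just a sketch; the function name and the column mapping are made up, not the actual implementation):

```python
import pandas as pd

# Sketch of what sweat/io/strava.py could do: rename Strava's "watts" to the
# canonical "power" while the wdf is being created, so every wdf exposes the
# same columns regardless of the data source.
COLUMN_MAPPING = {
    "watts": "power",
    "velocity_smooth": "speed",
}

def streams_to_wdf(streams: dict) -> pd.DataFrame:
    """Turn a raw Strava streams dict into a (future) wdf with canonical column names."""
    return pd.DataFrame(streams).rename(columns=COLUMN_MAPPING)

# Fake Strava payload for illustration:
raw = {"time": [0, 1, 2], "watts": [150, 160, 155]}
wdf = streams_to_wdf(raw)
assert list(wdf.columns) == ["time", "power"]
```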
A.
I might be biased by knowing that WDF subclasses DF, but the IO methods of DF are one of the cornerstones of this library. The legend says it all started with making read_csv for humans. I naively assumed that subclassing would favor enriching the native IO methods with sweat-domain-specific ones: Strava, Golden Cheetah, FIT, etc.
B. I checked it again yesterday - I still don't like it. Their to_dict() method does not guarantee the same data structure that was ingested from the Strava API. That is a no-go for me. If I ever decide to pull the Strava API using JS, I want to get the same data structure back as I get in my Python code.
C. Then I probably still don't understand the actual purpose of WDF. If it is an enriched DF, then I would expect all new methods to follow the same design philosophy as the native ones. In this particular case, that would mean being able to call any method on any column irrespective of its name.
That said, if I abstract away from the fact that it has a DF under the bonnet and simply see it as a class designed for its own purpose, then a few factory methods can indeed take care of massaging various data sources into a canonical form. But at the moment I personally don't envision how such a class would enter the data flows I'm working with.
A. I don't think I agree with your point. An end user who wants to import data from Strava would just use sweat.io.strava and get a wdf in return; they would never have to deal with importing from sweat.models.dataframes.
B. stravalib is a no-go then. :)
C. I think I get your point: since I moved all the business logic to pure functions (the algorithms), I am not using pandas.DataFrame built-in methods for that either. But I still see the benefit of being able to do wdf.compute_mean_max_power() (without specifying where the power is stored) or wdf.power.mean(). If I have time next week I'll create a POC for retrieving data from GoldenCheetah so I can show the workflow I have in mind.
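To already give an idea of that workflow (a rough sketch, not the real wdf implementation; the algorithm is simplified):

```python
import pandas as pd

def mean_max_power(power: pd.Series) -> pd.Series:
    """Pure algorithm function: best average power for every window length (sketch)."""
    windows = range(1, len(power) + 1)
    return pd.Series([power.rolling(w).mean().max() for w in windows], index=windows)

class WorkoutDataFrame(pd.DataFrame):
    """Sketch of a wdf: a thin DataFrame subclass that knows where power is stored."""

    @property
    def _constructor(self):
        return WorkoutDataFrame

    def compute_mean_max_power(self) -> pd.Series:
        # Delegates to the pure function; the user never has to name the column.
        return mean_max_power(self["power"])

wdf = WorkoutDataFrame({"power": [200, 210, 250, 230]})
print(wdf.power.mean())              # plain pandas access still works
print(wdf.compute_mean_max_power())  # domain method without specifying a column
```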
A.
What I wanted to say is: if we keep the DF nature intact, then we should not shy away from using the IO methods (https://pandas.pydata.org/pandas-docs/stable/io.html). If we treat WDF as a completely separate entity with its own design decisions, then a data-source-specific factory method is OK with me.
With your analogy my approach is identical to pandas: the IO methods are not on the DataFrame class but are separate methods in the pandas/sweat library: pandas.read_csv vs sweat.io.strava.get_activity. As far as I know there are no read methods on the pandas.DataFrame class.
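In code the analogy would be something like this (the sweat call is of course hypothetical at this point):

```python
import io
import pandas as pd

# pandas keeps reading off the DataFrame class: you call pandas.read_csv,
# not DataFrame.read_csv.
df = pd.read_csv(io.StringIO("time,watts\n0,150\n1,160\n"))
print(df)

# The sweat equivalent would follow the same pattern (hypothetical names):
#   from sweat.io import strava
#   wdf = strava.get_activity(activity_id=1234)
```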
Two extra things I want to mention:
- What do you think about moving the wdf to io.models? That would make it clear that the wdf is a class that is typically used by the IO functionality.
- Adding a new data source would then only mean adding e.g. a sweat.io.runkeeper module that contains all the necessary logic, without changing (or "polluting", if there will be many sources) any of the other modules.
You are right, it is indeed a factory method... If that is how pandas does it, then I absolutely don't want to invent anything new.
I don't have a strong opinion on it. io as a go-to place for all data ingestion methods sounds reasonable. I'm not very excited about models as a name, though. As a new user, I wouldn't feel like browsing such a place for data ingestion methods. What do you think about calling it what it is - athletic_pandas or workout_pandas or sweat_pandas? Such a module would contain the WDF class as well as the various factory methods that make WDF objects from different sources. The actual API calls to the sources are handled in e.g. io.strava, so users can also make use of these methods to populate traditional DFs if that fits their flow better.
I'm OK with WDF having baked-in nomenclature. As in the previous bullet, I propose:
- sweat.io.resource modules for low-level handling of the APIs, returning a dict that can be injected directly into a traditional DF
- a sweat.io.sweat_pandas module that holds the WDF class and its factory methods and will take care of data normalization.
I think end users shouldn't have to deal with importing from sweat.io.models; for example, sweat.io.goldencheetah should handle that. The end user just imports sweat.io.goldencheetah.get_latest_activity(), calls it and gets a wdf in return.
Ah, this is interesting: do you think there is a use case where it's a drawback that getting data from Strava always returns a wdf (instead of a dict)? If that's the case then I understand your hesitance about the wdf. I will create a POC of how I envision working with data sources. I think this should make it clearer where our differences are, and we can continue discussing from there. Deal? :)
Ok, because my alternative proposal would be to allow importing WDF and a WDF-returning read_strava method directly from the sweat.io namespace (basically what pandas does with their IO methods).
It is a complex question:
I do not see velometria.com, for example, relying on WDF. It uses native Strava nomenclature throughout the app. The flow is to pull the Strava API, load the result into a DF, and apply functions from sweat.algorithms to the DF.
Should sweat contain functions to call the different APIs? That is the choice we have to make. I think it is a good idea, because it will help with onboarding new users, especially in combination with WDF, but not limited to it.
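For reference, that flow is basically this (fake data; the function below is just a stand-in for a sweat.algorithms function, not the real one):

```python
import pandas as pd

def normalized_power(power: pd.Series) -> float:
    """Stand-in for a pure sweat.algorithms function that works on any Series."""
    return float((power.rolling(30, min_periods=1).mean() ** 4).mean() ** 0.25)

# 1. Pull the Strava API (fake payload here) ...
streams = {"time": [0, 1, 2, 3], "watts": [200, 220, 210, 230]}
# 2. ... load the result into a plain DF, keeping Strava's own column names ...
df = pd.DataFrame(streams)
# 3. ... and apply the algorithm to whichever column holds the power.
print(normalized_power(df["watts"]))
```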
Following your proposal, the structure of io could be:
- sweat.io.name_of_the_resource - returns a WDF, is easily discoverable, and is aimed at users who are new to Python
- sweat.io.name_of_the_resource.api - contains the low-level API calls and returns serializable outputs (typically a dict or a list); it is aimed at seasoned developers who want to incorporate sweat into their DF-based workflow without locking themselves into the WDF design choices.
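Put together, the two layers would look something like this (all names here are placeholders, not a final API):

```python
import pandas as pd

def api_retrieve_streams(activity_id: int) -> dict:
    """sweat.io.<resource>.api layer: low-level call, serializable output (fake data here)."""
    return {"time": [0, 1, 2], "watts": [210, 220, 215]}

def get_activity(activity_id: int) -> pd.DataFrame:
    """sweat.io.<resource> layer: builds a normalized (future) wdf on top of the api layer."""
    raw = api_retrieve_streams(activity_id)
    return pd.DataFrame(raw).rename(columns={"watts": "power"})

# Seasoned developer: stays with plain dicts and DFs, no WDF lock-in.
df = pd.DataFrame(api_retrieve_streams(1234))

# New user: one call, gets the normalized wdf back.
wdf = get_activity(1234)
print(wdf["power"].max())
```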
Good idea, in your proposal both workflows are supported without adding much nesting to the modules.
So the structure for sweat.io would then be:
+-- io
|   +-- strava
|   |   +-- __init__.py (contains methods that return wdf; uses code from api to fill wdf)
|   |   +-- api.py (contains low-level api calls)
|   +-- goldencheetah
|   |   +-- __init__.py
|   |   +-- api.py
|   +-- ...etc
Or, if the api is more complex, even:
+-- io
|   +-- strava
|   |   +-- __init__.py (contains methods that return wdf)
|   |   +-- api (contains low-level api calls)
|   |   |   +-- __init__.py
|   |   |   +-- complex_code.py
|   |   |   +-- ...etc
If you think the low-level api is not nested down too far (sweat.io.strava.api), then I'm in.
I think it is OK; sweat.io.strava.api is meant for brave developers, and they should be able to discover it.
Btw, is it normal practice to define functions in __init__.py modules? We could also define everything in api.py and make the WDF constructors available in the top namespace through an import in __init__.py.
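I mean something like this (a hypothetical two-file layout shown in one snippet):

```python
# sweat/io/strava/api.py - all the actual logic lives here
def get_activity(activity_id):
    """Low-level call returning a plain dict (implementation omitted in this sketch)."""
    ...

# sweat/io/strava/__init__.py - no logic, only a re-export:
# from .api import get_activity

# End users would then still write:
# from sweat.io import strava
# wdf = strava.get_activity(activity_id=1234)
```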
Good point about code in __init__.py: it's good practice not to have code there. Importing it in __init__.py could work, or, if we want to be really strict about it, we could go for something like this:
+-- io
|   +-- strava.py
|   +-- goldencheetah.py
|   +-- api
|   |   +-- strava.py
|   |   +-- goldencheetah.py
It would be nice to be able to populate a WDF from multiple sources. Strava is one very obvious pick.
To make this work efficiently, a few things must be agreed upon:
- Access the Strava API directly using requests and json (currently implemented in strava.py and very easy to reason about), or rely on stravalib with all the benefits of error handling, but with the overhead of response-object mapping.
- Decide what to do with the @requires decorator. Strava calls power "watts", for example. One option is to get rid of the decorator and allow methods to specify the column name as an input argument (what seaborn or pandas.plot do, for example). Another option is to rename the columns on the fly to canonical names so @requires doesn't get confused. I prefer the first option.
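To make the two options concrete (sketches only; the function body is a placeholder, not a real algorithm):

```python
import pandas as pd

df = pd.DataFrame({"watts": [200, 210, 250, 230]})  # Strava's naming

# Option 1: drop @requires and let the caller name the column,
# the way seaborn or DataFrame.plot take x=/y= arguments.
def mean_max_power(data: pd.DataFrame, column: str = "power") -> float:
    return float(data[column].max())  # placeholder for the real computation

print(mean_max_power(df, column="watts"))

# Option 2: rename columns to canonical names on ingestion, so downstream code
# (or a @requires-style check) can always assume a "power" column exists.
df_canonical = df.rename(columns={"watts": "power"})
print(mean_max_power(df_canonical))
```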