better / convoys

Implementation of statistical models to analyze time lagged conversions
https://better.engineering/convoys/
MIT License

Decoupling visualization from models #124

Closed davidgasquez closed 4 years ago

davidgasquez commented 4 years ago

Hey there! We, the Buffer data team, recently discovered this awesome package, and we're starting to use it in different analyses.

We're used to doing most of our plotting in R. I've started working on getting the data back out of the Matplotlib figure, but that seems like a hack, so I was wondering if you've thought about decoupling the plotting from the modelling.

Prophet, from Facebook, does a great job at that: it returns a DataFrame with the data required to plot, and the same library also has a default .plot function that uses Matplotlib. That makes it easy for users to bring their own plotting framework.

I'm happy to help with the coding if I can figure out how to better do the decoupling. Let me know if you have any questions too. :smile:

Thanks for open sourcing such a helpful library!

PS: We've also found that using a large group size results in a confusing legend in the final plot. This one can probably be fixed using the proper Matplotlib arguments, though. This example shows weeks in one of our plots:

[Image: 2020-03-24_16:28:35_183x348]
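For the legend issue, here is a minimal sketch of the kind of Matplotlib arguments that might help. It is not taken from convoys: the 20 weekly lines are made-up stand-ins for cohort curves, and the legend is moved outside the axes and split into two columns with a smaller font.

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

# Made-up stand-in data: one line per weekly cohort
for week in range(1, 21):
    ax.plot([0, 1], [0, week / 100], label=f"week {week}")

# Put the legend outside the plot area, in two columns, with a smaller font
legend = ax.legend(
    ncol=2,
    fontsize="small",
    loc="upper left",
    bbox_to_anchor=(1.02, 1.0),
    title="cohort",
)

# bbox_inches="tight" keeps the outside legend from being cropped
buf = io.BytesIO()
fig.savefig(buf, format="png", bbox_inches="tight")
```

Whether this is enough depends on how many groups there are; with very many cohorts, plotting only a subset of groups may read better than any legend layout.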

davidgasquez commented 4 years ago

Heya @erikbern! Curious if you have any feedback on this idea. Would love to hear what you think and what the best approach could be in this case.

erikbern commented 4 years ago

Hi @davidgasquez! Thanks for the comment.

If you look at the code for plotting things, it actually doesn't do much that you can't do without plotting:

  1. It fits a model https://github.com/better/convoys/blob/master/convoys/plotting.py#L62
  2. It gets predictions https://github.com/better/convoys/blob/master/convoys/plotting.py#L84

You should be able to do those things quite easily without dealing with the plotting code, by just grabbing the models and fitting them yourself.

The interface in convoys follows scikit-learn a bit more, so the models don't operate on dataframes directly, which is maybe what you want? If you want to build some kind of dataframe-to-dataframe thing that takes a dataframe of censored data and returns the "survival" curves, then I think you could definitely build that, and it's possible the plotting could be rewritten to be more dataframe-native. Right now the dataframe-specific code is limited to the code here https://github.com/better/convoys/blob/master/convoys/utils.py#L54
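As a rough illustration of the dataframe side of that idea, the sketch below turns a dataframe of censored data into group indices, conversion flags, and observed times. It is not the code in convoys/utils.py: the helper name `to_censored_arrays`, the column names, and the choice of days as the time unit are all assumptions made for the example.

```python
import pandas as pd


def to_censored_arrays(df, group_col, created_col, converted_col, now):
    """Hypothetical helper: dataframe -> (groups, G, B, T).

    G: integer group index per row
    B: True if the row converted
    T: days from creation to conversion, or to `now` if not converted yet
    """
    groups = sorted(df[group_col].unique())
    index_of = {g: i for i, g in enumerate(groups)}
    G = df[group_col].map(index_of).to_numpy()
    B = df[converted_col].notna().to_numpy()
    # Censored rows are observed only up to `now`
    end = df[converted_col].fillna(now)
    T = ((end - df[created_col]) / pd.Timedelta(days=1)).to_numpy()
    return groups, G, B, T


# Made-up example data: the second user has not converted yet
df = pd.DataFrame({
    "week": ["w1", "w1", "w2"],
    "created": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-08"]),
    "converted": pd.to_datetime(["2020-01-03", None, "2020-01-09"]),
})
groups, G, B, T = to_censored_arrays(
    df, "week", "created", "converted", now=pd.Timestamp("2020-01-15")
)
```

Arrays in this shape are what a scikit-learn style `fit(G, B, T)` call would consume, and `groups` keeps the original labels so they can be used as column names in a result dataframe.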

davidgasquez commented 4 years ago

Hey there! Thanks for the feedback and guidelines. I played a bit with the raw models and got this small function that returns a Pandas DataFrame. Sharing it in case anyone finds it helpful in the future!

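import numpy
import pandas as pd
from convoys.plotting import _models  # private mapping: model name -> constructor

def get_dataframe(G, B, T, model='kaplan-meier', ci=None, groups=None):
  # Assumption: G already holds integer group indices, so when no labels
  # are passed, the indices themselves are used as column names.
  if groups is None:
    groups = sorted(set(G))

  t_max = max(1, max(T))

  # Fit the chosen model on the raw arrays
  m = _models[model](ci=bool(ci))
  m.fit(G, B, T)

  # Evaluate each group's curve on a shared time grid
  t = numpy.linspace(0, t_max, 1000)
  df = pd.DataFrame(index=t)

  # One column per group. Only point estimates are stored; intervals are not returned.
  for j, group in enumerate(groups):
    df[str(group)] = m.cdf(j, t).T

  return df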
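Since the goal was to plot in R, one possible next step is to reshape that wide frame (time as index, one column per group) into long form and write it to CSV. This is a sketch only: the small `wide` frame below uses made-up values standing in for the function's output, and the column names `t`, `group`, and `conversion_rate` are arbitrary choices.

```python
import io

import pandas as pd

# Made-up stand-in for the wide output: index = time, one column per group
wide = pd.DataFrame(
    {"2020-01": [0.0, 0.1, 0.2], "2020-02": [0.0, 0.05, 0.15]},
    index=[0.0, 0.5, 1.0],
)

# Long ("tidy") form: one row per (time, group) pair
tidy = (
    wide.rename_axis("t")
    .reset_index()
    .melt(id_vars="t", var_name="group", value_name="conversion_rate")
)

# Write to an in-memory buffer here; a file path works the same way
buf = io.StringIO()
tidy.to_csv(buf, index=False)
csv_text = buf.getvalue()
```

A long table like this maps directly onto a grouped line plot in most plotting libraries, with `t` on the x axis and `group` as the color or series variable.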