better / convoys

Implementation of statistical models to analyze time lagged conversions
https://better.engineering/convoys/
MIT License
260 stars · 42 forks

Use for Real-Time Scoring #128

Closed sugarcrm-aorso closed 4 years ago

sugarcrm-aorso commented 4 years ago

I'm trying to do some modeling where I have a large time lag for conversion, and I am interested in getting updated single-observation likelihood-of-conversion predictions over the lifetime of an observation (at no specified interval, just when someone is interested and wants to look). Intuitively I'd expect the likelihood of conversion to be highest for the first couple of days/weeks; past a certain point it essentially isn't going to convert, it's just too old.

I was looking at Cox Proportional Hazards models when I came across Convoys and it seemed to address my problem more directly, though many of the examples involve groups and aggregate conversion rates. I know there are regression classes and I was playing with those:

from convoys import regression, utils

unit, groups, (G, B, T) = utils.get_arrays(
    survival_df, 
    created='date_input', 
    converted='conversion_date', 
    unit='days', 
    features=[i for i in features if i not in ['date_input', 'conversion_date']]
)
gamma_model = regression.GeneralizedGamma(flavor='linear', ci=True)
gamma_model.fit(G, B, T)
gamma_model.predict([1, 2, 3], 30, ci=True)

but I was curious whether I'm interpreting the output for real-time scoring correctly (i.e., an observation is scored at time t, and the result is the likelihood of conversion at that point, assuming the observation has not converted yet). Similarly, if my features are time-dependent (e.g., they may be null at creation, but I learn more about them over time), can that be factored in? After a more thorough reading of the docs, I've seen this mentioned as a future direction using RNNs; do you have any papers you can point me at?

Thank you in advance.

erikbern commented 4 years ago

Sorry for the slow answer.

I think essentially what you want to do is compute the probability of conversion _conditional on conversion not having happened at any t < t_0_.

This should be doable by just computing something like

large_t = 1000  # some number large enough that predict(x, large_t) is the "final" conversion rate

p = model.predict(x, t=t_0)      # probability of having converted by t_0
q = model.predict(x, t=large_t)  # "final" conversion rate

return 1 - (1 - q) / (1 - p)

Note that p < q, and that p converges from 0 towards q as t_0 gets larger, so the quantity 1 - (1 - q) / (1 - p) starts at q and drops towards 0.
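(Editor's note: spelling out the conditioning step the snippet relies on. Writing T for the conversion time, with T = ∞ for observations that never convert, p = P(T ≤ t_0), and q ≈ P(T < ∞) for the final conversion rate, one way to derive the same quantity is:)

```latex
% T = conversion time (T = \infty for observations that never convert)
% p = P(T \le t_0), \qquad q \approx P(T < \infty)
P(\text{converts eventually} \mid T > t_0)
  = \frac{P(t_0 < T < \infty)}{P(T > t_0)}
  = \frac{q - p}{1 - p}
  = 1 - \frac{1 - q}{1 - p}
```

At t_0 = 0 we have p = 0 and the expression equals q; as t_0 grows, p approaches q and the expression drops to 0, matching the behavior described above.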

I have a visual proof in my head for why this works, but it's a bit hard to share on GitHub. I think there's an elementary proof using probability theory, but I always struggle to get the notation right, so I'll skip it for now.
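(Editor's note: as a concrete sanity check, here is a minimal numerical sketch that does not depend on convoys. It substitutes a toy conversion curve F(t) = c·(1 − e^(−t/τ)) for `model.predict(x, t)`; the names `predict_toy`, `c`, and `tau` are invented for illustration, not part of the convoys API.)

```python
import math

def predict_toy(t, c=0.4, tau=7.0):
    # Toy stand-in for model.predict(x, t): cumulative probability of
    # having converted by time t, with final conversion rate c and time
    # scale tau (both made up for this sketch).
    return c * (1.0 - math.exp(-t / tau))

def conditional_conversion(t_0, large_t=1000.0):
    # 1 - (1 - q) / (1 - p): probability of eventually converting,
    # given that no conversion has happened by t_0.
    p = predict_toy(t_0)      # converted by t_0
    q = predict_toy(large_t)  # "final" conversion rate
    return 1.0 - (1.0 - q) / (1.0 - p)

print(conditional_conversion(0.0))   # starts at the final rate, 0.4
print(conditional_conversion(30.0))  # much smaller after a month
```

The output starts at q for a brand-new observation and decays toward 0 as the observation ages without converting, which is exactly the intuition in the original question.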

Hope this helps!

sugarcrm-aorso commented 4 years ago

Thanks for the response, Erik. I was able to get the result you mentioned by applying Bayes' theorem.