CamDavidsonPilon / lifetimes

Lifetime value in Python
MIT License
1.45k stars 374 forks source link

_customer_lifetime_value method broken #423

Closed sam-lupton closed 2 years ago

sam-lupton commented 2 years ago

The input type for the frequency for _customer_lifetime_value is given as an array_like. This isn't quite true, as frequency.index is used to instantiate a DataFrame in line one of the method here. This means only pandas Series/DFs can be used as the frequency.

Unfortunately, if you do use pandas Series as all of the arguments, you often get a pandas NotImplemented error due to a supposed mixing of DataFrame and Series types. I will link the traceback of one such example below, where a GammaGamma model was fitted and called using a fitted ParetoNBD model. Could this be fixed, or some workarounds provided?

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
/var/folders/z7/h1rmk7f11kg5yg15rzg3xk8h0000gn/T/ipykernel_35606/2994602167.py in <module>
     12         returning_customers_summary['monetary_value'])
     13 
---> 14 print(ggf.customer_lifetime_value(
     15     pnbd, #the model to use to predict the number of future transactions
     16     summary_with_money_value['frequency'],

~/opt/anaconda3/envs/ltv/lib/python3.9/site-packages/lifetimes/fitters/gamma_gamma_fitter.py in customer_lifetime_value(self, transaction_prediction_model, frequency, recency, T, monetary_value, time, discount_rate, freq)
    293         adjusted_monetary_value = self.conditional_expected_average_profit(frequency, monetary_value)
    294 
--> 295         return _customer_lifetime_value(
    296             transaction_prediction_model, frequency, recency, T, adjusted_monetary_value, time, discount_rate, freq=freq
    297         )

~/opt/anaconda3/envs/ltv/lib/python3.9/site-packages/lifetimes/utils.py in _customer_lifetime_value(transaction_prediction_model, frequency, recency, T, monetary_value, time, discount_rate, freq)
    495     for i in steps * factor:
    496         # since the prediction of number of transactions is cumulative, we have to subtract off the previous periods
--> 497         expected_number_of_transactions = transaction_prediction_model.predict(
    498             i, frequency, recency, T
    499         ) - transaction_prediction_model.predict(i - factor, frequency, recency, T)

~/opt/anaconda3/envs/ltv/lib/python3.9/site-packages/lifetimes/fitters/pareto_nbd_fitter.py in conditional_expected_number_of_purchases_up_to_time(self, t, frequency, recency, T)
    277         r, alpha, s, beta = params
    278 
--> 279         likelihood = self._conditional_log_likelihood(params, x, t_x, T)
    280         first_term = (
    281             gammaln(r + x) - gammaln(r) + r * log(alpha) + s * log(beta) - (r + x) * log(alpha + T) - s * log(beta + T)

~/opt/anaconda3/envs/ltv/lib/python3.9/site-packages/lifetimes/fitters/pareto_nbd_fitter.py in _conditional_log_likelihood(params, freq, rec, T)
    212 
    213         A_1 = gammaln(r + x) - gammaln(r) + r * log(alpha) + s * log(beta)
--> 214         log_A_0 = ParetoNBDFitter._log_A_0(params, x, rec, T)
    215 
    216         A_2 = logaddexp(-(r + x) * log(alpha + T) - s * log(beta + T), log(s) + log_A_0 - log(r_s_x))

~/opt/anaconda3/envs/ltv/lib/python3.9/site-packages/lifetimes/fitters/pareto_nbd_fitter.py in _log_A_0(params, freq, recency, age)
    179 
    180         rsf = r + s + freq
--> 181         p_1 = hyp2f1(rsf, t, rsf + 1.0, abs_alpha_beta / (max_of_alpha_beta + recency))
    182         q_1 = max_of_alpha_beta + recency
    183         p_2 = hyp2f1(rsf, t, rsf + 1.0, abs_alpha_beta / (max_of_alpha_beta + age))

~/opt/anaconda3/envs/ltv/lib/python3.9/site-packages/pandas/core/generic.py in __array_ufunc__(self, ufunc, method, *inputs, **kwargs)
   2030         self, ufunc: np.ufunc, method: str, *inputs: Any, **kwargs: Any
   2031     ):
-> 2032         return arraylike.array_ufunc(self, ufunc, method, *inputs, **kwargs)
   2033 
   2034     # ideally we would define this to avoid the getattr checks, but

~/opt/anaconda3/envs/ltv/lib/python3.9/site-packages/pandas/core/arraylike.py in array_ufunc(self, ufunc, method, *inputs, **kwargs)
    290             # well. Previously this raised an internal ValueError. We might
    291             # support it someday, so raise a NotImplementedError.
--> 292             raise NotImplementedError(
    293                 "Cannot apply ufunc {} to mixed DataFrame and Series "
    294                 "inputs.".format(ufunc)

NotImplementedError: Cannot apply ufunc <ufunc 'hyp2f1'> to mixed DataFrame and Series inputs.
nicknelson1021 commented 2 years ago

I am wondering the same exact thing. @sam-lupton, did you ever find a solution? Thank you!

sam-lupton commented 2 years ago

@nicknelson1021 I went with a fairly hacky solution of replacing that .index attribute on line 489 of utils.py with something that doesn't depend on pandas objects (e.g. range(len(frequency)), and then calling all subsequent methods in my actual script with .values for the arguments (e.g. summary['recency'].values), as the methods work fine when you use ndarrays rather than pandas objects.

I did this all in the local package file mind you, because I wasn't sure it's a good enough fix to contribute. Ideally you'd make all the functions work with both pandas objects and ndarrays!

On reflection though, the current version just randomly takes the index of the frequency argument (rather than one of the other 3 array-likes passed in), giving no indication to the user that that is the case, so anything would be better than that...

sam-lupton commented 2 years ago

Issue solved with this PR