bachmannpatrick / CLVTools

R-Package for estimating CLV

Hessian could not be derived. Setting all entries to NA. #132

Closed cvermehren closed 4 years ago

cvermehren commented 4 years ago

Thank you for this wonderful new package, which includes the much-needed possibility of accounting for covariates.

I have tried the function pnbd(), but it results in NA predictions. The warning messages are:

Starting estimation...
Estimation finished!
Warning: Estimation failed with NA coefs. The returned object contains results but further usage is restricted.
Warning: Hessian could not be derived. Setting all entries to NA.

I have used a real transaction dataset. From this, I filtered the data to include only a cohort defined as all customers starting during the first two months of a year.

I have compared with BTYDPlus using the same dataset and experienced no problems.

Code:

cht.clv <- cht[, list(
  Id = cust,
  Date = date,
  Price = sales
  )]

cht.clv <- clvdata(
  cht.clv,
  date.format = "ymd",
  time.unit = "w",
  estimation.split = 145
)

summary(cht.clv)
 CLV Transaction Data 

 Time unit         Weeks         
 Estimation length 145.0000 Weeks
 Holdout length    145.5714 Weeks

 Transaction Data Summary 
 Estimation       Holdout          Total       
 Number of customers                -                -                3427        
 First Transaction in period        2015-01-01       2017-10-13       2015-01-01  
 Last Transaction in period         2017-10-11       2020-07-28       2020-07-28  
 Total -- Transactions               7157             2546             9703        
 Mean -- Transactions per cust       2.088            3.017            2.831       
 (SD)                               2.269            3.440            3.795       
 Mean Spending per Transaction      1010.948         1273.318         1079.792    
 (SD)                               747.756          954.229          815.222     
 Total Spending                     7235357.279      3241867.913      10477225.191
 Total -- zero repeaters             2048             2583             1789        
 Percentage -- zero repeaters        0.598            0.754            0.522       
 Mean Interpurchase time            35.256           28.087           60.268      
 (SD) 

est <- pnbd(cht.clv)
predict(est)

 Predicting from 2017-10-13 until (incl.) 2020-07-28 (145.71 Weeks).
    Id      period.first period.last period.length actual.x actual.spending PAlive CET DERT predicted.Spending predicted.CLV
 1: ***     2017-10-13  2020-07-28      145.7143        1           359.2     NA  NA   NA           594.9644            NA
 2: ***     2017-10-13  2020-07-28      145.7143        0             0.0    NaN  NA  NaN          1022.0748           NaN
 3: ***     2017-10-13  2020-07-28      145.7143        0             0.0     NA  NA   NA          1468.8352            NA
 4: ***     2017-10-13  2020-07-28      145.7143        1          1829.6    NaN  NA  NaN          1022.0748           NaN
 5: ***     2017-10-13  2020-07-28      145.7143        0             0.0     NA  NA   NA          1070.2460            NA
-- ---                                                                                                                           
 3423: ***  2017-10-13  2020-07-28      145.7143        0             0.0     NA  NA   NA          1022.0748            NA
 3424: ***  2017-10-13  2020-07-28      145.7143        0             0.0    NaN  NA  NaN          1022.0748           NaN
 3425: ***  2017-10-13  2020-07-28      145.7143        0             0.0     NA  NA   NA          1022.0748            NA
 3426: ***  2017-10-13  2020-07-28      145.7143        0             0.0    NaN  NA  NaN          1022.0748           NaN
 3427: ***  2017-10-13  2020-07-28      145.7143        2          2238.4     NA  NA   NA           897.1967            NA
pschil commented 4 years ago

Hi

thanks for your feedback.

This issue is likely related to the optimization method used to maximize the likelihood function. By default, CLVTools currently uses the method L-BFGS-B, which aborts immediately and returns NA coefficients if the likelihood function yields non-finite values. Unfortunately, the log-likelihood function of the pnbd model requires the notorious hypergeometric 2F1 function which, depending on the data and parameters, can produce NaNs or Infs.

Hence, for a more stable optimization you likely want to use the method Nelder-Mead instead, which does not terminate when non-finite values are returned:

est <- pnbd(clv.data = cht.clv, optimx.args = list(method="Nelder-Mead"))

# or for more insights about the optimization
est <- pnbd(clv.data = cht.clv, optimx.args = list(method="Nelder-Mead", control=list(trace=6)))

See ?optimx for what other options are available and see also other examples in ?pnbd.

See also this previous issue about the coefficients being NA where some more details are provided: #112.

The warning therefore is not really about the Hessian that could not be derived but, more generally, about the estimation failing. In the next release, it will no longer be possible to predict() and plot() if there are NAs in the coefficients, because that is obviously rather confusing. Feel free to already work with the latest version on the development branch, as it also contains other (exciting!) new features such as fitting the Gamma-Gamma (gg()) spending model separately:

devtools::install_github(repo = "bachmannpatrick/CLVTools", ref = "development")

Note that it requires the GSL library in order to compile successfully from source.

Further, I would generally recommend also investigating the model fit with summary(est) and plot(est) before proceeding to predictions.

Another minor thing I just saw in your code:

cht.clv <- cht[, list(Id = cust, Date = date, Price = sales)]
cht.clv <- clvdata(cht.clv, date.format = "ymd", time.unit = "w", estimation.split = 145)

can be done with

cht.clv <- clvdata(cht.clv, date.format = "ymd", time.unit = "w", estimation.split = 145, name.id = "cust", name.date="date", name.price = "sales")

Patrik

cvermehren commented 4 years ago

Hi Patrik,

Thanks for the detailed and clear explanation – it makes a lot of sense!

I will do as you recommend, and I really look forward to following your work on this exciting package!

All the best Christian



cvermehren commented 4 years ago

Hi Patrik,

Here is some more feedback. I first ran the following code:

est <- pnbd(clv.data = cht.clv, optimx.args = list(method="Nelder-Mead"))
summary(est)

Result:

Pareto NBD Standard  Model 

Call:
pnbd(clv.data = cht.clv, optimx.args = list(method = "Nelder-Mead"))

Fitting period:                                
Estimation start  2015-01-01    
Estimation end    2017-10-12    
Estimation length 145.0000 Weeks

Coefficients:
       Estimate Std. Error z-val Pr(>|z|)
r        0.5329         NA    NA       NA
alpha   38.0598         NA    NA       NA
s       64.2792         NA    NA       NA
beta  7721.5264         NA    NA       NA

Optimization info:                  
LL     -16176.2972
AIC    32360.5945 
BIC    32384.1955 
KKT 1  NA         
KKT 2  NA         
fevals 597.0000   
Method Nelder-Mead

Used Options:                 
Correlation FALSE
Warning message:
For some parameters the standard error could not be calculated. 

Then this code: predict(est)

Result:

  Id period.first period.last period.length actual.x actual.spending    PAlive        CET DERT predicted.Spending
   1: ***   2017-10-13  2020-07-28      145.7143        1         359.200 0.2918841 0.34676760  Inf           591.6977
   2: ***   2017-10-13  2020-07-28      145.7143        0           0.000 0.2241285 0.05602038  Inf          1002.0443
   3: ***   2017-10-13  2020-07-28      145.7143        0           0.000 0.1439510 0.17330498  Inf          1459.1719
   4: ***   2017-10-13  2020-07-28      145.7143        0           0.000 0.5490419 1.98142399  Inf          1067.7495
   5: ***   2017-10-13  2020-07-28      145.7143        4        3103.074 0.8837950 1.86889894  Inf           725.0541

2694: ***   2017-10-13  2020-07-28      145.7143        0           0.000 0.2418898 0.06295234  Inf          1002.0443
2695: ***   2017-10-13  2020-07-28      145.7143        0           0.000 0.1181469 0.08844852  Inf           897.3543
2696: ***   2017-10-13  2020-07-28      145.7143        0           0.000 0.1980979 0.14830231  Inf           880.8387
2697: ***   2017-10-13  2020-07-28      145.7143        1         554.280 0.3549185 0.61237296  Inf           673.1570
2698: ***   2017-10-13  2020-07-28      145.7143        0           0.000 0.2418898 0.06295234  Inf          1002.0443
      predicted.CLV
   1:           Inf
   2:           Inf
   3:           Inf
   4:           Inf
   5:           Inf

2694:           Inf
2695:           Inf
2696:           Inf
2697:           Inf
2698:           Inf

What is puzzling is that it was possible to estimate PAlive, the average spending, and CET, but not CLV, which should simply be the multiplication of spending and CET, right?

Thanks Christian

cvermehren commented 4 years ago

For your information, here is a plot of est, plot(est):

[image: plot(est) output]

mmeierer commented 4 years ago

Thanks for sharing. The fit is not too bad for the standard Pareto/NBD model.

Looks like you could really benefit from controlling for seasonality ;-)

Be aware that estimation with time-varying covariates will take considerably more time than estimation without covariates. Executing the code in the walkthrough (https://www.clvtools.com/articles/CLVTools.html) gives you a good overview of this.

Are you able to share about which industry we are talking here?

cvermehren commented 4 years ago

Hi Patrik,

If you send me an email to cv@cantab.net, then I will share some info about the dataset.

Christian

pschil commented 4 years ago

> but not CLV which should be simply a multiplication of spending and CET, right?

Not quite. CLV is the multiplication of DERT (Discounted Expected Residual Transactions) and mean spending. More formally, you assume spending to be constant and can pull it out of the integral; the remaining integral (transactions times survival times discount rate, from T to infinity) is the DERT:

E(RLV) = E(Spending) * \int_T^{\infty} E[t(t)] * S(t | t > T) * d(t - T) dt

See also ?predict.clv.fitted for more information about the prediction output (it will be accessible through ?predict in the next version).
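The relationship between the prediction columns can be sketched with toy numbers (the values below are purely illustrative, not taken from the actual output):

```r
# predicted.CLV is DERT times predicted mean spending, not CET times spending.
pred <- data.frame(
  CET  = c(0.35, 0.06),                    # expected transactions in the holdout period
  DERT = c(0.12, 0.02),                    # discounted expected residual transactions
  predicted.Spending = c(591.70, 1002.04)  # predicted mean spending per transaction
)
pred$predicted.CLV <- pred$DERT * pred$predicted.Spending
pred$predicted.CLV
```

So if DERT is Inf, the product, and hence the CLV, is Inf as well, even though CET and spending are finite.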

From the summary() output, the beta coefficient seems very high at 7721, which leads to DERT = Inf, likely due to the hypergeometric 1F1 function involved in the pnbd model's DERT expression.

In case your customers come from different industries or business segments, you might want to split them up and fit the model separately on each segment in order to obtain more helpful parameter estimates. Also, removing customers that bought only once during the calibration period ("zero-repeaters") may yield better estimates.
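The zero-repeater filter can be sketched with data.table (a minimal sketch: the column names cust/date/sales follow the code earlier in this thread, and the toy data and cutoff date are purely illustrative):

```r
library(data.table)

# Toy transaction log standing in for 'cht' (hypothetical values).
cht <- data.table(
  cust  = c(1, 1, 2, 3, 3, 3),
  date  = as.Date(c("2015-02-01", "2016-05-01", "2015-03-01",
                    "2015-01-10", "2015-06-01", "2018-01-01")),
  sales = c(10, 20, 5, 7, 8, 9)
)

calib.end <- as.Date("2017-10-12")  # end of the 145-week estimation window

# Keep only customers with more than one transaction in the calibration period.
repeaters <- cht[date <= calib.end, .N, by = cust][N > 1, cust]
cht.rep   <- cht[cust %in% repeaters]
# cht.rep can then be passed to clvdata()/pnbd() as before.
```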

cvermehren commented 4 years ago

I also suspect that there are in fact different customer segments involved, but it isn't clear from the data.

Thanks for the general tips!