Hi, thanks for your feedback.
This issue is likely related to the optimization method used to maximize the likelihood function. By default, CLVTools currently uses the method L-BFGS-B, which will break, i.e. immediately stop and return NA coefficients, if non-finite values are returned from the likelihood function. Unfortunately, the log-likelihood function for pnbd requires the notorious hypergeometric 2F1 function which, depending on the data and parameters, will yield NaNs or Infs.
Hence, for a more stable optimization you likely want to use the method Nelder-Mead instead, which will not terminate if non-finite values are returned:
est <- pnbd(clv.data = cht.clv, optimx.args = list(method="Nelder-Mead"))
# or, for more insight into the optimization:
est <- pnbd(clv.data = cht.clv, optimx.args = list(method="Nelder-Mead", control=list(trace=6)))
See ?optimx for what other options are available, and see also the other examples in ?pnbd.
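Alternatively, if you want to stick with L-BFGS-B, supplying explicit start parameters can sometimes keep the optimizer away from regions where the likelihood turns non-finite. A minimal sketch, with arbitrary start values chosen purely for illustration:
est <- pnbd(clv.data = cht.clv, start.params.model = c(r = 1, alpha = 10, s = 1, beta = 10))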
See also this previous issue about coefficients being NA, where some more details are provided: #112.
The error therefore is not really about the hessian that could not be derived, but more generally about the estimation failing. In the next release, it will no longer be possible to predict() and plot() if there are NAs in the coefficients, because it is obviously rather confusing.
Feel free to already work with the latest version currently on the development branch, as it also contains other (exciting!) new features such as fitting the Gamma-Gamma (gg()) spending model separately:
devtools::install_github(repo = "bachmannpatrick/CLVTools", ref = "development")
Note that it requires the GSL library in order to compile successfully from source.
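In case the build fails because GSL cannot be found, install it system-wide first; for example (assuming the usual package names on these platforms):
# Debian/Ubuntu: sudo apt-get install libgsl-dev
# macOS (Homebrew): brew install gsl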
Further, I would in general recommend also investigating the model fit with summary(est) and plot(est) before proceeding with prediction.
Another minor thing I just saw in your code:
cht.clv <- cht[, list(Id = cust, Date = date, Price = sales)]
cht.clv <- clvdata(cht.clv, date.format = "ymd", time.unit = "w", estimation.split = 145)
can be done in one step, directly on cht, with
cht.clv <- clvdata(cht, date.format = "ymd", time.unit = "w", estimation.split = 145, name.id = "cust", name.date = "date", name.price = "sales")
Patrik
Hi Patrik,
Thanks for the detailed and clear explanation – it makes a lot of sense!
I will do as you recommend, and I really look forward to following your work on this exciting package!
All the best,
Christian
Hi Patrik,
Here is some more feedback. I first ran the following code:
est <- pnbd(clv.data = cht.clv, optimx.args = list(method="Nelder-Mead"))
summary(est)
Result:
Pareto NBD Standard Model
Call:
pnbd(clv.data = cht.clv, optimx.args = list(method = "Nelder-Mead"))
Fitting period:
Estimation start 2015-01-01
Estimation end 2017-10-12
Estimation length 145.0000 Weeks
Coefficients:
Estimate Std. Error z-val Pr(>|z|)
r 0.5329 NA NA NA
alpha 38.0598 NA NA NA
s 64.2792 NA NA NA
beta 7721.5264 NA NA NA
Optimization info:
LL -16176.2972
AIC 32360.5945
BIC 32384.1955
KKT 1 NA
KKT 2 NA
fevals 597.0000
Method Nelder-Mead
Used Options:
Correlation FALSE
Warning message:
For some parameters the standard error could not be calculated.
Then this code:
predict(est)
Result:
Id period.first period.last period.length actual.x actual.spending PAlive CET DERT predicted.Spending
1: *** 2017-10-13 2020-07-28 145.7143 1 359.200 0.2918841 0.34676760 Inf 591.6977
2: *** 2017-10-13 2020-07-28 145.7143 0 0.000 0.2241285 0.05602038 Inf 1002.0443
3: *** 2017-10-13 2020-07-28 145.7143 0 0.000 0.1439510 0.17330498 Inf 1459.1719
4: *** 2017-10-13 2020-07-28 145.7143 0 0.000 0.5490419 1.98142399 Inf 1067.7495
5: *** 2017-10-13 2020-07-28 145.7143 4 3103.074 0.8837950 1.86889894 Inf 725.0541
2694: *** 2017-10-13 2020-07-28 145.7143 0 0.000 0.2418898 0.06295234 Inf 1002.0443
2695: *** 2017-10-13 2020-07-28 145.7143 0 0.000 0.1181469 0.08844852 Inf 897.3543
2696: *** 2017-10-13 2020-07-28 145.7143 0 0.000 0.1980979 0.14830231 Inf 880.8387
2697: *** 2017-10-13 2020-07-28 145.7143 1 554.280 0.3549185 0.61237296 Inf 673.1570
2698: *** 2017-10-13 2020-07-28 145.7143 0 0.000 0.2418898 0.06295234 Inf 1002.0443
predicted.CLV
1: Inf
2: Inf
3: Inf
4: Inf
5: Inf
2694: Inf
2695: Inf
2696: Inf
2697: Inf
2698: Inf
What is puzzling is that it was possible to estimate PAlive as well as both the average spending and CET, but not CLV, which should simply be a multiplication of spending and CET, right?
Thanks,
Christian
For your information, here is a plot of est, i.e. plot(est):
Thanks for sharing. The fit is not too bad for the standard Pareto/NBD model.
Looks like you could really benefit from controlling for seasonality ;-)
Be aware that the estimation with time-varying covariates will take considerably more time than the estimation without covariates. Executing the code in the walkthrough (https://www.clvtools.com/articles/CLVTools.html) gives you a good overview of this.
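For reference, here is a rough sketch of how seasonality could enter as a dynamic covariate, loosely following the walkthrough. The weekly holiday-season dummy below is purely illustrative, and the covariate table must cover every customer and every week of both the estimation and the prediction period:
library(data.table)
# illustrative weekly dummy per customer; 300 weeks from the estimation start
# (2015-01-01, see summary(est)) to safely cover estimation plus prediction
dates <- seq(from = as.Date("2015-01-01"), by = "week", length.out = 300)
season <- CJ(Id = unique(cht$cust), Cov.Date = dates)
season[, high.season := as.numeric(month(Cov.Date) %in% c(11, 12))]
clv.cov <- SetDynamicCovariates(cht.clv,
                                data.cov.life = season, data.cov.trans = season,
                                names.cov.life = "high.season", names.cov.trans = "high.season",
                                name.id = "Id", name.date = "Cov.Date")
est.season <- pnbd(clv.cov, optimx.args = list(method = "Nelder-Mead"))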
Are you able to share which industry we are talking about here?
Hi Patrik,
If you send me an email to cv@cantab.net, then I will share some info about the dataset.
Christian
> but not CLV which should be simply a multiplication of spending and CET, right?
Not quite. CLV is the multiplication of DERT (Discounted Expected Residual Transactions) and mean spending. Or more formally, you assume spending to be constant and can pull it out of the integral; what remains is the DERT (= transactions × survival × discount rate, integrated from T to infinity):
$E(RLV) = E(\mathrm{Spending}) \cdot \int_T^{\infty} E[t(t)] \cdot S(t \mid t > T) \cdot d(t-T)\, dt$
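To see this relation directly in the prediction output, a minimal check, assuming pred is the data.table returned by predict(est) with the columns shown above:
pred <- predict(est)
# CLV is DERT scaled by the mean spending; with DERT = Inf the product,
# and hence predicted.CLV, is Inf as well
pred[, manual.CLV := DERT * predicted.Spending]
pred[, .(DERT, predicted.Spending, predicted.CLV, manual.CLV)]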
See also ?predict.clv.fitted for more information about the prediction output (it will be accessible through ?predict in the next version).
From the summary() output, the beta coefficient seems very high at 7721, which leads to DERT = Inf, likely due to the confluent hypergeometric 1F1 function involved in the pnbd's DERT expression.
In case your customers come from different industries or business segments, you might want to try to break them up and fit the model separately on each segment in order to obtain more helpful parameter estimates. Also, removing customers that only bought once during the calibration period ("zero-repeaters") may yield better estimates; see the sketch below.
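A minimal sketch of dropping zero-repeaters before building the clv.data object, assuming cht is a data.table with the cust/date/sales columns from your code (and that date is already a Date column), using the estimation end 2017-10-12 from your summary() output:
library(data.table)
cal.end <- as.Date("2017-10-12")  # estimation end, see summary(est)
# keep only customers with more than one transaction in the calibration period
repeaters <- cht[date <= cal.end, .N, by = cust][N > 1, cust]
cht.rep <- cht[cust %in% repeaters]
cht.clv.rep <- clvdata(cht.rep, date.format = "ymd", time.unit = "w",
                       estimation.split = 145,
                       name.id = "cust", name.date = "date", name.price = "sales")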
I also suspect that there are in fact different customer segments involved, but it isn't clear from the data.
Thanks for the general tips!
Thank you for this wonderful new package, which includes the much-needed possibility of accounting for covariates.
I have tried the function pnbd(), but it results in NA predictions. The warning message is:
I have used a real transaction dataset. From this, I filtered the data to include only a cohort defined as all customers starting during the first two months of a year.
I have compared with BTYDPlus using the same dataset and experienced no problems.
Code: