bachmannpatrick / CLVTools

R-Package for estimating CLV

Estimation failed with NA coefs error #112

Closed hermandr closed 4 years ago

hermandr commented 4 years ago

Hi,

Thank you for writing this package. I tried it with my data and it returns "Estimation failed with NA coefs":

summary(clv_tbs) :

CLV Transaction Data 

Time unit         Days          
Estimation length 1095.0000 Days
Holdout length    168.0000 Days 

Transaction Data Summary 
                                   Estimation      Holdout         Total     
Number of customers                -               -               9999      
First Transaction in period        2017-01-01      2020-01-02      2017-01-01
Last Transaction in period         2020-01-01      2020-06-18      2020-06-18
Total # Transactions               59357           3741            63098     
Mean # Transactions per cust       5.936           1.622           6.310     
(SD)                               5.365           1.267           5.867     
Mean Spending per Transaction      400675.173      405535.980      400963.364
(SD)                               902649.352      435747.429      881886.641
Total Spending                     2.38e+10        1.52e+09        2.53e+10  
Total # zero repeaters             110             7693            0         
Percentage # zero repeaters        0.011           0.769           0.000     
Mean Interpurchase time            173.826         47.226          183.307   
(SD)                               164.966         36.601          177.566   
est_pnbd <- pnbd(clv.data = clv_tbs)
Starting estimation...
Estimation finished!
Estimation failed with NA coefs. The returned object contains results but further usage is restricted.
Hessian could not be derived. Setting all entries to NA.

What can I do to get the correct results?

Herman

pschil commented 4 years ago

Hi Herman,

Without knowing much about your data and your use case, this likely happened because the optimization breaks. The default optimization method for pnbd without covariates is L-BFGS-B, which stops if NaN or Inf values are returned from the pnbd log-likelihood function during model fitting.

You could therefore try another optimization method by specifying it in the optimx.args parameter:

est_pnbd <- pnbd(clv.data = clv_tbs, optimx.args = list(method="Nelder-Mead"))

See ?optimx for what other options are available and see also other examples in ?pnbd.

Alternatively, you could try to use a different period definition such as weeks instead of days which may yield numerically more stable results.
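As a sketch of the second suggestion (the column names and the `sample_trx` object are hypothetical placeholders here; substitute your own), rebuilding the data object with weekly periods would look like:

```r
library(CLVTools)

# sketch: rebuild the clv.data object with weekly instead of daily periods
# (data frame and column names below are hypothetical placeholders)
clv_tbs_w <- clvdata(sample_trx,
                     date.format = "ymd",
                     time.unit   = "weeks",
                     estimation.split = 104,  # ~2 years expressed in weeks
                     name.id    = "customer_id",
                     name.date  = "date",
                     name.price = "sales")

est_pnbd_w <- pnbd(clv.data = clv_tbs_w)
```

Only `time.unit` and the scale of `estimation.split` change; the underlying transactions stay exactly the same.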

Does this help?

hermandr commented 4 years ago

Hi,

Thank you for responding.

Attached is the data file sample_trx.csv

sample_trx.zip

Below is the code I ran.

summary(sample_trx)

clv_tbs <- clvdata(sample_trx,  
                       date.format="ymd", 
                       time.unit = "days",
                       estimation.split = 365*2,
                       name.id = "customer_id",
                       name.date = "date",
                       name.price = "sales")

est_pnbd <- pnbd(clv.data = clv_tbs,
                 optimx.args = list(method="Rcgmin"))

All of the other optimization algorithms failed; only Rcgmin ran to completion.

Summary of the estimate is as follows

Pareto NBD Standard  Model 

Call:
pnbd(clv.data = clv_tbs, optimx.args = list(method = "Rcgmin"))

Fitting period:                               
Estimation start  2017-01-01   
Estimation end    2019-01-01   
Estimation length 730.0000 Days

Coefficients:
       Estimate Std. Error     z-val Pr(>|z|)    
r     1.000e+00  4.383e-11 2.282e+10   <2e-16 ***
alpha 1.000e+00  3.099e-07 3.227e+06   <2e-16 ***
s     1.000e+00  4.383e-11 2.282e+10   <2e-16 ***
beta  1.000e+00  3.099e-07 3.227e+06   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Optimization info:                   
LL     -286582.0346
AIC    573172.0692 
BIC    573200.9105 
KKT 1  FALSE       
KKT 2  FALSE       
fevals 1.0000      
Method Rcgmin      

Used Options:                 
Correlation FALSE

plot is as follows:

[tracking plot image]

Why is the estimate so bad?

Herman

mmeierer commented 4 years ago

Hi Herman,

thanks for your message.

In your output, both KKT1 and KKT2 are FALSE. This indicates that the model did not fit the data properly.

Looking at your data, I wonder whether you are fitting the model to a single customer cohort or to a random sample of the customer base.

I realize that the documentation at https://www.clvtools.com/articles/CLVTools.html does not currently provide many details on this. Please be aware that the standard approach for any probabilistic model is to apply it to a specific customer cohort, i.e. customers that were acquired in the same week/month/quarter. In the future, we will add more details on this to the walkthrough.

An important question you have to look into is how many new customers you acquire per week/month/quarter/etc. Obviously, this number will vary as your firm grows. Ensure that you have at least 500 new customers in each cohort; more is better.
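To check cohort sizes, a minimal sketch with data.table could look like this (the table and column names below are hypothetical stand-ins for your transaction data):

```r
library(data.table)

# hypothetical example data; in practice this is your transaction table
dt.trans <- data.table(
  customer_id = c(1, 1, 2, 3, 4),
  date = as.Date(c("2017-01-05", "2017-02-01",
                   "2017-01-20", "2017-02-10", "2017-02-15"))
)

# acquisition date = date of each customer's first transaction
dt.acq <- dt.trans[, .(acq.date = min(date)), by = "customer_id"]

# number of newly acquired customers per calendar month
dt.cohort.sizes <- dt.acq[, .(new.customers = .N),
                          by = .(acq.month = format(acq.date, "%Y-%m"))]
```

Months with fewer than roughly 500 new customers would be candidates for merging into a coarser cohort (e.g. a quarter).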

If you target different segments (B2C / B2B), it makes sense to model these customers separately.

On a related note, in your case it seems it would make sense to use "weeks" instead of "days" when creating the data object. Note that changing the argument to time.unit = "weeks" only affects the scale; it does not change or aggregate your data (see the walkthrough document for further details).

If you follow this advice, your results should look more like those provided here: https://www.clvtools.com/articles/CLVTools.html

Best, Markus

hermandr commented 4 years ago

Markus,

Thank you for responding. I am now applying it to a customer cohort of Jan 2017.

trx_train_test.zip

clv_tbs <- clvdata(trx_train_test,  
                       date.format="ymd", 
                       time.unit = "days",
                       estimation.split = 365*2,
                       name.id = "customer_id",
                       name.date = "date",
                       name.price = "sales")

est_pnbd <- pnbd(clv.data = clv_tbs,
                 optimx.args = list(method="Nelder-Mead"))

cat("--summary--\n")
summary(est_pnbd)
cat("--coef--\n")
coef(est_pnbd)
cat("--confint--\n")
confint(est_pnbd)
cat("--logLik--\n")
logLik(est_pnbd)
cat("--vcov--\n")
vcov(est_pnbd)

Output:

--summary--
Pareto NBD Standard  Model 

Call:
pnbd(clv.data = clv_tbs, optimx.args = list(method = "Nelder-Mead"))

Fitting period:                               
Estimation start  2017-01-01   
Estimation end    2019-01-01   
Estimation length 730.0000 Days

Coefficients:
       Estimate Std. Error  z-val Pr(>|z|)    
r     1.217e+00  2.281e-02  53.36   <2e-16 ***
alpha 2.009e+02  3.554e+00  56.52   <2e-16 ***
s     4.391e-02  5.622e-04  78.11   <2e-16 ***
beta  3.345e-01  1.884e-03 177.56   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Optimization info:                   
LL     -468959.3987
AIC    937926.7974 
BIC    937959.3512 
KKT 1  FALSE       
KKT 2  FALSE       
fevals 445.0000    
Method Nelder-Mead 

Used Options:                 
Correlation FALSE
--coef--
           r        alpha            s         beta 
  1.21699509 200.87259055   0.04391286   0.33452432 
--confint--
             2.5 %       97.5 %
r       1.17229163   1.26169854
alpha 193.90671726 207.83846384
s       0.04281095   0.04501478
beta    0.33083165   0.33821698
--logLik--
'log Lik.' -468959.4 (df=4)
--vcov--
                  r         alpha             s          beta
r      5.202188e-04  0.0711994293  9.064852e-06 -4.086495e-05
alpha  7.119943e-02 12.6315009285  7.987759e-04 -6.385095e-03
s      9.064852e-06  0.0007987759  3.160834e-07 -5.910235e-07
beta  -4.086495e-05 -0.0063850951 -5.910235e-07  3.549639e-06

[tracking plot image]

KKT1 and KKT2 are FALSE, but the plot looks reasonable. Every method other than "Nelder-Mead" either fails or returns invalid results.

What do I need to do with the transaction file to ensure that the other algorithms will work?

Herman

pschil commented 4 years ago

Although the comment was marked as resolved, I am leaving this answer as a note for future users who come across this issue.

"The other algos" here refers to the optimization method used to minimize the log-likelihood (LL) function. The LL of the pnbd model contains the hypergeometric function 2F1, which is notoriously difficult to compute. If the hypergeometric function cannot be calculated for your data at some parameter combination, the LL will return NA or Inf values. Unfortunately, not all optimization methods can handle NA/Inf being returned from the target function; some, notably BFGS, will stop immediately. You therefore cannot expect all optimization methods available in optimx to work with your data. You can use optimx.args = list(control = list(trace = 6)) to diagnose the optimization progress.
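As a sketch, reusing the clv_tbs object built earlier in this thread, the verbose optimizer output can be requested like this:

```r
library(CLVTools)

# sketch: rerun the fit with verbose optimizer output (trace level 6)
# to see where the optimization runs into NA/Inf likelihood values;
# clv_tbs is the clvdata object created earlier in this thread
est_pnbd <- pnbd(clv.data = clv_tbs,
                 optimx.args = list(method = "Nelder-Mead",
                                    control = list(trace = 6)))
```

The trace prints the parameter values and function values tried at each step, which shows exactly where a method such as BFGS would abort.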

Nelder-Mead can deal with non-finite return values and is therefore recommended in your case. We now plan to set Nelder-Mead as the default for all models; see #119.

As already pointed out previously by Markus and me, it would also be advisable to use weekly time.units instead of daily. Unless your customers often make multiple purchases per week, daily time.units are likely unnecessary and will result in large, unstable parameter estimates.

When using weekly (or yearly) time units, the data is internally still represented using the Date class, so you can still predict() up to a single day by passing a date string as prediction.end, e.g. predict(est_pnbd, prediction.end = "2019-06-15").

Using your single cohort sample data, the tracking plot with weekly time units also looks much better:

plot(pnbd(clvdata(dt.sample,
                  date.format = "ymd",
                  time.unit = "w",
                  estimation.split = "2018-12-31",
                  name.id = "customer_id",
                  name.date = "date",
                  name.price = "sales")))

(note also that the KKT conditions are now TRUE)

[tracking plot with weekly time units]

On a side note, "a customer cohort of Jan 2017" does not seem to hold for your data: dt.sample[, .(date.first.trans = min(date)), by = "customer_id"] shows that many customers have their first transaction after January 2017 (e.g. customer 19595629 on 2018-12-31).
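For reference, restricting the data to customers actually acquired in January 2017 could be done along these lines (a sketch using a small hypothetical table; the real dt.sample from this thread would take its place):

```r
library(data.table)

# small hypothetical transaction table standing in for dt.sample
dt.sample <- data.table(
  customer_id = c(1, 1, 2, 2, 3),
  date = as.Date(c("2017-01-05", "2017-06-01",
                   "2017-01-20", "2018-03-10",
                   "2018-12-31")),
  sales = c(10, 20, 30, 40, 50)
)

# first transaction per customer
dt.first <- dt.sample[, .(date.first.trans = min(date)), by = "customer_id"]

# keep only customers whose first transaction falls in January 2017
cohort.ids <- dt.first[date.first.trans >= as.Date("2017-01-01") &
                         date.first.trans <= as.Date("2017-01-31"),
                       customer_id]

# all transactions (including later ones) of the January 2017 cohort
dt.cohort <- dt.sample[customer_id %in% cohort.ids]
```

Note that the filter selects customers by their acquisition date but keeps all of their transactions, which is what the cohort-based models expect.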

hermandr commented 4 years ago

Thanks for the input