bachmannpatrick / CLVTools

R-Package for estimating CLV

"estimation split is too short" error #101

Closed Dennishi0925 closed 4 years ago

Dennishi0925 commented 4 years ago

Thank you all for this package, which is extremely useful for CRM analytics. I have an issue related to the apparelTrans data.

With the code below:

library(CLVTools)
library(tidyverse)
data("apparelTrans")
apparelTrans %>% as_tibble() %>% arrange(Id) %>%
  group_by(Id) %>% summarise(Date = min(Date)) %>%
  count(Date)

It shows that every customer's first transaction falls on '2005-01-03', so the estimation.split argument can be set easily. In real-life transaction data, however, customers make their first purchase on different dates, and I ran into an error when using my own data. The error message is attached below:

Error: The estimation split is too short! Not all customers of this cohort had their first actual transaction until the specified estimation.split!

Take this sample data for example:

Id    Date        Price
<chr> <date>      <dbl>
01    2005-01-03  27.0
01    2005-02-25  20.0
01    2005-03-01  25.0
02    2005-01-13  39.0
02    2005-02-25  93.0
02    2005-03-25  92.0
03    2005-03-13  29.0
03    2005-03-18  13.0
03    2005-03-25  11.0

Since I want to predict customer churn using 2005-01 ~ 2005-02 as the training set and 2005-03 as the validation set, I set estimation.split to '2005-03-01'. But this raises the error above, since customer 03's first transaction is in March.

One way to prevent this is simply to remove customers like 03. Is there a method in the package, or would you recommend a technique, to handle this situation? Thanks for your help.
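A minimal sketch of that removal step, using the sample data above and the tidyverse style from the code earlier in this question (the data frame and split date here just mirror the example table):

```r
library(tidyverse)

# Sample data mirroring the table above
trans <- tribble(
  ~Id,  ~Date,        ~Price,
  "01", "2005-01-03", 27.0,
  "01", "2005-02-25", 20.0,
  "01", "2005-03-01", 25.0,
  "02", "2005-01-13", 39.0,
  "02", "2005-02-25", 93.0,
  "02", "2005-03-25", 92.0,
  "03", "2005-03-13", 29.0,
  "03", "2005-03-18", 13.0,
  "03", "2005-03-25", 11.0
) %>% mutate(Date = as.Date(Date))

split.date <- as.Date("2005-03-01")

# Keep only customers whose first purchase happened before the split
trans.cohort <- trans %>%
  group_by(Id) %>%
  filter(min(Date) < split.date) %>%
  ungroup()
# Customer "03" (first purchase 2005-03-13) is dropped
```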

pschil commented 4 years ago

Hi Dennishi0925,

glad to hear you like our package.

We have implemented it this way on purpose. Customers who are not alive during the estimation period (i.e. who are not part of the cohort) should also not be present when calculating summary statistics, plotting transactions, etc. for the holdout period, because they do not belong to this cohort. The way to ensure this is to remove them from the data.

Choosing the estimation end is closely related to how you define your cohorts and to which cohort each customer belongs. The holdout transaction data serves as the "validation" set on which the model's performance is evaluated, for the cohort it was fit on! (@bachmannpatrick please correct me if I'm wrong here.) The error is therefore not related to the apparelTrans dataset but to how you define your cohorts; we have used the package successfully on many real-life datasets.

We could have internally and automatically removed all customers who are not alive during the estimation period. However, we believe there needs to be transparency about this (i.e. customers should not be "swallowed" without you noticing), so we leave it to the user to remove them. This way you stay aware of what is happening.

If you still want to make predictions for users who only come alive during the holdout period, you can supply them via the newdata parameter when calling predict. But be aware of the implications of how you set the estimation end in the object you provide as newdata.
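A hedged sketch of this newdata approach: `cohort.trans` and `late.trans` are hypothetical data frames (the first containing customers alive before the split, the second the late-joining customers), and the model choice (pnbd) is just one example; exact argument handling should be checked against the CLVTools documentation.

```r
library(CLVTools)

# Fit on the cohort that is alive during the estimation period
clv.cohort <- clvdata(data.transactions = cohort.trans,
                      date.format = "ymd", time.unit = "weeks",
                      estimation.split = "2005-03-01",
                      name.id = "Id", name.date = "Date", name.price = "Price")
est <- pnbd(clv.cohort)

# Predict for customers first seen during the holdout period
# by supplying them in a separate clvdata object via `newdata`
clv.late <- clvdata(data.transactions = late.trans,
                    date.format = "ymd", time.unit = "weeks",
                    name.id = "Id", name.date = "Date", name.price = "Price")
predict(est, newdata = clv.late, prediction.end = 10)
```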

Does that help you?

Also note that while probabilistic CLV models are a reasonable choice for analyzing aggregate cohort churn patterns, there are other, perhaps better-suited models for individual churn prediction and churn management. Search, for example, for this on Google Scholar for an overview of the different types of churn models and their application areas.

bachmannpatrick commented 4 years ago

In general, I would recommend the following procedure when using these types of models:

  1. Test the model fit and prediction accuracy using a holdout period for one or multiple customer cohorts (https://en.wikipedia.org/wiki/Cohort_analysis). All of the customers in these cohorts need to make their first purchase during the estimation period.
  2. If the model fit is fine and the prediction is accurate, extend the estimation period to the latest date available (estimation.split=NULL) and use all available data to predict future customer behavior (-> no holdout). To do so, customer cohorts are no longer required (i.e. you can run just one model over the entire dataset). However, we recommend performing a per-cohort analysis (i.e. you run a model for every cohort), as this usually improves predictive accuracy significantly.
  3. Test the in-sample fit for all models (e.g. using plot()).
  4. Predict.

A short remark on the duration of the estimation period: generally, I would use an estimation period that is at minimum as long as the average interpurchase time (longer is usually better). However, you will have to try different settings. Just make sure to test the fit before predicting.
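The average interpurchase time mentioned above can be estimated directly from the transaction log; a base-R sketch, assuming a data frame `trans` with Id and Date columns as in the sample data:

```r
# Mean gap (in days) between consecutive purchases, per customer
# (customers with a single purchase yield NaN and are dropped by na.rm)
ipt <- tapply(trans$Date, trans$Id,
              function(d) mean(as.numeric(diff(sort(d)))))

# Average interpurchase time across customers; the estimation
# period should be at least this long
mean(ipt, na.rm = TRUE)
```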

Dennishi0925 commented 4 years ago

Thank you @pschil and @bachmannpatrick for your help. Now I understand the setup of clvdata(), and that I should manually remove customers who purchase only in the holdout period. Thanks also to pschil for referencing the essay; I am looking into it.

My workflow now is to filter out customers who purchase only in the holdout period, then fit the model on cohorts pre-chosen according to my business target, and then validate the model with the holdout data. Once this checks out, I will use the model to predict future customer behavior.

I still have a question related to Patrick's second recommendation: "However, we recommend performing a per-cohort analysis (i.e. you run a model for every cohort), as this usually improves predictive accuracy significantly." I am not quite sure what "performing a per-cohort analysis" means.

I am a CRM analyst based in Taiwan. My work relates to customer churn, CLV, A/B testing, uplift modeling, segmentation, user interest tagging, etc. As far as I know, there are not many CRM-related or marketing-science libraries on CRAN or GitHub; BTYD and BTYDplus are among them but have not been updated recently. I would like to thank you again for developing this package.

bachmannpatrick commented 4 years ago

I still have a question related to Patrick's second recommendation: "However, we recommend performing a per-cohort analysis (i.e. you run a model for every cohort), as this usually improves predictive accuracy significantly." I am not quite sure what "performing a per-cohort analysis" means.

With "performing a per-cohort analysis" I refer to the approach where you build multiple customer cohorts (usually based on the first purchase) and then apply the model (estimation and prediction!) to every cohort separately. You will get different model parameters for the different cohorts, since customer purchase and dropout behavior often differs between cohorts. Generally, this leads to improved predictions of future customer activity.
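A sketch of what such a per-cohort analysis could look like: assign each customer to a cohort by the month of their first purchase, then estimate one model per cohort. The data frame `trans` (columns Id, Date, Price) and the choice of pnbd as the model are assumptions for illustration.

```r
library(CLVTools)
library(tidyverse)

# Assign each customer to a cohort by the month of their first purchase
cohorts <- trans %>%
  group_by(Id) %>%
  mutate(cohort = format(min(Date), "%Y-%m")) %>%
  ungroup()

# Estimate one model per cohort; the fitted parameters will
# differ between cohorts if purchase/dropout behavior differs
fits <- cohorts %>%
  group_split(cohort) %>%
  map(function(d) {
    clv <- clvdata(d, date.format = "ymd", time.unit = "weeks",
                   name.id = "Id", name.date = "Date", name.price = "Price")
    pnbd(clv)
  })
```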

I am a CRM analyst based in Taiwan. My work relates to customer churn, CLV, A/B testing, uplift modeling, segmentation, user interest tagging, etc. As far as I know, there are not many CRM-related or marketing-science libraries on CRAN or GitHub; BTYD and BTYDplus are among them but have not been updated recently. I would like to thank you again for developing this package.

Thanks! We are committed to continuously improving and extending CLVTools.