Closed Dennishi0925 closed 4 years ago
Hi Dennishi0925,
glad to hear you like our package.
We have implemented it this way on purpose. Customers who are not alive during the estimation period (i.e., they are not part of the cohort) should also not be present when calculating summary statistics, plotting transactions, etc. for the holdout period, because they do not belong to this cohort. The way to ensure this is to remove them from the data.
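A minimal base R sketch of that removal step, under stated assumptions: the data frame `trans`, its column names (`Id`, `Date`, `Price`), and the rule that a first purchase on or before the estimation end counts as belonging to the cohort are all hypothetical choices for illustration, not taken from the package.

```r
# Hypothetical transaction log; column names are assumptions for illustration.
trans <- data.frame(
  Id    = c("user01", "user01", "user02", "user02", "user03"),
  Date  = as.Date(c("2005-01-03", "2005-02-10", "2005-01-20",
                    "2005-03-05", "2005-03-12")),
  Price = c(20, 35, 15, 40, 25),
  stringsAsFactors = FALSE
)

estimation_end <- as.Date("2005-03-01")

# First transaction date per customer, repeated on every row of that customer.
# Dates are handled as day counts so the Date class cannot be lost on the way.
first_purchase <- ave(as.numeric(trans$Date), trans$Id, FUN = min)

# Assumed rule: a customer is in the cohort if the first purchase
# happens on or before the estimation end.
in_cohort <- first_purchase <= as.numeric(estimation_end)

# Report who is dropped, so nobody is removed without notice.
dropped_ids <- unique(trans$Id[!in_cohort])
message("Removed customers: ", paste(dropped_ids, collapse = ", "))

# Keep all transactions (estimation and holdout) of cohort members only.
trans_cohort <- trans[in_cohort, ]
```

The filtered `trans_cohort` still contains the holdout transactions of customers who joined during the estimation period, so it could then be passed to `clvdata()` with the same estimation split.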
Choosing the estimation end is closely related to how you define your cohorts and to which cohorts your customers belong. The holdout transaction data serves as the "validation" set on which the model's performance is evaluated, for the cohort it was fit on! (@bachmannpatrick please correct me if I'm wrong here)
The error is therefore also not related to the `apparelTrans` dataset but to how you define your cohorts; we have used the package successfully on many real-life datasets.
We could have internally and automatically removed all customers who are not alive in the estimation period. However, we believe that there needs to be transparency about this (i.e., the customers are not "swallowed" without noticing) and we leave it to the user to remove customers. In this way you are aware:
If you still want to make predictions for these users who only come alive during the holdout period, you could supply them in the parameter `newdata` when calling `predict`. But be aware of the implications of how you set the estimation end in the object you provide as `newdata`.
Does that help you?
Also note that while probabilistic CLV models are a reasonable choice to analyze aggregate cohort churn patterns, there are other, perhaps better suited models for individual churn prediction and churn management. See, for example, this Google Scholar search for an overview of different types of churn models and their application areas.
In general, I would recommend the following procedure when using these types of models:
A short remark on the duration of the estimation period: generally, I would use an estimation period that is at minimum as long as the average interpurchase time (longer is usually better). However, you will have to try different settings. Just make sure to test the fit before predicting.
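As a rough illustration of that rule of thumb, the average interpurchase time can be estimated directly from the transaction log. This is only a sketch: the data frame `trans` and its columns are hypothetical, and pooling all gaps across customers (rather than averaging per customer first) is an assumption.

```r
# Hypothetical transaction log; column names are assumptions for illustration.
trans <- data.frame(
  Id   = c("a", "a", "a", "b", "b", "c"),
  Date = as.Date(c("2005-01-03", "2005-01-17", "2005-02-14",
                   "2005-01-10", "2005-02-21", "2005-01-05")),
  stringsAsFactors = FALSE
)

# Sort so that consecutive rows of a customer are consecutive purchases.
trans <- trans[order(trans$Id, trans$Date), ]

# Gaps in days between consecutive purchases of the same customer.
# Customers with a single purchase contribute no gap.
gaps <- unlist(lapply(
  split(trans$Date, trans$Id),
  function(d) as.numeric(diff(d), units = "days")
))

# Pooled mean interpurchase time, and the implied lower bound in weeks.
mean_ipt_days        <- mean(gaps)
min_estimation_weeks <- ceiling(mean_ipt_days / 7)
```

The resulting value is only a lower bound to start from; the fit on the holdout period should still be checked for each setting tried.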
Thanks @pschil and @bachmannpatrick for your help. Now I understand the setting of `clvdata()` and that I should manually remove samples which purchase only in the holdout period. Thanks also to pschil for referencing the essay. I am looking into it.
My workflow is now to filter out samples who purchase only in the holdout period, then fit the model for cohorts pre-chosen according to my business target. Then I validate the model with the holdout data. Once I have checked this and am satisfied with it, I will use the model to predict future customer behavior.
I still have a question related to Patrick's second recommendation. "However, we recommend performing a per cohort analysis (i.e. you run a model for ever cohort), as this usually improves predictive accuracy significantly." I am not quite sure about the meaning of "performing a per cohort analysis".
I am a CRM analyst based in Taiwan, Asia. My work relates to customer churn, CLV, A/B testing, uplift modeling, segmentation, user interest tagging, etc. As far as I know, there are not many CRM-related or marketing science libraries on CRAN or GitHub. `BTYD` and `BTYDplus` are among them but have not been updated recently. I would like to thank you again for developing this package.
> I still have a question related to Patrick's second recommendation. "However, we recommend performing a per cohort analysis (i.e. you run a model for ever cohort), as this usually improves predictive accuracy significantly." I am not quite sure about the meaning of "performing a per cohort analysis".
With "performing a per cohort analysis" I refer to the approach, where you build multiple customer cohorts (usually based on first purchases) and then you apply the model (estimation and prediction!) for every cohort separately. You'll get different model parameters for the different cohorts since customer purchase and dropout behavior often differs between cohorts. Generally, this leads to an improved prediction of future customer activities.
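One way to build such cohorts in base R is sketched below. The data frame `trans`, its column names, and the choice of calendar month of first purchase as the cohort definition are assumptions for illustration; the cohort width should follow your own business logic.

```r
# Hypothetical transaction log; column names are assumptions for illustration.
trans <- data.frame(
  Id    = c("u1", "u1", "u2", "u3", "u3", "u4"),
  Date  = as.Date(c("2005-01-03", "2005-02-11", "2005-01-25",
                    "2005-02-02", "2005-03-15", "2005-02-20")),
  Price = c(10, 12, 30, 8, 22, 15),
  stringsAsFactors = FALSE
)

# First purchase per customer (as day count), repeated on each of their rows.
first_purchase <- ave(as.numeric(trans$Date), trans$Id, FUN = min)

# Assumed cohort definition: calendar month of the first purchase.
trans$Cohort <- format(as.Date(first_purchase, origin = "1970-01-01"), "%Y-%m")

# One data frame per cohort; each holds all transactions of its members.
cohorts <- split(trans, trans$Cohort)

# Each element would then be estimated and predicted on its own, e.g.
# by passing it to clvdata() with a cohort-specific estimation split
# and fitting a separate model, so every cohort gets its own parameters.
for (cohort_name in names(cohorts)) {
  n_customers <- length(unique(cohorts[[cohort_name]]$Id))
  message("Cohort ", cohort_name, ": ", n_customers, " customers")
}
```

Because every cohort starts at a different date, the estimation split has to be chosen per cohort as well, so that all members have made their first purchase before it.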
> I am a CRM analyst based in Taiwan, Asia. My work relates to customer churning, CLV, A/B testing, uplift modeling, segmentation, user interest tagging, etc. As I know there are not too many CRM-related or marketing science libraries on CRAN or Github. `BTYD` and `BTYDplus` are one of them but do not update recently. I would like to thank you again for developing this package.
Thanks! We are committed to continuously improving and extending `CLVTools`.
Thank you all for this package, which is extremely useful for CRM analytics. I have an issue related to the `apparelTrans` data. With the code below:
It shows that the first transaction of each customer starts from '2005-01-03', so the `estimation.split` argument can be set easily. However, in real-life transaction data, customers make their first transaction on different dates. I ran into an error when using my own data. The error message is:

Error: The estimation split is too short! Not all customers of this cohort had their first actual transaction until the specified estimation.split!
Take this sample data for example:
Since I want to predict customer churn using 2005-01 ~ 2005-02 as the training set and 2005-03 as the validation set, I set `estimation.split` to '2005-03-01'. But this raises the error, since the first transaction of `user03` is in March. One way to prevent this is simply to remove samples like `user03`. Is there any method, or would you recommend any technique, to prevent this situation? Thanks for your help.