bachmannpatrick / CLVTools

R-Package for estimating CLV

Sampling strategy for model fitting #182

Closed chenx2018 closed 2 years ago

chenx2018 commented 3 years ago

Hi everyone,

I am currently working on a multichannel apparel retailing dataset spanning 2019 to 2021. The data contains around 600 thousand customers and more than 15 million transactions. I chose a 1-year time span to build the cohort groups. Since the dataset is quite large, I want to do some sampling first and then fit models with time-(in)varying covariates. My goal is to select "representative" customers to reduce the data size and computation cost. With a "good" sample, the model should yield parameter estimates similar to those of the "full model" (fitted with all data).

Do you have any suggestions for a sampling strategy?

Thanks a lot!

mmeierer commented 2 years ago

Sounds like an interesting task. Did you get good results with a particular strategy in the meantime?

We have not yet come across this issue in our applications and thus have no recommendation other than random sampling with a reasonably high sample size per cohort. Some information on sample sizes for these models is given in https://link.springer.com/article/10.1007/s11573-021-01057-6.

Future versions of CLVTools will likely show very strong speed improvements. Thus, running the models on all observations should be a viable alternative for large datasets.

pschil commented 2 years ago

Let me briefly extend the previous answer by Markus, which gives important guidelines for sample sizes. As I understand it, you want to select a sub-sample and fit the extended Pareto/NBD with dynamic covariates on it instead of on the full data set. If you simply sample customers randomly, they will be representative on average, but you might be unlucky and draw a sample that, when you fit the model on it, gives you parameters completely different from the ones you would have obtained on the full data.

So you need to sample such that you end up with a group of customers which represents the original group well - for the purpose of model fitting! This means that the sample should at least be representative with respect to the CBS values, namely x (number of repeat transactions), t.x (time of the last transaction), and T (length of the observation period).

For the extended Pareto/NBD with dynamic covariates, you further want to sample based on the covariate data to ensure all parameters are representative. You therefore also need to assign every customer's covariate data some value based on which you can sample (unless, of course, your covariates are the same for all customers, say seasonality). I can think of two options: a similarity measure between the covariate data (some distance measure for multivariate time series), or an identifier that describes the covariate data, based on which you can group customers with identical covariates together (e.g. a hash value of the covariate data). If you have many different covariates, a similarity measure is likely better suited. If you only have a few different covariates, grouping customers by their covariate data with an identifier is probably best.
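For the identifier option, here is a minimal sketch using data.table (which CLVTools builds on). The input `dt.cov` and its columns `Id`, `Cov.Date`, and `channel` are hypothetical names for your covariate data, not anything CLVTools provides:

```r
library(data.table)

# Hypothetical input: one row per customer and period
# dt.cov: Id | Cov.Date | channel

# Collapse each customer's covariate path into a single key;
# customers with identical paths receive identical keys.
setorder(dt.cov, Id, Cov.Date)
dt.key <- dt.cov[, .(cov.key = paste(channel, collapse = "|")), by = "Id"]

# With many covariates, hash the pasted string instead
# (e.g. with digest::digest()) to keep the keys short.

# Size of each covariate group
dt.key[, .N, by = "cov.key"]
```

The group sizes tell you whether grouping by identical covariate paths is feasible at all; if nearly every customer ends up in their own group, a similarity measure is the better route.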

Given the CBS values (x, t.x, T) and the representation of the covariate data (if it is not all the same), you can then do some form of stratified sampling based on them.
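To make the stratified-sampling step concrete, a rough sketch with data.table, assuming `dt.cbs` is the CBS obtained as shown below (with columns `x`, `t.x`, `T.cal`). The tercile bins and the 10% sampling fraction are arbitrary illustration choices, not recommendations:

```r
library(data.table)

# Tercile breaks; unique() guards against duplicate quantiles
# (e.g. when many customers have x = 0). If a column is nearly
# constant, cut() may still fail and that column should be dropped.
qbreaks <- function(v) unique(quantile(v, probs = seq(0, 1, by = 1/3)))

dt.cbs[, x.bin  := cut(x,     breaks = qbreaks(x),     include.lowest = TRUE)]
dt.cbs[, tx.bin := cut(t.x,   breaks = qbreaks(t.x),   include.lowest = TRUE)]
dt.cbs[, T.bin  := cut(T.cal, breaks = qbreaks(T.cal), include.lowest = TRUE)]

# Draw 10% of customers (at least 1) from every stratum
set.seed(42)
dt.sampled <- dt.cbs[, .SD[sample(.N, size = max(1L, ceiling(0.1 * .N)))],
                     by = c("x.bin", "tx.bin", "T.bin")]
```

If you also have a covariate-group identifier, add it to the `by` argument so each stratum is homogeneous in both the CBS values and the covariate data.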

To get the CBS, either fit the standard model or use the internal function:

```r
# std model
p.res <- pnbd(clv.data)
dt.cbs <- p.res@cbs

# internal function
dt.cbs <- CLVTools:::pnbd_cbs(clv.data)
```

Then use your preferred package to sample from strata based on the x, t.x, and T values in dt.cbs and on the representation of the covariate data. Don't forget to check, in particular, the distribution of the sampled CBS data against the full-data CBS.
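For that distribution check, a rough sketch, assuming `dt.cbs` is the full CBS from above and `ids.sample` (a hypothetical name) holds the Ids of your sampled customers:

```r
# Restrict the full CBS to the sampled customers
dt.cbs.sample <- dt.cbs[Id %in% ids.sample]

# Compare the marginals numerically ...
summary(dt.cbs$x);   summary(dt.cbs.sample$x)
summary(dt.cbs$t.x); summary(dt.cbs.sample$t.x)

# ... and with a two-sample Kolmogorov-Smirnov test.
# x is discrete, so expect a ties warning; treat the p-value as indicative only.
ks.test(dt.cbs.sample$x,   dt.cbs$x)
ks.test(dt.cbs.sample$t.x, dt.cbs$t.x)
```

Quantile-quantile plots (`qqplot()`) of the sampled vs. full CBS columns are also a quick visual check.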

I suspect that, depending on your data, sampling based on x alone will often be good enough because a) it will often be highly correlated with t.x, and b) sampling based on T.cal will often also not be needed if the cohorting window is not too long compared to the length of the whole estimation period. In general, the parameters of the dropout process (s, beta) will likely vary more than those of the purchase process (r, alpha). But I have no idea how much of a difference this will make vs fitting the full model…