CamDavidsonPilon / lifetimes

Lifetime value in Python

Quantitative Metrics to assess model performance #438

Closed · SSMK-wq closed this issue 1 year ago

SSMK-wq commented 1 year ago

I went through the documentation of Lifetimes python library and found out that it has two ways to assess model performance. They are

a) By generating synthetic data points (and plotting a bar graph)
b) Calibration and holdout data (using a bar graph)

But both of the above approaches, as shown in the docs, only provide a visual assessment of the model. Instead, I would like to compute the error between the actual and predicted output.

Are there any metrics like accuracy, RMSE, Brier score, etc., or any other relevant metric for this problem, and how can we implement or use them here?

ColtAllen commented 1 year ago

Hey @SSMK-wq,

A traditional ML approach to model evaluation would be to predict frequencies with conditional_expected_number_of_purchases_up_to_time, then calculate RMSE and similar metrics against the observed frequencies in your customer data. However, this doesn't account for the other predictive methods like conditional_probability_alive, which are probabilistic in nature and require a different approach.
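
For illustration, a minimal sketch of that approach (assuming a `transactions` dataframe with `customer_id` and `date` columns; the column names and period-end dates are placeholders):

```python
import numpy as np
from lifetimes import BetaGeoFitter
from lifetimes.utils import calibration_and_holdout_data

# Split transactions into calibration and holdout periods (dates are illustrative).
summary = calibration_and_holdout_data(
    transactions, "customer_id", "date",
    calibration_period_end="2021-06-30",
    observation_period_end="2021-12-31",
)

# Fit on the calibration period only.
bgf = BetaGeoFitter()
bgf.fit(summary["frequency_cal"], summary["recency_cal"], summary["T_cal"])

# Predict purchases over each customer's holdout window and score against
# the observed holdout frequencies.
predicted = bgf.conditional_expected_number_of_purchases_up_to_time(
    summary["duration_holdout"], summary["frequency_cal"],
    summary["recency_cal"], summary["T_cal"],
)
rmse = np.sqrt(np.mean((summary["frequency_holdout"] - predicted) ** 2))
print(f"Holdout RMSE: {rmse:.3f}")
```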

a) By generating synthetic data points (and plotting a bar graph)

This is the general idea of a Bayesian posterior predictive check, and is the most appropriate way to evaluate these models. One of the metrics for this is the Bayesian p-value:

https://python.arviz.org/en/latest/api/generated/arviz.plot_bpv.html

The above plot will be added in a future btyd release.
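
As a rough illustration of the arviz call (assuming you already have an InferenceData object `idata` with posterior_predictive and observed_data groups from an MCMC fit; lifetimes itself does not produce this object):

```python
import arviz as az
import matplotlib.pyplot as plt

# Bayesian p-value plot from a posterior predictive check.
az.plot_bpv(idata, kind="p_value")
plt.show()
```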

SSMK-wq commented 1 year ago

@ColtAllen Sorry, two quick questions on model performance.

a) What if the model does well when compared with the synthetic data distribution but performs poorly on the holdout dataset? How should I interpret this scenario?

b) In the case of poor model performance, is the only hyperparameter we can tune the penalizer coefficient? Is there any other way to improve model performance (other than the data)?

ColtAllen commented 1 year ago

a) What if the model does well when compared with the synthetic data distribution but performs poorly on the holdout dataset? How should I interpret this scenario?

This could indicate overfitting if working with smaller datasets, but more than likely suggests the training/calibration and test/holdout datasets are not identically distributed, which will prevent effective model evaluation. If this is the case, further exploratory data analysis is recommended to determine the causes. If changing the start/end dates of the dataset and/or filtering out spurious customers and/or transactions does not resolve the issue, then it may be best to avoid using calibration_and_holdout_data and just train with the full dataset instead.
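
An illustrative sanity check for this (assuming `summary` comes from calibration_and_holdout_data) is to compare purchase behaviour across the two windows; since the windows differ in length, compare rates rather than raw counts:

```python
# Rough check that calibration and holdout behaviour look alike.
eligible = summary[summary["T_cal"] > 0]  # avoid dividing by a zero tenure
cal_rate = eligible["frequency_cal"] / eligible["T_cal"]
holdout_rate = eligible["frequency_holdout"] / eligible["duration_holdout"]
print(cal_rate.describe())
print(holdout_rate.describe())
```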

b) In the case of poor model performance, is the only hyperparameter we can tune the penalizer coefficient? Is there any other way to improve model performance (other than the data)?

I don't like using the penalizer coefficient for these models because it applies L2 parameter shrinkage similar to that of a ridge regression. This is fine when working with linear models, but these models are far from linear. The models in btyd are fitted via MCMC instead of MLE and have initial hyperpriors that can be tuned, but this should only be needed for smaller datasets. If the dataset is sufficiently large (I'd estimate more than 10,000 customers), the data will overwhelm the initial hyperprior values.
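
For reference, if you do want to experiment with the penalizer in lifetimes, it is set when constructing the fitter (the 0.001 value below is just a placeholder, not a recommendation):

```python
from lifetimes import BetaGeoFitter

# penalizer_coef applies L2-style shrinkage to the fitted parameters.
bgf = BetaGeoFitter(penalizer_coef=0.001)
bgf.fit(summary["frequency_cal"], summary["recency_cal"], summary["T_cal"])
print(bgf.summary)  # fitted r, alpha, a, b and their standard errors
```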

However, regardless of the model you're working with, valid customer data is essential, and due to the statistical assumptions of these models, customers should only be filtered out by T (time since the first transaction) and/or days since last visit (T - recency).
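
A hypothetical example of filtering only on those two quantities (the 90- and 365-day cutoffs are arbitrary; `summary` is assumed to be an RFM summary frame with frequency, recency, and T columns):

```python
# Keep customers observed for at least 90 days, and drop those whose last
# purchase was more than 365 days ago; both cutoffs are arbitrary examples.
tenure_ok = summary["T"] >= 90
recently_active = (summary["T"] - summary["recency"]) <= 365
filtered = summary[tenure_ok & recently_active]
```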

SSMK-wq commented 1 year ago

@ColtAllen - Based on your suggestion above, if we are to avoid calibration_and_holdout_data and train our model (to estimate parameters) on the full dataset, how do we validate the model? We can compute metrics like RMSE, but with no train/test split, do you think that would be okay? How else do we measure performance or know that the model will hold up on points other than the training data (which is the full dataset)? So, if you were in a similar situation where you avoid calibration_and_holdout_data, how would you validate your model and convince the business that the predictions are working? If it comes down to the average difference between actual and predicted values, should we just stick to traditional regression metrics such as R2?

ColtAllen commented 1 year ago

if we are to avoid calibration_and_holdout_data and train our model (to estimate parameters) on the full dataset, how do we validate the model?

Please refer to my first post in this issue.