Our model evaluations currently look at models independently. We should also compare models. Here are some of the things to look at:
Correlation matrices between model predictions
Jaccard similarity and rank-order correlations between models' predictions (see the pairwise-comparison sketch after this list)
The webapp should display model accuracy for simple comparison, e.g. sortable by accuracy
Cluster models by prediction similarity (see the clustering sketch after this list)
Predict model performance from model characteristics/configurations, using features such as model type (random forest, logistic regression), time-window size, time period, and hyperparameters. That can help uncover which configuration choices drive performance (see the meta-model sketch after this list)
The webapp should show how stable/unstable model performance is over time.
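A rough sketch of the pairwise prediction comparisons above (correlation matrices, rank-order correlation, top-k Jaccard). It assumes predictions are collected into a `scores` DataFrame with one row per entity and one column of predicted scores per model; that layout and the `top_k` choice are assumptions, not our actual schema.

```python
import itertools

import numpy as np
import pandas as pd


def pairwise_comparisons(scores: pd.DataFrame, top_k: int = 100) -> dict:
    """Pearson/Spearman correlation matrices and top-k Jaccard similarity between models."""
    pearson = scores.corr(method="pearson")      # linear correlation of scores
    spearman = scores.corr(method="spearman")    # rank-order correlation

    # Jaccard similarity of the sets of entities each model ranks in its top k.
    models = scores.columns
    jaccard = pd.DataFrame(np.eye(len(models)), index=models, columns=models)
    top_sets = {m: set(scores[m].nlargest(top_k).index) for m in models}
    for a, b in itertools.combinations(models, 2):
        sim = len(top_sets[a] & top_sets[b]) / len(top_sets[a] | top_sets[b])
        jaccard.loc[a, b] = jaccard.loc[b, a] = sim

    return {"pearson": pearson, "spearman": spearman, "jaccard_top_k": jaccard}
```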
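Clustering could reuse that correlation matrix. A sketch using hierarchical clustering on 1 - correlation as the distance; the cluster count is arbitrary here.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def cluster_models(corr: pd.DataFrame, n_clusters: int = 3) -> pd.Series:
    """Group models whose prediction scores are highly correlated."""
    dist = 1.0 - corr.to_numpy()        # correlation -> distance (0 = identical rankings)
    np.fill_diagonal(dist, 0.0)
    condensed = squareform(dist, checks=False)
    tree = linkage(condensed, method="average")
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    return pd.Series(labels, index=corr.index, name="cluster")


# e.g. cluster_models(pairwise_comparisons(scores)["spearman"])
```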
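For predicting performance from configurations, a hypothetical meta-model sketch: fit a regressor on one row per trained model and inspect feature importances. The file name and column names (model_type, window_days, train_period, max_depth, precision_at_100) are made up for illustration, not our actual schema.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical file: one row per trained model, configuration plus its evaluation metric.
configs = pd.read_csv("model_configs_with_metrics.csv")
X = configs[["model_type", "window_days", "train_period", "max_depth"]]
y = configs["precision_at_100"]

meta_model = Pipeline([
    ("encode", ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), ["model_type", "train_period"])],
        remainder="passthrough")),
    ("regress", RandomForestRegressor(n_estimators=200, random_state=0)),
])
meta_model.fit(X, y)

# Feature importances hint at which configuration choices drive performance.
feature_names = meta_model.named_steps["encode"].get_feature_names_out()
importances = pd.Series(
    meta_model.named_steps["regress"].feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```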
Within-model evaluation:
This is an absolute measure: plot precision, recall, ROC AUC, etc. over time
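A sketch of those over-time plots, assuming evaluation results live in a long-format table with columns model_id, as_of_date, precision, recall, roc_auc (the file name is hypothetical).

```python
import matplotlib.pyplot as plt
import pandas as pd

evals = pd.read_csv("model_evaluations.csv", parse_dates=["as_of_date"])

fig, axes = plt.subplots(3, 1, sharex=True, figsize=(8, 9))
for metric, ax in zip(["precision", "recall", "roc_auc"], axes):
    for model_id, grp in evals.groupby("model_id"):
        ax.plot(grp["as_of_date"], grp[metric], label=model_id, alpha=0.7)
    ax.set_ylabel(metric)
axes[0].legend(loc="best", fontsize="small")
axes[-1].set_xlabel("evaluation date")
plt.tight_layout()
plt.show()
```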
Between-model evaluation:
Plot rank-order correlation of model rankings from one period to the next (i.e., do the same models consistently appear at the top?), and maybe Jaccard similarity of the top-k model sets
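A sketch of period-to-period rank stability, using the same hypothetical evaluation table as above and precision as the ranking metric.

```python
import pandas as pd
from scipy.stats import spearmanr

evals = pd.read_csv("model_evaluations.csv", parse_dates=["as_of_date"])

# Wide table: rows = evaluation dates, columns = models, values = the ranking metric.
ranks = evals.pivot(index="as_of_date", columns="model_id", values="precision")

k = 5
rows = []
dates = ranks.index.sort_values()
for prev, curr in zip(dates[:-1], dates[1:]):
    rho, _ = spearmanr(ranks.loc[prev], ranks.loc[curr])   # rank stability vs. previous period
    top_prev = set(ranks.loc[prev].nlargest(k).index)
    top_curr = set(ranks.loc[curr].nlargest(k).index)
    jaccard = len(top_prev & top_curr) / len(top_prev | top_curr)
    rows.append({"period": curr, "spearman_vs_prev": rho, f"jaccard_top_{k}": jaccard})

stability = pd.DataFrame(rows)
print(stability)
```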