Set up infrastructure for exporting results

emilycantrell commented 3 weeks ago

Tommy & Emily discussed this on 2024-08-23.

To do:

[x] Figure out what information we want to export, and in what format (CSV? json?)
[ ] Set up the code to do this export (Emily will draft suggested code for this; then Tommy can adjust it to work in the actual environment)

Proposed columns to include in the export file (tentative; this needs more discussion) (these are just notes about ideas, not intended as a spec of actual column names):

training sample size
selection sample size
eval sample size
indicator for whether tuning/selection data was separate from evaluation data
training & selection sample composition (i.e., demographic subgroup composition)
eval sample composition (i.e., demographic subgroup composition)
model (e.g., catboost/xgboost)
hyperparameter_1 (this will be a specific hyperparameter name, depending on the model)
hyperparameter_2
... hyperparameter_n
feature set (which files do we use as inputs)
threshold for turning probabilities into classes
F1
precision
accuracy
recall
AUC
R2_holdout
data for calibration plots? (need to think about how to export this without exporting individual-level data or data with fewer than 10 people per cell)

emilycantrell commented 2 weeks ago

This afternoon, Tommy and I met and discussed what results we want to export, and how to format the exported file.

What results to export

We want to export the following information:

For all Model-HyperparameterDraw-TrainingSample-FeatureSets that we tested, export scores that can be calculated using predicted probabilities (logloss, AUC, R^2, squared loss), for every selection set and test set specified in the jobfile. (Note: "draw" refers to a draw during randomized grid search)
For the winning HyperparameterDraw within each Model-TrainingSample-FeatureSet (where the winner is chosen based on a specified selection set):
- export scores that are calculated based on classifications (F1, precision, accuracy, recall), for every selection set and test set specified in the jobfile. (This is in addition to logloss, AUC, and R^2.)
- export the classification threshold
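
A minimal sketch of the winner selection described above, written in Python purely for illustration. The thread does not say which score picks the winner or how the threshold is chosen, so this assumes the lowest logloss on the selection set and the F1-maximizing threshold from a candidate grid. All field names (`selection_logloss`, `draw`, and so on) and the data are hypothetical.

```python
def pick_winners(rows, score_key="selection_logloss"):
    """Return the lowest-loss draw for each (model, training_sample, feature_set)."""
    winners = {}
    for row in rows:
        key = (row["model"], row["training_sample"], row["feature_set"])
        if key not in winners or row[score_key] < winners[key][score_key]:
            winners[key] = row
    return winners


def f1_at_threshold(y_true, p_pred, threshold):
    """F1 score after turning probabilities into classes at the given threshold."""
    tp = sum(1 for y, p in zip(y_true, p_pred) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, p_pred) if p >= threshold and y == 0)
    fn = sum(1 for y, p in zip(y_true, p_pred) if p < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true, p_pred, candidates):
    """Assumed criterion: the candidate threshold with the highest F1."""
    return max(candidates, key=lambda t: f1_at_threshold(y_true, p_pred, t))


# Fake data, for illustration only.
rows = [
    {"model": "catboost", "training_sample": "full", "feature_set": "A",
     "draw": 1, "selection_logloss": 0.62},
    {"model": "catboost", "training_sample": "full", "feature_set": "A",
     "draw": 2, "selection_logloss": 0.58},
    {"model": "xgboost", "training_sample": "full", "feature_set": "A",
     "draw": 1, "selection_logloss": 0.60},
]
winners = pick_winners(rows)

y_sel = [0, 0, 1, 1]
p_sel = [0.1, 0.4, 0.6, 0.9]
threshold = best_threshold(y_sel, p_sel, candidates=[0.3, 0.5, 0.7])
```

Only the winning draws would go through the threshold search, which is where the computation-time saving comes from.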

Notes:

The reason we are exporting results from every hyperparameter draw we test, rather than just the winning draw, is that Tommy is interested in studying how various "reasonable models" perform on different demographic subgroups. He wants to see a variety of models, not just a single "best" model. (@HanzhangRen I'd like to discuss what you mean by "reasonable models" more.)
The reason we are exporting classification-based scores for only the winning HyperparameterDraw within each Model-TrainingSample-FeatureSet, rather than for all Model-HyperparameterDraw-TrainingSample-FeatureSets, is that we don't think it's worth the computation time to determine the best thresholds for the predictions from every hyperparameter draw. (However, @HanzhangRen, on further thought I'm wondering: are you sure you don't want classification-based scores for all hyperparameter draws?)
We haven't yet touched on if/how to export calibration plot data, but that is also something we need to decide.

Clearly we also need to figure out better terminology :) The word "pipeline" might be useful here, but for now I am trying to be as explicit as possible about what I'm referring to.

How to format the export file

We also discussed how to format the export file. I've attached a draft of how the export file might be formatted. This is far from final, but we can use it as a starting place for discussion tomorrow.

draft_export_format.xlsx

We are inclined to use CSV as the export format for two reasons: (1) in R, it's easiest to work with rectangular data, and (2) the CBS export guidelines are based on the number of "cells" that will be exported, so they are probably most familiar with rectangular exports. However, we are open to using JSON or some alternative format if there is a compelling reason.

Meeting tomorrow

@HanzhangRen @msalganik @jcperdomo Figuring out what to export and how to format it has been more challenging than I expected. When we have our Stork Oracle meeting tomorrow, I'd like to start by discussing WHAT we want to export. Once we finalize that decision, we can return to HOW we want to export it.

emilycantrell commented 2 weeks ago

Discussed 2024-08-29 (Tommy & Emily meeting):

In the export file, we also want a column for feature importance (maybe just for winning pipelines; this will be especially interesting when we fit the models on different subpopulations)

emilycantrell commented 2 weeks ago

Discussed 2024-08-29 (Stork Oracle meeting):

Updated list of things that we want to have in the results:

For all Model-HyperparameterDraw-TrainingSample-FeatureSets that we tested, export scores that can be calculated using predicted probabilities (logloss, AUC, R^2, squared loss), for every selection set and test set specified in the jobfile. (Note: "draw" refers to a draw during randomized grid search)
For the winning HyperparameterDraw within each Model-TrainingSample-FeatureSet (where the winner is chosen based on a specified selection set):
- export scores that are calculated based on classifications (F1, precision, accuracy, recall), for every selection set and test set specified in the jobfile. (This is in addition to logloss, AUC, and R^2.)
- export the classification threshold
- Confidence intervals for all scores, calculated by bootstrap: resample the predictions when calculating each score, so that the uncertainty reflects who is in the evaluation sample (the training set stays fixed). Do the bootstrapping on the predicted probabilities, before setting the threshold.
- Data for a calibration plot ~(one column for each decile)~
Maybe: a scalar metric of calibration
Maybe: to assist CBS in sample size questions when they do the export, add a column specifying the sample size of the set for which the score is calculated.

See: https://www.tidyverse.org/blog/2022/11/model-calibration/
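
A minimal sketch of the bootstrap described above, assuming a percentile interval and resampling of (outcome, predicted probability) pairs from the evaluation set, with the model and its predictions held fixed. The function names, the number of replicates, and the data are hypothetical, and squared loss is used only as an example score.

```python
import random


def squared_loss(y_true, p_pred):
    """Mean squared difference between outcomes and predicted probabilities."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)


def bootstrap_ci(y_true, p_pred, score_fn, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for score_fn over the evaluation sample."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        # Resample evaluation rows with replacement; the model is not refit.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(score_fn([y_true[i] for i in idx],
                               [p_pred[i] for i in idx]))
    scores.sort()
    lower = scores[int((alpha / 2) * n_boot)]
    upper = scores[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Fake evaluation data, for illustration only.
y_eval = [0, 1, 1, 0, 1, 0, 1, 1, 0, 0]
p_eval = [0.2, 0.7, 0.9, 0.4, 0.6, 0.1, 0.8, 0.5, 0.3, 0.2]
ci = bootstrap_ci(y_eval, p_eval, squared_loss)
```

A classification-based score can be passed as `score_fn` by applying the threshold inside the function. Whether the threshold is re-chosen within each replicate is left open here.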

emilycantrell commented 2 weeks ago

During the meeting we talked about exporting calibration plot data as follows: for each decile of probabilities (or whatever percentile/width we choose), create a column that contains the fraction of positives. (Separately, we would also have a column with a scalar metric of calibration.)
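
A minimal sketch of that calculation, assuming ten equal-width probability bins (quantile-based bins would work the same way) and suppressing any bin with fewer than 10 people, in line with the cell-size concern raised earlier in this issue. The function name and output field names are hypothetical, and the data is fake.

```python
def calibration_table(y_true, p_pred, n_bins=10, min_cell=10):
    """Fraction of positives per probability bin, with small bins suppressed."""
    counts = [0] * n_bins
    positives = [0] * n_bins
    for y, p in zip(y_true, p_pred):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        counts[b] += 1
        positives[b] += y
    table = []
    for b in range(n_bins):
        large_enough = counts[b] >= min_cell
        table.append({
            "bin_lower": b / n_bins,
            "bin_upper": (b + 1) / n_bins,
            # Both the count and the fraction are withheld for small bins.
            "n": counts[b] if large_enough else None,
            "fraction_positive": positives[b] / counts[b] if large_enough else None,
        })
    return table


# Fake data: 20 people in the lowest bin (5 positives), 3 in the highest.
y_fake = [1] * 5 + [0] * 15 + [1, 1, 0]
p_fake = [0.05] * 20 + [0.95] * 3
table = calibration_table(y_fake, p_fake)
```

Each bin is one row here; the same values could be spread across columns instead.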

After the meeting, Tommy pointed out that based on the current file structure, we would not be adding additional columns, we would be adding additional rows. This would make a dataframe that is already very long into a dataframe that is much, much longer.

We will think about whether to put the results file in wide format instead of long. Alternatively, the calibration data can be in a separate file rather than in the main results file.

msalganik commented 2 weeks ago

@varunsatish given our conversation this afternoon, you might want to read this before the meeting on Friday. Very related to what we were talking about.

emilycantrell commented 2 weeks ago

Proposal to switch scores storage from long to wide

I propose that rather than storing the scores in long format, we store them in wide format. This will substantially reduce the number of cells in the output file. For example, v1 and v2 (attached) contain identical amounts of information, but v1 has 250 cells, and v2 has 110 cells. The proportional gap between them will get even larger when we include all the scores we want to export, and especially if we add calibration data using the same format as the overall scores.

draft_export_format_v1.xlsx draft_export_format_v2_with_same_data_as_v1_to_compare_cell_counts.xlsx (data in the drafts is fake)
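
A minimal sketch of why wide format needs fewer cells, using fake data that is not taken from the attached drafts. The column names (`run_id`, `eval_set`, `metric`, `value`) and the helper functions are hypothetical, and header cells are not counted.

```python
def long_to_wide(long_rows, id_cols, metric_col="metric", value_col="value"):
    """Pivot long rows into one row per id combination, one column per metric."""
    wide = {}
    for row in long_rows:
        key = tuple(row[c] for c in id_cols)
        record = wide.setdefault(key, {c: row[c] for c in id_cols})
        record[row[metric_col]] = row[value_col]
    return list(wide.values())


def count_cells(rows):
    """Number of data cells (rows times columns), ignoring the header."""
    return sum(len(r) for r in rows)


# Fake long-format scores: 2 runs x 3 metrics = 6 rows of 4 columns each.
long_rows = [
    {"run_id": run_id, "eval_set": "test", "metric": metric, "value": value}
    for run_id, scores in [(1, {"logloss": 0.58, "auc": 0.71, "r2": 0.12}),
                           (2, {"logloss": 0.60, "auc": 0.69, "r2": 0.10})]
    for metric, value in scores.items()
]
wide_rows = long_to_wide(long_rows, id_cols=["run_id", "eval_set"])

long_cells = count_cells(long_rows)   # 6 rows x 4 columns
wide_cells = count_cells(wide_rows)   # 2 rows x 5 columns
```

In long format the identifying columns are repeated once per metric, so the gap grows as more metrics are added.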

Thoughts?

These drafts don't contain all the edits discussed today. I'll post a fully updated draft before we meet tomorrow.

msalganik commented 2 weeks ago

Interesting @emilycantrell. At first I had trouble noticing the difference between the two but then I figured it out. It seems like the general principle you are following is trying to reduce the amount of redundancy. V2 is shorter because it does not repeat the same information many times.

I wonder if you could take this even further by making two tables and then joining them together. For example, you could have one table that stores run_ids and data about each run (the stuff from the job file). Then you could have a results table that includes the run_ids and the results. Then you could export them and merge outside the RA.
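
A minimal sketch of the two-table idea, assuming `run_id` is the shared key. The table contents and field names other than `run_id` are hypothetical and the values are fake.

```python
# Table 1: one row per run, holding the job file settings.
runs = [
    {"run_id": 1, "model": "catboost", "feature_set": "A"},
    {"run_id": 2, "model": "xgboost", "feature_set": "A"},
]

# Table 2: one row per run and evaluation set, holding only the scores.
results = [
    {"run_id": 1, "eval_set": "selection", "auc": 0.72},
    {"run_id": 1, "eval_set": "test", "auc": 0.71},
    {"run_id": 2, "eval_set": "selection", "auc": 0.70},
    {"run_id": 2, "eval_set": "test", "auc": 0.69},
]


def join_on_run_id(runs, results):
    """Attach each run's settings to its result rows (done after export)."""
    runs_by_id = {r["run_id"]: r for r in runs}
    return [{**runs_by_id[res["run_id"]], **res} for res in results]


merged = join_on_run_id(runs, results)
```

The run settings are stored once per run rather than once per result row, and the merge happens outside the RA.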

That said, I'm not sure how much reducing the size of the output actually matters.

Here's something from Mark "We will have to reformat this into an excel and ideally keep it below 1,000 cells across less than 8 tables. Then it will qualify as a light output. Is it possible to reformat this output in such a format and share? Then I can share that with CBS. They asked for an example wrt bulk"

msalganik commented 2 weeks ago

@varunsatish I think this comment from Mark from Slack is important.
"We will have to reformat this into an excel and ideally keep it below 1,000 cells across less than 8 tables. Then it will qualify as a light output. Is it possible to reformat this output in such a format and share? Then I can share that with CBS. They asked for an example wrt bulk" Do you think we can get the Cruijff results out with less than 1,000 cells and less than 8 tables?

varunsatish commented 2 weeks ago

@msalganik Yes. The untidy version I showed you yesterday was at about 1020 cells.

emilycantrell commented 2 weeks ago

Discussed 2024-08-30 (EC VS HR MS):

We will store results files on OneDrive, using a descriptive naming scheme that includes the date and the purpose of the run.
