Oufattole opened 2 months ago
Fixing issue #9 would be helpful here
The main assumption is that this subgroup evaluation is being performed on a supervised task. We assume a list of task dataframes is defined following the MEDS v0.3.3 label schema, with subgroups defined in the `categorical_value` column, and with `subject_id` and `prediction_time` values that match the task dataframe for the supervised training task. We do this so that users can define more flexible subgroup labels for analyzing model performance.
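For concreteness, a minimal sketch of such a subgroup dataframe (the values and file name below are placeholders, and polars is just one way to build it), where `categorical_value` carries the subgroup label instead of a task label:

```python
import datetime

import polars as pl

# Hypothetical subgroup dataframe following the MEDS label schema layout:
# one row per (subject_id, prediction_time) from the supervised task, with
# subgroup membership stored in categorical_value rather than a task label.
subgroups = pl.DataFrame(
    {
        "subject_id": [1, 2, 3],
        "prediction_time": [
            datetime.datetime(2024, 1, 1),
            datetime.datetime(2024, 1, 2),
            datetime.datetime(2024, 1, 3),
        ],
        "categorical_value": ["female", "male", "female"],
    }
)
subgroups.write_parquet("gender_subgroups.parquet")
```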
Users can define an `eval_groups_fp` arg that points to each of these subgroup dataframes, and performance metrics such as AUC and accuracy will be computed within these subgroups.
We can then add a torchmetrics class for this subgroup metric generation so that users can tune fairness metrics in hparam sweeps or simply evaluate them during training. See this diff for a demo of the plan.
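As a rough sketch of what such a class could look like (names here are placeholders, not the code in the diff, and a production version would need proper `add_state` handling for distributed sync):

```python
import torch
from torchmetrics.classification import BinaryAUROC


class SubgroupAUROC(torch.nn.Module):
    """Sketch: keeps one BinaryAUROC per subgroup and updates it on masked predictions."""

    def __init__(self, subgroup_names: list[str]):
        super().__init__()
        self.subgroup_names = subgroup_names
        # ModuleDict so the per-subgroup metrics move to the right device with the model.
        self.aurocs = torch.nn.ModuleDict({name: BinaryAUROC() for name in subgroup_names})

    def update(self, preds: torch.Tensor, labels: torch.Tensor, group_ids: torch.Tensor) -> None:
        # preds are probabilities or logits; group_ids holds each sample's index into subgroup_names.
        for idx, name in enumerate(self.subgroup_names):
            mask = group_ids == idx
            if mask.any():
                self.aurocs[name].update(preds[mask], labels[mask])

    def compute(self) -> dict[str, torch.Tensor]:
        return {f"auc/{name}": metric.compute() for name, metric in self.aurocs.items()}
```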
Do you have any thoughts on this, @mmcdermott?
We are also considering, instead of using torchmetrics, just storing the model predictions and having a separate evaluation phase.
Looping in @Jeanselme
But to answer your question, @Oufattole: I think we should keep training and evaluation separate. See https://github.com/kamilest/meds-evaluation/tree/main for the existing evaluation package. In addition, @Jeanselme has been thinking about fairness evaluations in that context.
Oh I see @mmcdermott, so ideally we store the predictions in some schema, maybe the MEDS label schema, and then have some external evaluation package that can do this subgroup analysis. And for now we would just store the `subject_id`, `prediction_time`, and prediction logits in the `float_value` column? Additionally, maybe embeddings should be an additional column of type `array[float64]` so users can easily access those?
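For concreteness, a minimal sketch of what such a predictions file could look like, assuming polars (the `embedding` column name and the file path are just placeholders for the idea above):

```python
import datetime

import polars as pl

# Sketch of a predictions dataframe reusing the label schema columns:
# logits stored in float_value, plus a proposed extra column for embeddings
# (shown as a list of floats; a fixed-width array[float64] dtype could also be used).
predictions = pl.DataFrame(
    {
        "subject_id": [1, 2],
        "prediction_time": [datetime.datetime(2024, 1, 1), datetime.datetime(2024, 1, 2)],
        "float_value": [0.73, -1.20],  # model logits
        "embedding": [[0.10, 0.25, -0.31], [0.42, -0.07, 0.19]],
    }
)
predictions.write_parquet("predictions.parquet")
```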
Is the vision that in the evaluation we just use ACES again? For example, you could take the same ACES task schema that generated the labels used for training, and then replace the label (which is the binary classification task label we train on) with a categorical (or binary) subgroup label indicating membership in some subgroup you want to evaluate performance on.
I see meds-evaluation already has a solid schema; I made a pull request to communicate this in the README for the repo: https://github.com/kamilest/meds-evaluation/pull/8
Currently, if you want to train and evaluate a model, you run the following script:
And you get results in `$(meds-torch-latest-dir path=${TRAIN_DIR})/results_summary.parquet` that include `test/auc` among other metrics on the full dataset. It would be useful to have a fairness evaluation where a user can define a subgroup of patients to store these metrics for. Ideally, a user can add the kwarg `eval_groups=CODE//NAME` and we evaluate metrics (like AUC) specific to that group, rather than for everyone in the dataset. This kwarg can be added by modifying the `eval.yaml`. To start, we can assume `CODE//NAME` is a static feature, such as the code `GENDER` in MIMIC-IV, which would be stored in the `static_df`. So I think we need to add some multiclass labels to the batch in the `pytorch_dataset` class based on this. I think we just need to update the `test_step` function in the SupervisedModule here like so:
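Roughly something like the following sketch (batch keys, the module API, and the subgroup handling are placeholders here, not the actual meds-torch code):

```python
import torch
from torchmetrics.classification import BinaryAUROC


# Hypothetical sketch only; the real SupervisedModule, batch keys, and logging
# calls in meds-torch may differ. Assumes the batch carries a "subgroup" integer
# tensor produced by the pytorch_dataset class from the static GENDER code.
class SupervisedModuleSketch(torch.nn.Module):
    def __init__(self, model: torch.nn.Module, subgroup_names: list[str]):
        super().__init__()
        self.model = model
        self.subgroup_names = subgroup_names
        self.overall_auc = BinaryAUROC()
        self.subgroup_auc = torch.nn.ModuleDict({n: BinaryAUROC() for n in subgroup_names})

    def test_step(self, batch: dict, batch_idx: int) -> None:
        logits = self.model(batch["inputs"])     # model predictions
        labels = batch["labels"].int()           # supervised task labels
        groups = batch["subgroup"]               # subgroup index added in pytorch_dataset
        self.overall_auc.update(logits, labels)  # existing full-dataset metric
        for idx, name in enumerate(self.subgroup_names):
            mask = groups == idx
            if mask.any():                       # new: per-subgroup metric updates
                self.subgroup_auc[name].update(logits[mask], labels[mask])
```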