Oufattole opened 2 months ago
Fixing issue #9 would be helpful here
The main assumption is that this subgroup evaluation is being performed on a supervised task. We assume a list of task dataframes is defined following the MEDS v0.3.3 label schema, with subgroups defined in the `categorical_value` column, and with `subject_id` and `prediction_time` values that match the task dataframe for the supervised training task. We do this so that users can define more flexible subgroup labels for analyzing model performance.
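For concreteness, a minimal sketch of such a subgroup dataframe (the values and file name below are placeholders, and polars is just one way to build it), where `categorical_value` carries the subgroup label instead of a task label:

```python
import datetime

import polars as pl

# Hypothetical subgroup dataframe following the MEDS label schema layout:
# one row per (subject_id, prediction_time) from the supervised task, with
# subgroup membership stored in categorical_value rather than a task label.
subgroups = pl.DataFrame(
    {
        "subject_id": [1, 2, 3],
        "prediction_time": [
            datetime.datetime(2024, 1, 1),
            datetime.datetime(2024, 1, 2),
            datetime.datetime(2024, 1, 3),
        ],
        "categorical_value": ["female", "male", "female"],
    }
)
subgroups.write_parquet("gender_subgroups.parquet")
```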
Users can define an `eval_groups_fp` arg that points to each of these subgroup dataframes, and performance metrics such as AUC and accuracy will be computed within these subgroups.
We can then add a torchmetrics class for this subgroup metric generation so that users can tune fairness metrics in hparam sweeps or simply evaluate them during training. See this diff for a demo of the plan.
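As a rough sketch of what such a class could look like (names here are placeholders, not the code in the diff, and a production version would need proper `add_state` handling for distributed sync):

```python
import torch
from torchmetrics.classification import BinaryAUROC


class SubgroupAUROC(torch.nn.Module):
    """Sketch: keeps one BinaryAUROC per subgroup and updates it on masked predictions."""

    def __init__(self, subgroup_names: list[str]):
        super().__init__()
        self.subgroup_names = subgroup_names
        # ModuleDict so the per-subgroup metrics move to the right device with the model.
        self.aurocs = torch.nn.ModuleDict({name: BinaryAUROC() for name in subgroup_names})

    def update(self, preds: torch.Tensor, labels: torch.Tensor, group_ids: torch.Tensor) -> None:
        # preds are probabilities or logits; group_ids holds each sample's index into subgroup_names.
        for idx, name in enumerate(self.subgroup_names):
            mask = group_ids == idx
            if mask.any():
                self.aurocs[name].update(preds[mask], labels[mask])

    def compute(self) -> dict[str, torch.Tensor]:
        return {f"auc/{name}": metric.compute() for name, metric in self.aurocs.items()}
```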
Do you have any thoughts on this, @mmcdermott?
We are also considering, instead of using torchmetrics, just storing the model predictions and having a separate evaluation phase.
Looping in @Jeanselme
But to answer your question, @Oufattole: I think we should keep training and evaluation separate. See https://github.com/kamilest/meds-evaluation/tree/main for the existing evaluation package. In addition, @Jeanselme has been thinking about fairness evaluations in that context.
Oh I see @mmcdermott, so ideally we store the predictions in some schema, maybe the MEDS label schema, and then have some external evaluation package that can do this subgroup analysis. And for now we would just store the `subject_id`, `prediction_time`, and prediction logits in the `float_value` column? Additionally, maybe embeddings should be an additional column of type `array[float64]` so users can easily access those?
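For concreteness, a minimal sketch of what such a predictions file could look like, assuming polars (the `embedding` column name and the file path are just placeholders for the idea above):

```python
import datetime

import polars as pl

# Sketch of a predictions dataframe reusing the label schema columns:
# logits stored in float_value, plus a proposed extra column for embeddings
# (shown as a list of floats; a fixed-width array[float64] dtype could also be used).
predictions = pl.DataFrame(
    {
        "subject_id": [1, 2],
        "prediction_time": [datetime.datetime(2024, 1, 1), datetime.datetime(2024, 1, 2)],
        "float_value": [0.73, -1.20],  # model logits
        "embedding": [[0.10, 0.25, -0.31], [0.42, -0.07, 0.19]],
    }
)
predictions.write_parquet("predictions.parquet")
```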
Is the vision that in the evaluation we just use ACES again? For example, you could take the same ACES task schema that generated the labels used for training, and then replace the label (which is the binary classification task label we train on) with a categorical (or binary) subgroup label indicating membership in some subgroup you want to evaluate performance on.
I see meds-evaluation already has a solid schema; I made a pull request to communicate this in the README for the repo: https://github.com/kamilest/meds-evaluation/pull/8
Currently, if you want to train and evaluate a model, you run the following script:
And you get results in `$(meds-torch-latest-dir path=${TRAIN_DIR})/results_summary.parquet` that include `test/auc` among other metrics on the full dataset. It would be useful to have a fairness evaluation where a user can define a subgroup of patients to store these metrics for. Ideally, a user can add the kwarg `eval_groups=CODE//NAME` and we evaluate metrics (like AUC) specific to that group, rather than for everyone in the dataset. This kwarg can be added by modifying the `eval.yaml`. To start, we can assume `CODE//NAME` is a static feature, such as the code `GENDER` in MIMIC-IV, which would be stored in the `static_df`. So I think we need to add some multiclass labels to the batch in the `pytorch_dataset` class based on this. I think we just need to update the `test_step` function in the SupervisedModule here like so:
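Roughly something like the following sketch (batch keys, the module API, and the subgroup handling are placeholders here, not the actual meds-torch code):

```python
import torch
from torchmetrics.classification import BinaryAUROC


# Hypothetical sketch only; the real SupervisedModule, batch keys, and logging
# calls in meds-torch may differ. Assumes the batch carries a "subgroup" integer
# tensor produced by the pytorch_dataset class from the static GENDER code.
class SupervisedModuleSketch(torch.nn.Module):
    def __init__(self, model: torch.nn.Module, subgroup_names: list[str]):
        super().__init__()
        self.model = model
        self.subgroup_names = subgroup_names
        self.overall_auc = BinaryAUROC()
        self.subgroup_auc = torch.nn.ModuleDict({n: BinaryAUROC() for n in subgroup_names})

    def test_step(self, batch: dict, batch_idx: int) -> None:
        logits = self.model(batch["inputs"])     # model predictions
        labels = batch["labels"].int()           # supervised task labels
        groups = batch["subgroup"]               # subgroup index added in pytorch_dataset
        self.overall_auc.update(logits, labels)  # existing full-dataset metric
        for idx, name in enumerate(self.subgroup_names):
            mask = groups == idx
            if mask.any():                       # new: per-subgroup metric updates
                self.subgroup_auc[name].update(logits[mask], labels[mask])
```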