Closed by MartinBernstorff 1 year ago
Tests | Skipped | Failures | Errors | Time |
---|---|---|---|---|
47 | 1 :zzz: | 0 :x: | 0 :fire: | 52.029s :stopwatch: |
Do our models automatically adjust the threshold? Where is this logged? But yes, I feel like it is a decent solution to just support one metric per function for now and keep metric-specific functions, even though this means duplicating shared functionality across functions.
The threshold is something we have to decide on manually - it's based on clinical utility on a case-by-case basis, since the costs of false positives and false negatives are dramatically different between projects.
Yes of course, makes sense. But where do we currently specify this? Or do we just try out thresholds post-training? Because all e.g. sensitivity functions/plots need to be given a threshold and currently they're just using a default 0.5 threshold. :)
Ah, I see! We only specify it when calling the functions. Currently leaning towards doing all model eval post-training except ROC-AUC (#452).
For most plots, we want to use the same threshold, so there's definitely an argument for making it a config option somewhere when we do post-training model eval.
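If we do make the threshold a config option, one minimal way to sketch it (the `EvalConfig` class, `positive_threshold` field, and `binarise` helper here are hypothetical names for illustration, not actual repo code):

```python
from dataclasses import dataclass

import numpy as np
from sklearn.metrics import recall_score, roc_auc_score


@dataclass
class EvalConfig:
    # Default of 0.5 mirrors what the plots currently assume; in practice this
    # would be set per project based on clinical utility.
    positive_threshold: float = 0.5


def binarise(y_hat_probs: np.ndarray, cfg: EvalConfig) -> np.ndarray:
    """Convert predicted probabilities to hard labels using the shared threshold."""
    return (y_hat_probs >= cfg.positive_threshold).astype(int)


y_true = np.array([0, 0, 1, 1])
y_hat_probs = np.array([0.1, 0.4, 0.35, 0.8])
cfg = EvalConfig(positive_threshold=0.3)

# ROC-AUC is threshold-free; sensitivity (recall) uses the configured threshold.
auc = roc_auc_score(y_true, y_hat_probs)
sensitivity = recall_score(y_true, binarise(y_hat_probs, cfg))
```

Every threshold-dependent plot would then read the same `positive_threshold` from the config instead of hard-coding 0.5.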
Agree with Jakob's point about the threshold. Maybe it could make sense to enable multiple metrics for the plot function, but limit it to e.g. the three most used? Otherwise, it's fine with us to create plots per metric for now and then create additional plots when we need them.
Excellent, thanks! I'll pare it down to the simplest possible version for now, and then we can add complexity when we need it.
!! Disregard the commits for now. This is more of a discussion of where to go from here. !!
We have a fundamental problem here: a bunch of our functions support any arbitrary metric function, but that means the input arguments interact with one another. E.g. `roc_auc_score` needs `y_hat_probs`, whereas sensitivity needs `y_hat_probs` binarised at a given threshold (`y_hat_int`).

This means that a function like `plot_performance_by_sex` is super hard to write. Essentially we'd need a function that maps each `metric_fn` to its inputs. I can see that I got started on that a while ago with `metric_fn_to_input`, but it hasn't been used in the repo.

That makes the interface a pain to work with! I don't think this complexity is worth it to support any arbitrary metric function. Instead, I suggest we only support one metric per function for now, to make life simpler.
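For context on why the arbitrary-metric route gets messy, here's a rough sketch of what a `metric_fn_to_input`-style dispatch would have to do (the mapping and the `score` helper are illustrative, not the actual repo code):

```python
from typing import Callable

import numpy as np
from sklearn.metrics import recall_score, roc_auc_score


def _threshold(y_hat_probs: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Binarise probabilities at a fixed threshold (hypothetical default of 0.5)."""
    return (y_hat_probs >= threshold).astype(int)


# Each metric function must be paired with a transform producing the input
# representation it expects - this is the coupling described above.
metric_fn_to_input: dict[Callable, Callable] = {
    roc_auc_score: lambda probs: probs,  # wants raw probabilities
    recall_score: _threshold,            # wants thresholded hard labels
}


def score(metric_fn: Callable, y_true: np.ndarray, y_hat_probs: np.ndarray) -> float:
    """Dispatch to the right input representation before scoring."""
    return metric_fn(y_true, metric_fn_to_input[metric_fn](y_hat_probs))


y_true = np.array([0, 0, 1, 1])
y_hat_probs = np.array([0.1, 0.4, 0.35, 0.8])
auc = score(roc_auc_score, y_true, y_hat_probs)
sens = score(recall_score, y_true, y_hat_probs)
```

Every new metric means another entry in the mapping, plus threading threshold arguments through every plotting function - which is exactly the complexity the one-metric-per-function approach avoids.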
@bokajgd, @sarakolding, @HLasse, @KennethEnevoldsen - any opinions on this?
Follows on #447.