What is the purpose of this PR?

This PR closes #45 and adds functionality to automatically generate validation-dataset performance metric plots and tables, as well as visualizations of the worst n predictions. A future PR will extend this to the case of multiple validation datasets.
How did you implement your changes
A heatmap visualization of performance metrics, split by marker or cell_type, is added as `heatmap_plot` in `plot_utils.py`. The resulting plots look like this:
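For a rough idea of the shape of this function, here is a minimal sketch; the actual signature of `heatmap_plot` may differ, and the long-format `metric`/`value` columns and the seaborn styling are assumptions on my part:

```python
# Minimal sketch, assuming a long-format DataFrame with one row per
# (marker/cell_type, metric) pair; column names are illustrative.
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def heatmap_plot(metrics_df: pd.DataFrame, split_by: str = "marker"):
    """Plot a heatmap of performance metrics per marker or cell_type."""
    # Pivot to a (split_by x metric) matrix, e.g. rows = markers,
    # columns = precision / recall / specificity / f1_score.
    pivot = metrics_df.pivot(index=split_by, columns="metric", values="value")
    fig, ax = plt.subplots(figsize=(8, 0.5 * len(pivot) + 2))
    sns.heatmap(pivot, annot=True, fmt=".2f", vmin=0, vmax=1, cmap="viridis", ax=ax)
    ax.set_title(f"Performance metrics per {split_by}")
    fig.tight_layout()
    return fig
```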
Precision, recall, specificity, and f1-score vs. threshold are plotted together via `plot_metrics_against_threshold`, instead of as a facet plot as described in the design doc (#45). The resulting plots look like this:
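A minimal sketch of the single-figure approach (one line per metric rather than one facet per metric); the `threshold` column and the metric column names are assumptions:

```python
# Minimal sketch; DataFrame columns are assumed, not the actual layout.
import matplotlib.pyplot as plt
import pandas as pd


def plot_metrics_against_threshold(
    metrics_df: pd.DataFrame,
    metrics: tuple = ("precision", "recall", "specificity", "f1_score"),
):
    """Plot all metrics against the binarization threshold in one figure."""
    fig, ax = plt.subplots(figsize=(6, 4))
    for metric in metrics:
        # One line per metric instead of one facet per metric.
        ax.plot(metrics_df["threshold"], metrics_df[metric], label=metric)
    ax.set_xlabel("threshold")
    ax.set_ylabel("score")
    ax.set_ylim(0, 1)
    ax.legend()
    fig.tight_layout()
    return fig
```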
I changed a few things in `evaluation_script.py` to plot the worst n predictions and to save the predicted cell_table. The worst n predictions are visualized like this:
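Roughly, the script change amounts to something like the following sketch; the per-sample `f1_score` column used for ranking and the helper name are assumptions:

```python
# Minimal sketch of the evaluation_script.py additions; the per-sample
# "f1_score" column and the cell_table layout are assumed here.
import pandas as pd


def save_predictions_and_rank(
    cell_table: pd.DataFrame, out_path: str, n: int = 10
) -> pd.DataFrame:
    """Save the predicted cell_table and return the worst n predictions."""
    # Persist the predicted cell table for later inspection.
    cell_table.to_csv(out_path, index=False)
    # Worst n predictions = the n samples with the lowest f1 score.
    return cell_table.nsmallest(n, "f1_score")
```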
Remaining issues
Is this really what we need for fast evaluation of our experiments, or am I missing something?
I could've also included the functionality from `evaluation_script.py` as a class method in `ModelBuilder.py`, but I don't see a clear advantage to that yet. Do you have opinions on this?
Added the worst n / best n images and a flag to split them by marker. I had to do some ugly force pushes, because I worked from different computers without fetching the remote branch before making changes 😞
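A hypothetical sketch of the flag, extending the ranking idea from the sketch above; `select_worst_best` and `split_by_marker` are illustrative names, not the actual API:

```python
# Hypothetical: group by marker before ranking when the flag is set.
import pandas as pd


def select_worst_best(
    cell_table: pd.DataFrame, n: int = 10, split_by_marker: bool = False
):
    """Return (worst n, best n) predictions, optionally grouped by marker."""
    groups = cell_table.groupby("marker") if split_by_marker else [("all", cell_table)]
    return {
        marker: (group.nsmallest(n, "f1_score"), group.nlargest(n, "f1_score"))
        for marker, group in groups
    }
```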