What is the purpose of this PR?

This PR closes #45 and adds functionality to automatically generate validation-dataset performance metric plots and tables, as well as visualizations of the worst n predictions. A future PR will extend this to the case of multiple validation datasets.
How did you implement your changes
A heatmap visualization of performance metrics, split by marker or cell_type, is added as `heatmap_plot` in `plot_utils.py`. The resulting plots look like this:
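For a rough idea of the shape of this function, here is a minimal sketch; the actual signature of `heatmap_plot` may differ, and the long-format `metric`/`value` columns and the seaborn styling are assumptions on my part:

```python
# Minimal sketch, assuming a long-format DataFrame with one row per
# (marker/cell_type, metric) pair; column names are illustrative.
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def heatmap_plot(metrics_df: pd.DataFrame, split_by: str = "marker"):
    """Plot a heatmap of performance metrics per marker or cell_type."""
    # Pivot to a (split_by x metric) matrix, e.g. rows = markers,
    # columns = precision / recall / specificity / f1_score.
    pivot = metrics_df.pivot(index=split_by, columns="metric", values="value")
    fig, ax = plt.subplots(figsize=(8, 0.5 * len(pivot) + 2))
    sns.heatmap(pivot, annot=True, fmt=".2f", vmin=0, vmax=1, cmap="viridis", ax=ax)
    ax.set_title(f"Performance metrics per {split_by}")
    fig.tight_layout()
    return fig
```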
Precision, recall, specificity, and f1-score vs. threshold are plotted together via `plot_metrics_against_threshold`, instead of as a facet plot as described in the design doc (#45). The resulting plots look like this:
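A minimal sketch of the single-figure approach (one line per metric rather than one facet per metric); the `threshold` column and the metric column names are assumptions:

```python
# Minimal sketch; DataFrame columns are assumed, not the actual layout.
import matplotlib.pyplot as plt
import pandas as pd


def plot_metrics_against_threshold(
    metrics_df: pd.DataFrame,
    metrics: tuple = ("precision", "recall", "specificity", "f1_score"),
):
    """Plot all metrics against the binarization threshold in one figure."""
    fig, ax = plt.subplots(figsize=(6, 4))
    for metric in metrics:
        # One line per metric instead of one facet per metric.
        ax.plot(metrics_df["threshold"], metrics_df[metric], label=metric)
    ax.set_xlabel("threshold")
    ax.set_ylabel("score")
    ax.set_ylim(0, 1)
    ax.legend()
    fig.tight_layout()
    return fig
```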
I changed a few things in `evaluation_script.py` to plot the worst n predictions and to save the predicted cell_table. The worst n predictions are visualized like this:
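Roughly, the script change amounts to something like the following sketch; the per-sample `f1_score` column used for ranking and the helper name are assumptions:

```python
# Minimal sketch of the evaluation_script.py additions; the per-sample
# "f1_score" column and the cell_table layout are assumed here.
import pandas as pd


def save_predictions_and_rank(
    cell_table: pd.DataFrame, out_path: str, n: int = 10
) -> pd.DataFrame:
    """Save the predicted cell_table and return the worst n predictions."""
    # Persist the predicted cell table for later inspection.
    cell_table.to_csv(out_path, index=False)
    # Worst n predictions = the n samples with the lowest f1 score.
    return cell_table.nsmallest(n, "f1_score")
```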
Remaining issues
Is this really what we need for fast evaluation of our experiments, or am I missing something?
I could've also included the functionality from `evaluation_script.py` as a class method in `ModelBuilder.py`, but I don't see a clear advantage to that yet. Do you have opinions on this?
Added the worst n / best n images and a flag to split them by marker. I had to do some ugly force pushes, because I worked from different computers without fetching the remote branch before making changes 😞
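A hypothetical sketch of the flag, extending the ranking idea from the sketch above; `select_worst_best` and `split_by_marker` are illustrative names, not the actual API:

```python
# Hypothetical: group by marker before ranking when the flag is set.
import pandas as pd


def select_worst_best(
    cell_table: pd.DataFrame, n: int = 10, split_by_marker: bool = False
):
    """Return (worst n, best n) predictions, optionally grouped by marker."""
    groups = cell_table.groupby("marker") if split_by_marker else [("all", cell_table)]
    return {
        marker: (group.nsmallest(n, "f1_score"), group.nlargest(n, "f1_score"))
        for marker, group in groups
    }
```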