We need to expand `ModelBuilder.prep_data` to work with lists of tfrecord datasets instead of a single dataset, because we'll incorporate more datasets from different sources. We might need dataset- and channel-specific loss cutoffs for ProMix, and we need to be able to supply multiple proof-read validation datasets whose validation performance is stored individually.
## Relevant background
The first version of `ModelBuilder` only uses a single dataset for training and validation, which was feasible for a proof of concept, but more data is now flowing in. Also, `ModelBuilder.prep_data` is inherited by `PromixNaive`, so it needs to stay compatible with the rest of that child class.
## Design overview
We'll need to change many small pieces of functionality in `ModelBuilder` and `PromixNaive`. `ModelBuilder.prep_data` must handle multiple datasets and sampling probabilities for drawing from each of them. `ModelBuilder.tensorboard_callbacks` must handle multiple validation datasets and write results to tensorboard individually for each of them. In addition, `evaluation_script.py` needs additional functionality to plot performance metrics split by dataset.
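The sampling behavior `prep_data` needs could be sketched roughly as below. This is a pure-Python stand-in for illustration only; the actual implementation would presumably interleave `tf.data` pipelines (e.g. via `tf.data.Dataset.sample_from_datasets`), and the function name is hypothetical:

```python
import itertools
import random


def sample_interleaved(datasets, sampling_probs, n_samples, seed=0):
    """Draw n_samples by first picking a source dataset according to
    sampling_probs, then taking the next sample from that source.

    `datasets` stands in for the decoded tfrecord datasets from
    `record_paths`; `sampling_probs` is the new per-dataset probability
    list. A sketch of the logic, not the real tf.data implementation.
    """
    assert len(datasets) == len(sampling_probs)
    assert abs(sum(sampling_probs) - 1.0) < 1e-6
    rng = random.Random(seed)
    # cycle() so short datasets are revisited instead of exhausted
    iterators = [itertools.cycle(d) for d in datasets]
    out = []
    for _ in range(n_samples):
        idx = rng.choices(range(len(datasets)), weights=sampling_probs, k=1)[0]
        out.append(next(iterators[idx]))
    return out
```

With `sampling_probs=[0.7, 0.3]`, roughly 70% of drawn samples come from the first source over a long run, which is the behavior `prep_data` should expose regardless of how it is implemented internally.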
## Code mockup
A code mockup makes little sense for this issue, since the changes span two large classes and touch many locations.
## Required inputs
The following arguments in `params` will change or be newly introduced:
- `record_paths`: becomes a list of paths instead of a single path string
- `sampling_probs`: a list of probabilities used for sampling from each of the datasets
- `num_validation`: becomes a list that stores the number of validation samples taken from each of the datasets in `record_paths`
- `external_validation_paths`: a list of paths to `.tfrecord` files that store "external" datasets. These are our hand-annotated datasets and other high-quality data used for more objective validation during training
- `filter_quantile`: becomes a list that holds, for every dataset, a quantile of cell density below which we exclude a sample from training
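A hedged sketch of how the updated `params` could look with these list-valued arguments; the paths, probabilities, and counts below are made-up placeholders, not real project values:

```python
# Illustrative params dict; every value here is a placeholder.
params = {
    "record_paths": [
        "data/source_a.tfrecord",
        "data/source_b.tfrecord",
    ],
    "sampling_probs": [0.7, 0.3],   # one probability per entry in record_paths
    "num_validation": [500, 200],   # validation samples drawn per dataset
    "external_validation_paths": [
        "data/hand_annotated.tfrecord",
    ],
    "filter_quantile": [0.1, 0.2],  # per-dataset cell-density exclusion quantile
}

# Consistency checks the new prep_data could run on these arguments
assert len(params["sampling_probs"]) == len(params["record_paths"])
assert len(params["num_validation"]) == len(params["record_paths"])
assert len(params["filter_quantile"]) == len(params["record_paths"])
assert abs(sum(params["sampling_probs"]) - 1.0) < 1e-6
```

Validating the list lengths against `record_paths` up front keeps a silently misaligned configuration (e.g. one quantile too few) from producing confusing downstream errors.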
## `PromixNaive`
`PromixNaive.batchwise_loss_selection` needs to take in the dataset and hand it to `PromixNaive.class_wise_loss_selection` and `PromixNaive.matched_high_confidence_selection`, since we expect different error rates in different datasets.
`PromixNaive.class_wise_loss_quantiles` needs a nested structure `['dataset']['marker']` instead of only `['marker']` as it is now.
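The nested quantile bookkeeping could look roughly like this. A minimal sketch: the dataset and marker names are hypothetical, and the moving-average update rule is an assumption about how the per-marker quantiles are maintained, applied here per dataset:

```python
from collections import defaultdict

# class_wise_loss_quantiles moves from quantiles[marker] to
# quantiles[dataset][marker]; a defaultdict keeps the lookups simple.
class_wise_loss_quantiles = defaultdict(dict)


def update_quantile(dataset, marker, value, ema=0.5):
    """Update the running loss quantile for one (dataset, marker) pair.

    First observation initializes the entry; later observations blend in
    via an exponential moving average (an assumed update rule).
    """
    prev = class_wise_loss_quantiles[dataset].get(marker, value)
    class_wise_loss_quantiles[dataset][marker] = ema * prev + (1 - ema) * value


# Same marker, different datasets -> independently tracked cutoffs
update_quantile("source_a", "CD45", 0.2)
update_quantile("source_b", "CD45", 0.6)
```

Keying by dataset first means `class_wise_loss_selection` can keep its existing per-marker logic unchanged once it is handed `class_wise_loss_quantiles[dataset]`.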
## Output files
Tensorboard reports will contain validation results for each dataset individually.
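One way the per-dataset output could be organized is by scalar tag layout. This sketches only the naming scheme (dataset and metric names are hypothetical); the real callback would write these tags via `tf.summary`:

```python
def validation_tags(dataset_names, metrics=("loss", "f1")):
    """Build one scalar tag per (validation dataset, metric), so each
    dataset appears as its own group of curves in tensorboard."""
    return [
        f"validation/{name}/{metric}"
        for name in dataset_names
        for metric in metrics
    ]


tags = validation_tags(["internal", "hand_annotated"])
```

Prefixing with `validation/<dataset>` groups the curves per dataset in the tensorboard UI while keeping metric names comparable across datasets.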
## Timeline
Give a rough estimate of how long you think the project will take. In general, it's better to be too conservative than too optimistic.
- [ ] A couple of days
- [x] A week
- [ ] Multiple weeks. For large projects, make sure to agree on a plan that isn't just a single monster PR at the end.
Estimated date when a fully implemented version will be ready for review: Mid December
Estimated date when the finalized project will be merged in: End of December