We need to expand `ModelBuilder.prep_data` to work with lists of tfrecord datasets instead of a single dataset, because we'll incorporate more datasets from different sources. We might need dataset- and channel-specific loss cutoffs for ProMix, and we need to be able to supply multiple proof-read validation datasets whose validation performance is stored individually.
## Relevant background
The first version of `ModelBuilder` only uses a single dataset for training and validation, which was feasible for a proof of concept, but more data is now flowing in. Also, `ModelBuilder.prep_data` is inherited by `PromixNaive`, so it needs to stay compatible with the rest of that child class.
## Design overview
We'll need to change many small pieces of functionality in `ModelBuilder` and `PromixNaive`. `ModelBuilder.prep_data` must handle multiple datasets and sampling probabilities for drawing from each of them. `ModelBuilder.tensorboard_callbacks` must handle multiple validation datasets and write results to tensorboard individually for each of them. In addition, `evaluation_script.py` needs additional functionality to plot performance metrics split by dataset.
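The sampling behavior `prep_data` needs could be sketched roughly as below. This is a pure-Python stand-in for illustration only; the actual implementation would presumably interleave `tf.data` pipelines (e.g. via `tf.data.Dataset.sample_from_datasets`), and the function name is hypothetical:

```python
import itertools
import random


def sample_interleaved(datasets, sampling_probs, n_samples, seed=0):
    """Draw n_samples by first picking a source dataset according to
    sampling_probs, then taking the next sample from that source.

    `datasets` stands in for the decoded tfrecord datasets from
    `record_paths`; `sampling_probs` is the new per-dataset probability
    list. A sketch of the logic, not the real tf.data implementation.
    """
    assert len(datasets) == len(sampling_probs)
    assert abs(sum(sampling_probs) - 1.0) < 1e-6
    rng = random.Random(seed)
    # cycle() so short datasets are revisited instead of exhausted
    iterators = [itertools.cycle(d) for d in datasets]
    out = []
    for _ in range(n_samples):
        idx = rng.choices(range(len(datasets)), weights=sampling_probs, k=1)[0]
        out.append(next(iterators[idx]))
    return out
```

With `sampling_probs=[0.7, 0.3]`, roughly 70% of drawn samples come from the first source over a long run, which is the behavior `prep_data` should expose regardless of how it is implemented internally.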
## Code mockup
A code mockup makes little sense for this issue, since the changes span two large classes and touch many locations.
## Required inputs
The following arguments in `params` will change or be newly introduced:
- `record_paths`: becomes a list of paths instead of a single path string
- `sampling_probs`: a list of probabilities used for sampling from each of the datasets
- `num_validation`: becomes a list that stores the number of validation samples taken from each of the datasets in `record_paths`
- `external_validation_paths`: a list of paths to `.tfrecord` files that store "external" datasets. These are our hand-annotated datasets and other high-quality data used for more objective validation during training
- `filter_quantile`: becomes a list that holds, for every dataset, a quantile of cell density below which we exclude a sample from training
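A hedged sketch of how the updated `params` could look with these list-valued arguments; the paths, probabilities, and counts below are made-up placeholders, not real project values:

```python
# Illustrative params dict; every value here is a placeholder.
params = {
    "record_paths": [
        "data/source_a.tfrecord",
        "data/source_b.tfrecord",
    ],
    "sampling_probs": [0.7, 0.3],   # one probability per entry in record_paths
    "num_validation": [500, 200],   # validation samples drawn per dataset
    "external_validation_paths": [
        "data/hand_annotated.tfrecord",
    ],
    "filter_quantile": [0.1, 0.2],  # per-dataset cell-density exclusion quantile
}

# Consistency checks the new prep_data could run on these arguments
assert len(params["sampling_probs"]) == len(params["record_paths"])
assert len(params["num_validation"]) == len(params["record_paths"])
assert len(params["filter_quantile"]) == len(params["record_paths"])
assert abs(sum(params["sampling_probs"]) - 1.0) < 1e-6
```

Validating the list lengths against `record_paths` up front keeps a silently misaligned configuration (e.g. one quantile too few) from producing confusing downstream errors.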
## `PromixNaive`
`PromixNaive.batchwise_loss_selection` needs to take in the dataset and hand it to `PromixNaive.class_wise_loss_selection` and `PromixNaive.matched_high_confidence_selection`, since we expect different error rates in different datasets.
`PromixNaive.class_wise_loss_quantiles` needs a nested structure `['dataset']['marker']` instead of only `['marker']` as it is now.
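The nested quantile bookkeeping could look roughly like this. A minimal sketch: the dataset and marker names are hypothetical, and the moving-average update rule is an assumption about how the per-marker quantiles are maintained, applied here per dataset:

```python
from collections import defaultdict

# class_wise_loss_quantiles moves from quantiles[marker] to
# quantiles[dataset][marker]; a defaultdict keeps the lookups simple.
class_wise_loss_quantiles = defaultdict(dict)


def update_quantile(dataset, marker, value, ema=0.5):
    """Update the running loss quantile for one (dataset, marker) pair.

    First observation initializes the entry; later observations blend in
    via an exponential moving average (an assumed update rule).
    """
    prev = class_wise_loss_quantiles[dataset].get(marker, value)
    class_wise_loss_quantiles[dataset][marker] = ema * prev + (1 - ema) * value


# Same marker, different datasets -> independently tracked cutoffs
update_quantile("source_a", "CD45", 0.2)
update_quantile("source_b", "CD45", 0.6)
```

Keying by dataset first means `class_wise_loss_selection` can keep its existing per-marker logic unchanged once it is handed `class_wise_loss_quantiles[dataset]`.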
## Output files
Tensorboard reports will contain validation results for each dataset individually.
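One way the per-dataset output could be organized is by scalar tag layout. This sketches only the naming scheme (dataset and metric names are hypothetical); the real callback would write these tags via `tf.summary`:

```python
def validation_tags(dataset_names, metrics=("loss", "f1")):
    """Build one scalar tag per (validation dataset, metric), so each
    dataset appears as its own group of curves in tensorboard."""
    return [
        f"validation/{name}/{metric}"
        for name in dataset_names
        for metric in metrics
    ]


tags = validation_tags(["internal", "hand_annotated"])
```

Prefixing with `validation/<dataset>` groups the curves per dataset in the tensorboard UI while keeping metric names comparable across datasets.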
## Timeline
Give a rough estimate of how long you think the project will take. In general, it's better to be too conservative than too optimistic.
- [ ] A couple of days
- [x] A week
- [ ] Multiple weeks. For large projects, make sure to agree on a plan that isn't just a single monster PR at the end.
Estimated date when a fully implemented version will be ready for review: Mid December
Estimated date when the finalized project will be merged in: End of December