angelolab / Nimbus


Multi dataset training and validation (with additional datasets) #39

Closed JLrumberger closed 1 year ago

JLrumberger commented 1 year ago

Instructions

We need to expand ModelBuilder.prep_data to work with lists of tfrecord datasets instead of a single dataset, because we will incorporate more datasets from different sources. We may need dataset- and channel-specific loss cutoffs for PromixNaive, and we need to be able to supply multiple proof-read validation datasets whose validation performance is stored individually.
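As a rough sketch of the multi-dataset direction, the snippet below interleaves several datasets with per-dataset sampling probabilities via `tf.data.Dataset.sample_from_datasets`. The helper name and signature are illustrative only, not the actual `ModelBuilder.prep_data` interface; in practice the inputs would be `tf.data.TFRecordDataset` objects after parsing.

```python
import tensorflow as tf

def merge_datasets(datasets, sampling_probs, batch_size=8):
    """Interleave several datasets, drawing each element from one of the
    inputs with the given probabilities.

    Hypothetical helper; in ModelBuilder the inputs would typically be
    parsed tf.data.TFRecordDataset objects.
    """
    # repeat() so that smaller datasets do not exhaust the merged stream
    repeated = [ds.repeat() for ds in datasets]
    merged = tf.data.Dataset.sample_from_datasets(
        repeated, weights=sampling_probs
    )
    return merged.batch(batch_size)
```

With this pattern the sampling probabilities become a plain list in `params`, one entry per training dataset.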

Relevant background

The first version of ModelBuilder used a single dataset for training and validation, which was sufficient for a proof of concept, but more data is now flowing in. Also, ModelBuilder.prep_data is inherited by PromixNaive, so it must remain compatible with the rest of that child class.

Design overview

We will need to change many small pieces of functionality in ModelBuilder and PromixNaive. ModelBuilder.prep_data must handle multiple datasets along with sampling probabilities for drawing from each of them. ModelBuilder.tensorboard_callbacks must handle multiple validation datasets and write results to tensorboard individually for each of them. In addition, evaluation_script.py needs additional functionality to plot performance metrics split by dataset.
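One way the per-dataset validation logging could look is a Keras callback that evaluates each named validation set at the end of every epoch and writes the metrics under a dataset-specific TensorBoard tag. This is a hedged sketch, not the actual ModelBuilder.tensorboard_callbacks implementation; the class and tag names are assumptions.

```python
import tensorflow as tf

class MultiValsetCallback(tf.keras.callbacks.Callback):
    """Evaluate the model on several named validation sets each epoch and
    log every metric under its own TensorBoard tag.

    Hypothetical sketch; the real callback wiring in ModelBuilder may differ.
    """

    def __init__(self, val_datasets, log_dir):
        super().__init__()
        self.val_datasets = val_datasets  # dict: name -> tf.data.Dataset
        self.writer = tf.summary.create_file_writer(log_dir)

    def on_epoch_end(self, epoch, logs=None):
        with self.writer.as_default():
            for name, dataset in self.val_datasets.items():
                results = self.model.evaluate(
                    dataset, verbose=0, return_dict=True
                )
                # e.g. tag "val_source_a/loss" keeps datasets separable
                for metric, value in results.items():
                    tf.summary.scalar(f"val_{name}/{metric}", value, step=epoch)
```

Keeping one tag prefix per validation set also makes it straightforward for evaluation_script.py to plot the metrics split by dataset.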

Code mockup

A code mockup makes little sense for this issue, since the changes span two large classes and need to be made in many locations.

Required inputs

The following arguments in params will change or be newly introduced:

PromixNaive
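To make the direction concrete, below is a hypothetical sketch of what the expanded params could contain. Every key name, file path, and channel name here is an assumption for illustration, not taken from the repository.

```python
# Hypothetical params layout; all keys and values are illustrative only.
params = {
    # one tfrecord path and one sampling probability per training dataset
    "record_paths": ["dataset_a.tfrecord", "dataset_b.tfrecord"],
    "dataset_sample_probs": [0.7, 0.3],
    # multiple proof-read validation sets, logged individually by name
    "validation_datasets": {
        "source_a": "val_a.tfrecord",
        "source_b": "val_b.tfrecord",
    },
    # PromixNaive: loss cutoffs per dataset and per channel
    "loss_cutoffs": {
        "dataset_a": {"CD4": 0.10, "CD8": 0.12},
        "dataset_b": {"CD4": 0.08},
    },
}
```

The nested `loss_cutoffs` mapping is one way to express dataset- and channel-specific cutoffs while leaving single-dataset configurations valid as a one-entry dict.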

Output files

Timeline

Estimated date when a fully implemented version will be ready for review: Mid December

Estimated date when the finalized project will be merged in: End of December