angelolab / Nimbus


Adding more control for validation and test dataset composition #49

Closed JLrumberger closed 1 year ago

JLrumberger commented 1 year ago

Instructions

Add a class method to ModelBuilder that filters sets of FOV names from the tfrecord into the respective test and validation datasets.

Relevant background

Instead of taking the first x tiles as test and validation data, we want to get more control over the composition of the datasets by explicitly using lists of FOV names to construct them.

Design overview

Implement a method ModelBuilder.filter_fovs(self, dataset, fov_list, positive_list) that takes a dataset, a list of FOV names, and a boolean indicating whether fov_list is a positive or a negative list, i.e. whether the listed FOVs should be filtered into or out of the returned dataset. If positive_list=True, the returned dataset contains only samples whose FOVs are in fov_list; if positive_list=False, it contains all samples except those whose FOVs are in fov_list.

Code mockup

def filter_fovs(self, dataset, fov_list, positive_list, fov_key="fov"):
    """Filter a tf.data.Dataset by FOV name.

    If positive_list is True, keep only samples whose FOV is in fov_list;
    otherwise keep only samples whose FOV is not in fov_list.
    """
    fov_set = set(fov_list)  # O(1) membership checks
    if positive_list:
        def predicate(fov):
            # fov arrives as a scalar string tensor; decode the bytes for comparison
            return fov.numpy().decode() in fov_set
    else:
        def predicate(fov):
            return fov.numpy().decode() not in fov_set

    # dataset.filter returns a new dataset, so no copy is needed
    return dataset.filter(
        lambda example: tf.py_function(predicate, [example[fov_key]], tf.bool)
    )

Required inputs

A .tfrecord dataset whose examples contain fov_key as a key, and a fov_list stored as .json that will be loaded in the init.
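To make the positive/negative list semantics concrete, here is a framework-agnostic sketch of the intended split, using plain Python dicts as a stand-in for the tf.data pipeline (the FOV names, the standalone filter_fovs helper, and the split lists are all hypothetical):

```python
# Toy stand-in for tfrecord samples keyed by "fov"
samples = [
    {"fov": "fov1"}, {"fov": "fov2"}, {"fov": "fov3"}, {"fov": "fov4"},
]

def filter_fovs(samples, fov_list, positive_list, fov_key="fov"):
    """Plain-Python analogue of ModelBuilder.filter_fovs."""
    if positive_list:
        return [s for s in samples if s[fov_key] in fov_list]
    return [s for s in samples if s[fov_key] not in fov_list]

val_fovs = ["fov1"]
test_fovs = ["fov2"]

# Positive lists filter the named FOVs *into* the returned dataset
val_samples = filter_fovs(samples, val_fovs, positive_list=True)
test_samples = filter_fovs(samples, test_fovs, positive_list=True)
# A negative list leaves everything else for training
train_samples = filter_fovs(samples, val_fovs + test_fovs, positive_list=False)
# val -> fov1, test -> fov2, train -> fov3 and fov4
```

In the real pipeline the two lists would come from the .json file loaded in the init, and the same dataset would be filtered three times with the appropriate flags.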

Output files

Loaded tfrecord dataset that holds samples from the specified FOVs.

Timeline

Give a rough estimate for how long you think the project will take. In general, it's better to be too conservative rather than too optimistic.

Estimated date when a fully implemented version will be ready for review:

Estimated date when the finalized project will be merged in:

ngreenwald commented 1 year ago

For datasets with a large number of distinct images, this seems perfect. For some of the datasets with larger images but fewer of them, we may want to do balanced selection instead of random, for example making sure that rare markers have good enough representation in the val/test datasets. We can cross that bridge when we get to it; this will definitely work for TONIC & MSK.