angelolab / Nimbus

Filter out sparse examples #34

Closed JLrumberger closed 1 year ago

JLrumberger commented 1 year ago

Relevant background

We want to further reduce our dataset. To do so, we calculate a chosen quantile x (for example 40% or 60%) of the number of positive cells per tile for each marker, and filter out all tiles that have fewer positive cells than that marker's x-quantile.

Design overview

  1. Calculate the x-quantile of positive cells per tile for each marker
  2. Filter out examples from the training data that have fewer positive cells than the x-quantile of their marker

Code mockup

Add a method quantile_filter to ModelBuilder:

import numpy as np

def quantile_filter(self):
    # Collect the number of positive cells per tile, grouped by marker
    num_pos_dict = {}
    for example in self.train_dataset:
        marker = example["marker"]
        if marker not in num_pos_dict:
            num_pos_dict[marker] = []
        num_pos_dict[marker].append(np.sum(example["activity_df"].labels == 1))
    # Per-marker threshold: the filter_quantile-quantile of those counts
    quantile_dict = {}
    for marker, pos_list in num_pos_dict.items():
        quantile_dict[marker] = np.quantile(pos_list, self.filter_quantile)
    # Keep only tiles with more positive cells than their marker's threshold
    predicate = lambda example: np.sum(example["activity_df"].labels == 1) > quantile_dict[example["marker"]]
    self.train_dataset = self.train_dataset.filter(predicate)
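
To sanity-check the logic outside of TensorFlow, the mockup can be reproduced on toy data. The following is only a minimal sketch: the helper names quantile_thresholds and filter_sparse, the flat "labels" array per example, and the marker names are assumptions made for illustration, not part of ModelBuilder or the actual dataset format.

```python
import numpy as np


def quantile_thresholds(examples, q):
    """Return {marker: q-quantile of positive-cell counts over that marker's tiles}."""
    counts = {}
    for ex in examples:
        num_pos = int(np.sum(ex["labels"] == 1))
        counts.setdefault(ex["marker"], []).append(num_pos)
    return {marker: float(np.quantile(c, q)) for marker, c in counts.items()}


def filter_sparse(examples, q):
    """Keep tiles with strictly more positive cells than their marker's threshold."""
    thresholds = quantile_thresholds(examples, q)
    return [
        ex for ex in examples
        if np.sum(ex["labels"] == 1) > thresholds[ex["marker"]]
    ]


# Toy tiles (hypothetical): one 0/1 label per cell.
examples = [
    {"marker": "CD4", "labels": np.array([1, 0, 0, 0])},  # 1 positive cell
    {"marker": "CD4", "labels": np.array([1, 1, 0, 0])},  # 2 positive cells
    {"marker": "CD4", "labels": np.array([1, 1, 1, 1])},  # 4 positive cells
    {"marker": "CD8", "labels": np.array([0, 0, 0, 0])},  # 0 positive cells
    {"marker": "CD8", "labels": np.array([1, 1, 1, 0])},  # 3 positive cells
]

kept = filter_sparse(examples, q=0.5)
```

With q=0.5 the thresholds are 2.0 for CD4 and 1.5 for CD8, so only the CD4 tile with 4 positive cells and the CD8 tile with 3 remain. Because the comparison is strict (>), the CD4 tile sitting exactly at the threshold is dropped as well; using >= instead would keep it, which may be closer to the "fewer than the x-quantile" wording of the design overview.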

Required inputs

tf.data.TFRecordDataset containing the training data.

Output files

tf.data.TFRecordDataset containing the filtered training data.

Timeline

Give a rough estimate for how long you think the project will take. In general, it's better to be too conservative rather than too optimistic.

Estimated date when a fully implemented version will be ready for review:

Estimated date when the finalized project will be merged in:

ngreenwald commented 1 year ago

Nice and simple. Looks good