In order to get a good baseline model, a few tweaks need to be implemented:
Improve optimization by adding a grokking-style training scheme (cosine-decaying learning rate scheduler, weight decay) to model_builder.py.
Add loss_function.py, which contains a loss_function class that wraps tf.keras.losses and has a class method that lets us set the loss to zero for image regions whose training label is -1 (i.e. the marker is not specific enough for the cell type).
Relevant background
This just adds a few tweaks that are known to improve training and generalization. It should give us a good baseline against which to compare future approaches that target the noisy-labels problem.
Design overview
Weight decay is configured through the kernel_regularizer and bias_regularizer attributes of tf.keras.layers, so I'll write a utility function that takes in a model, iterates through its layers, and adds L2 weight decay to the kernels and biases of all layers that contain weights.
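A rough sketch of that utility (the function name add_weight_decay and the default factor are placeholders, not final names). One caveat: assigning kernel_regularizer to an already-built layer typically has no effect until the model is rebuilt, so this version attaches the L2 penalties through layer.add_loss instead.

```python
import tensorflow as tf

def add_weight_decay(model, factor=1e-4):
    """Attach L2 weight decay to every layer that holds a kernel and/or bias.

    Sketch only: function name and default factor are placeholders.
    """
    regularizer = tf.keras.regularizers.L2(factor)
    for layer in model.layers:
        if hasattr(layer, "kernel") and layer.kernel is not None:
            # Deferred loss: re-evaluated each step and collected in model.losses.
            layer.add_loss(lambda layer=layer: regularizer(layer.kernel))
        if hasattr(layer, "bias") and layer.bias is not None:
            layer.add_loss(lambda layer=layer: regularizer(layer.bias))
    return model
```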
LR scheduler: at the moment we use the one from deepcell.utils.train_utils.rate_scheduler. If deepcell has a cosine decay scheduler we'll use that one; if not, I'll replace it with tf.keras.optimizers.schedules.CosineDecay in prep_model.py (l. 64).
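If the fallback is needed, the replacement would look roughly like the following; the initial learning rate, step count, and alpha (the minimal LR as a fraction of the initial LR) are placeholder values, not final settings.

```python
import tensorflow as tf

# Placeholder values; the real numbers come from the training config.
lr_schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=1e-3,  # starting learning rate
    decay_steps=10_000,          # total number of training steps to decay over
    alpha=0.01,                  # floor: final LR = alpha * initial_learning_rate
)
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)
```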
I'll replace ModelBuilder.prep_loss with a separate class that wraps tf.keras.losses and zeros out the loss in regions where the label is -1. It should also allow other loss functions from tf.keras.losses, such as focal loss, to be used.
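A rough mockup of the wrapper, assuming sparse integer labels of shape (batch, H, W) with -1 marking unlabeled regions; the class name, constructor signature, and the fact that masking is folded into __call__ rather than a separate class method are assumptions, not the final API.

```python
import tensorflow as tf

class LossFunction:
    """Wraps a loss from tf.keras.losses and zeros it out where the label is -1.

    Sketch only: name, signature, and sparse-label assumption are placeholders.
    """

    def __init__(self, loss_name="SparseCategoricalCrossentropy", **loss_kwargs):
        # Resolve the requested loss by name and keep per-pixel values
        # (reduction=NONE) so they can be masked before averaging.
        loss_cls = getattr(tf.keras.losses, loss_name)
        self._loss = loss_cls(reduction=tf.keras.losses.Reduction.NONE, **loss_kwargs)

    def __call__(self, y_true, y_pred):
        # y_true: integer labels of shape (batch, H, W); -1 marks regions whose
        # marker is not specific enough for the cell type.
        mask = tf.cast(tf.not_equal(y_true, -1), y_pred.dtype)  # 0 where label == -1
        safe_labels = tf.maximum(y_true, 0)                     # keep -1 out of the loss op
        per_pixel = self._loss(safe_labels, y_pred) * mask      # zero loss on unlabeled pixels
        # Average only over the labeled pixels.
        return tf.reduce_sum(per_pixel) / tf.maximum(tf.reduce_sum(mask), 1.0)
```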
Required inputs
Weight decay requires a factor to control its strength (e.g. 1e-4).
The LR scheduler requires a minimal LR and the number of training steps.
Loss functions use different arguments and take **kwargs so that they are all accessible via this wrapper (see the wiring sketch below).
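A hypothetical end-to-end wiring of these inputs, reusing the placeholder names from the sketches above; all values are illustrative only.

```python
import tensorflow as tf

# `model` is an existing tf.keras.Model (e.g. the one produced by ModelBuilder);
# names and values below are illustrative placeholders.
model = add_weight_decay(model, factor=1e-4)               # weight decay strength
lr_schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=1e-3,
    decay_steps=10_000,                                    # total training steps
    alpha=0.01,                                            # minimal LR as a fraction of the initial LR
)
loss = LossFunction("SparseCategoricalCrossentropy", from_logits=True)  # extra kwargs forwarded
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr_schedule), loss=loss)
```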
Output files
This just improves training but no direct output is created.
Timeline
Give a rough estimate for how long you think the project will take. In general, it's better to be too conservative rather than too optimistic.
[X] A couple days
[ ] A week
[ ] Multiple weeks. For large projects, make sure to agree on a plan that isn't just a single monster PR at the end.
Estimated date when a fully implemented version will be ready for review: tomorrow
Estimated date when the finalized project will be merged in: tomorrow