IntelLabs / matsciml

Open MatSci ML Toolkit is a framework for prototyping and scaling out deep learning models for materials discovery, supporting widely used materials science datasets, and built on top of PyTorch Lightning, the Deep Graph Library, and PyTorch Geometric.
MIT License

Quality of life and helper callback functions #237

Closed · laserkelvin closed this 3 months ago

laserkelvin commented 4 months ago

This PR adds a number of changes aimed at informing the user about what is happening under the hood, particularly during training.

One of the bigger philosophical changes is a shift toward logging specifically with TensorBoardLogger and WandbLogger, writing functions tailored to them rather than treating loggers entirely in the abstract as before.
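
As a rough illustration of what "tailored to the logger" can mean in practice (a sketch, not the PR's implementation; the helper name and figure argument are made up), the idea is to dispatch on the concrete logger type instead of only going through the abstract interface:

    from pytorch_lightning.loggers import TensorBoardLogger, WandbLogger

    def log_figure(trainer, tag, figure):
        # hypothetical helper: send a matplotlib figure to whichever logger is attached
        for logger in trainer.loggers:
            if isinstance(logger, TensorBoardLogger):
                # TensorBoard exposes its SummaryWriter via `experiment`
                logger.experiment.add_figure(tag, figure, global_step=trainer.global_step)
            elif isinstance(logger, WandbLogger):
                import wandb
                # the wandb run object is exposed via `experiment`
                logger.experiment.log({tag: wandb.Image(figure)})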

Summary

My intention for the TrainingHelperCallback is for it to act as a guide to best practices: we can refine it as we go and discover new things, and hopefully it will be useful for everyone, including new users. A rough sketch of the shape such a callback could take is below.
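
The hook body and message here are illustrative assumptions, not the callback shipped in this PR:

    import pytorch_lightning as pl

    class TrainingHelperCallback(pl.Callback):
        """Illustrative sketch: surface common training pitfalls to the user."""

        def on_before_optimizer_step(self, trainer, pl_module, optimizer):
            # flag parameters that never received a gradient, which often points
            # to an unused module or a detached computational graph
            for name, param in pl_module.named_parameters():
                if param.requires_grad and param.grad is None:
                    pl_module.print(f"Parameter {name} received no gradient this step.")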

laserkelvin commented 4 months ago

I have somehow broken SAM and need to fix it before review.

laserkelvin commented 4 months ago

I think I have a lead on what the issue is: because of how SAM works, and because of the modifications that "stash" embeddings in the batch structure, we now end up with two disjoint computational graphs, which causes backward to break.

This needs a bit of thought to fix...
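
For intuition, a toy reproduction under my reading of the problem (not matsciml code): once the first loss has been backpropagated, a later loss built on top of the cached embeddings tries to walk back into a graph whose buffers are gone.

    import torch

    encoder = torch.nn.Linear(4, 4)
    head = torch.nn.Linear(4, 1)
    x = torch.randn(2, 4)

    cache = {"embeddings": encoder(x)}  # built on the first graph
    loss1 = head(cache["embeddings"]).sum()
    loss1.backward()                    # frees the first graph's buffers

    # second pass (as in SAM's two-step update): the stale cached embeddings
    # still point into the first, now freed, graph
    loss2 = head(cache["embeddings"]).sum()
    loss2.backward()                    # RuntimeError: trying to backward through the graph a second time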

laserkelvin commented 4 months ago

I confirmed this by changing out BaseTaskModule.forward:

        if "embeddings" in batch:
            embeddings = batch.get("embeddings")
        else:
            embeddings = self.encoder(batch)
            batch["embeddings"] = embeddings
        outputs = self.process_embedding(embeddings)
        return outputs

Removing the branch and just running the encoder + processing the embeddings works (i.e., not trying to grab cached embeddings).

Ideally there would be a way to check whether the embeddings originated from the same computational graph, but that would take a lot more surgery than this PR warrants. I'll think of an alternative to this.
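
One cheaper heuristic (hypothetical, not something this PR does) would be to record the in-place version counters of the encoder parameters when the embeddings are stashed; an in-place weight perturbation like SAM's bumps those counters, so a mismatch would mean the cache is stale:

    def _param_versions(module):
        # torch bumps Tensor._version on in-place updates to a tensor
        return tuple(p._version for p in module.parameters())

    if "embeddings" in batch and batch.get("embedding_versions") == _param_versions(self.encoder):
        embeddings = batch["embeddings"]
    else:
        embeddings = self.encoder(batch)
        batch["embeddings"] = embeddings
        batch["embedding_versions"] = _param_versions(self.encoder)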

The reason we are stashing the embeddings is to benefit the multitask case, where we don't want to run the encoder X times for X tasks and datasets.
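
Roughly, the multitask payoff looks like this (a sketch with assumed names, not the actual multitask module):

    # single shared encoder pass, stashed so each task head can reuse it
    embeddings = encoder(batch)
    batch["embeddings"] = embeddings
    outputs = {}
    for task_name, task in tasks.items():
        # no second (or Xth) encoder call here
        outputs[task_name] = task.process_embedding(batch["embeddings"])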