FelixNahrstedt / ClimateSet_Fork

A Large-Scale Climate Model Dataset for Machine Learning

Selection of Optimisation Techniques #3

Open FelixNahrstedt opened 5 months ago

FelixNahrstedt commented 5 months ago

mats

Implementation Tasks:

  1. Set up a baseline run with Weights & Biases (wandb) logging
  2. Create a SLURM file to run the superemulator on the cluster
  3. Run on an A100 instead of an RTX 8000 → still with 32 GB of memory?
  4. Use `torch.bfloat16` for mixed-precision training (training can crash due to rounding errors, though)
  5. Pin memory to prevent swapping and speed up host-to-device transfers
  6. Increase the number of DataLoader workers
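Tasks 4–6 could be sketched in PyTorch roughly as follows. This is a minimal illustration, not the actual superemulator training loop; the dataset, model, and batch size are placeholders:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data and model -- stand-ins for the real superemulator inputs.
dataset = TensorDataset(torch.randn(256, 10), torch.randn(256, 1))
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Linear(10, 1).to(device)

# Tasks 5 and 6: pinned host memory and multiple worker processes.
loader = DataLoader(
    dataset,
    batch_size=32,
    pin_memory=True,  # page-locked host memory -> faster host-to-device copies
    num_workers=2,    # parallel data loading; tune to the node's CPU count
)

optimizer = torch.optim.Adam(model.parameters())
loss_fn = torch.nn.MSELoss()

for x, y in loader:
    # non_blocking=True lets the copy overlap with compute when memory is pinned
    x = x.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)
    optimizer.zero_grad()
    # Task 4: run the forward pass in bfloat16 via autocast. bf16 keeps the
    # fp32 exponent range (so no GradScaler is required), but its reduced
    # mantissa precision is why rounding errors can still crash training.
    with torch.autocast(device_type=device.type, dtype=torch.bfloat16):
        pred = model(x)
    # Compute the loss in fp32 for numerical stability.
    loss = loss_fn(pred.float(), y)
    loss.backward()
    optimizer.step()
```

Note that bf16 autocast is preferred over fp16 on the A100, since Ampere GPUs have native bf16 support.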

Unstructured Notes

→ the superemulator uses long time series → the batch size is large

if needed: see the Compute Canada documentation on multi-GPU training with PyTorch

multi-node training is not possible on the Mila cluster

training can crash due to rounding errors

ideally, multiple jobs are queued → if one dies, schedule the next from the last checkpoint

check the logs and resume from the last checkpoint
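The save-and-resume pattern for preempted jobs might look like the sketch below. The model, optimizer, epoch count, and checkpoint path are all hypothetical placeholders; a queued follow-up job simply finds the file and continues:

```python
import os
import torch

model = torch.nn.Linear(10, 1)            # placeholder model
optimizer = torch.optim.Adam(model.parameters())
ckpt_path = "checkpoint.pt"               # hypothetical path on cluster storage

# Resume from the last checkpoint if a previous (preempted) job left one.
start_epoch = 0
if os.path.exists(ckpt_path):
    ckpt = torch.load(ckpt_path)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    start_epoch = ckpt["epoch"] + 1

for epoch in range(start_epoch, 3):
    # ... training steps for this epoch would go here ...
    # Save after every epoch so the next queued job can pick up from here.
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "epoch": epoch,
        },
        ckpt_path,
    )
```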

preemptible GPUs are recommended

implement mixed precision

Batch size and mixed precision should be tuned so that the GPU is constantly utilised