FelixNahrstedt / ClimateSet_Fork

A Large-Scale Climate Model Dataset for Machine Learning

Selection of Optimisation Techniques #3

Open FelixNahrstedt opened 5 months ago

FelixNahrstedt commented 5 months ago

mats

Implementation Tasks:

  1. Set up a baseline run with Weights & Biases (wandb) logging
  2. Create a SLURM file to run the superemulator on the cluster
  3. Run on an A100 instead of an RTX 8000 → still with 32 GB of memory?
  4. Use `torch.bfloat16` for mixed-precision training (training can crash due to rounding errors, though)
  5. Pin memory to prevent swapping and speed up host-to-device transfers
  6. Increase the number of DataLoader workers
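Tasks 4–6 could be sketched in PyTorch roughly as follows. This is a minimal illustration, not the actual superemulator training loop; the dataset, model, and batch size are placeholders:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data and model -- stand-ins for the real superemulator inputs.
dataset = TensorDataset(torch.randn(256, 10), torch.randn(256, 1))
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Linear(10, 1).to(device)

# Tasks 5 and 6: pinned host memory and multiple worker processes.
loader = DataLoader(
    dataset,
    batch_size=32,
    pin_memory=True,  # page-locked host memory -> faster host-to-device copies
    num_workers=2,    # parallel data loading; tune to the node's CPU count
)

optimizer = torch.optim.Adam(model.parameters())
loss_fn = torch.nn.MSELoss()

for x, y in loader:
    # non_blocking=True lets the copy overlap with compute when memory is pinned
    x = x.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)
    optimizer.zero_grad()
    # Task 4: run the forward pass in bfloat16 via autocast. bf16 keeps the
    # fp32 exponent range (so no GradScaler is required), but its reduced
    # mantissa precision is why rounding errors can still crash training.
    with torch.autocast(device_type=device.type, dtype=torch.bfloat16):
        pred = model(x)
    # Compute the loss in fp32 for numerical stability.
    loss = loss_fn(pred.float(), y)
    loss.backward()
    optimizer.step()
```

Note that bf16 autocast is preferred over fp16 on the A100, since Ampere GPUs have native bf16 support.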

Unstructured Notes

→ the superemulator uses long time series → the batch size is large

if needed: see the Compute Canada documentation on multi-GPU training with PyTorch

multi-node training is not possible on the Mila cluster

training can crash due to rounding errors

ideally, multiple jobs are queued → if one dies, schedule the next from the last checkpoint

check the logs and resume from the last checkpoint
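The save-and-resume pattern for preempted jobs might look like the sketch below. The model, optimizer, epoch count, and checkpoint path are all hypothetical placeholders; a queued follow-up job simply finds the file and continues:

```python
import os
import torch

model = torch.nn.Linear(10, 1)            # placeholder model
optimizer = torch.optim.Adam(model.parameters())
ckpt_path = "checkpoint.pt"               # hypothetical path on cluster storage

# Resume from the last checkpoint if a previous (preempted) job left one.
start_epoch = 0
if os.path.exists(ckpt_path):
    ckpt = torch.load(ckpt_path)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    start_epoch = ckpt["epoch"] + 1

for epoch in range(start_epoch, 3):
    # ... training steps for this epoch would go here ...
    # Save after every epoch so the next queued job can pick up from here.
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "epoch": epoch,
        },
        ckpt_path,
    )
```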

preemptible GPUs are recommended

implement mixed precision

Batch size and mixed precision should be tuned so that the GPU is constantly utilised