Open FelixNahrstedt opened 5 months ago
Implementation Tasks (unstructured notes):
→ long time series for superemulator → batch size is large
if needed: Compute Canada documentation on PyTorch for multi-GPU training
no multi-node training possible on the Mila cluster
training can crash due to rounding errors
ideally multiple jobs queued → if one dies, schedule the next from checkpoint
look into logs and resume from last checkpoint
preemptible GPUs recommended
implement mixed precision
tune batch size and mixed precision so the GPU is constantly utilised
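
A minimal sketch of the "resume from last checkpoint" idea, so that the next queued job can pick up where a crashed or preempted one stopped. The file naming scheme (`ckpt_<step>.pkl`), the helper names, and the use of `pickle` are assumptions for illustration only; with PyTorch the state would more likely be written with `torch.save` and read with `torch.load`.

```python
import os
import pickle
import re
import tempfile
from pathlib import Path
from typing import Any, Optional, Tuple

# Hypothetical naming scheme: ckpt_<step>.pkl
CKPT_PATTERN = re.compile(r"ckpt_(\d+)\.pkl$")


def save_checkpoint(ckpt_dir: Path, step: int, state: Any) -> Path:
    """Write a checkpoint via a temp file so a killed job leaves no partial file."""
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    final_path = ckpt_dir / f"ckpt_{step:08d}.pkl"
    # Write to a temporary file in the same directory first.
    fd, tmp_name = tempfile.mkstemp(dir=ckpt_dir, suffix=".tmp")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)  # assumption: torch.save(state, f) in real training
    # Rename over the target; readers see either the old or the complete new file.
    os.replace(tmp_name, final_path)
    return final_path


def load_latest_checkpoint(ckpt_dir: Path) -> Tuple[int, Optional[Any]]:
    """Return (step, state) of the newest checkpoint, or (0, None) for a fresh start."""
    if not ckpt_dir.is_dir():
        return 0, None
    candidates = []
    for path in ckpt_dir.iterdir():
        match = CKPT_PATTERN.search(path.name)
        if match:  # leftover *.tmp files do not match and are ignored
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return 0, None
    step, path = max(candidates, key=lambda item: item[0])
    with open(path, "rb") as f:
        return step, pickle.load(f)
```

Each job would call `load_latest_checkpoint` at startup and continue from the returned step, so the same job script can be queued several times without changes.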
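
A rough sketch of the mixed precision task using PyTorch automatic mixed precision (`torch.autocast` plus a gradient scaler). The function name `train_steps`, the MSE loss, and the model and data passed in are placeholders, not the superemulator code. The scaler skips optimizer steps whose gradients contain inf or NaN, which may help with some of the crashes noted above, though that is not guaranteed. Newer PyTorch releases spell the scaler `torch.amp.GradScaler("cuda")`.

```python
import torch
from torch import nn


def train_steps(model, optimizer, batches, device):
    """Run one optimisation step per batch with automatic mixed precision."""
    use_cuda = device.type == "cuda"
    # float16 on GPU needs loss scaling; the bfloat16 CPU fallback does not.
    amp_dtype = torch.float16 if use_cuda else torch.bfloat16
    scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)
    losses = []
    for inputs, targets in batches:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad(set_to_none=True)
        # Forward pass runs in reduced precision where autocast considers it safe.
        with torch.autocast(device_type=device.type, dtype=amp_dtype):
            predictions = model(inputs)
        # Compute the loss in float32 outside the autocast region.
        loss = nn.functional.mse_loss(predictions.float(), targets)
        scaler.scale(loss).backward()  # scaled backward pass (no-op scaling on CPU)
        scaler.step(optimizer)         # skipped if gradients contain inf/NaN
        scaler.update()
        losses.append(loss.item())
    return losses
```

Mixed precision roughly halves activation memory, which leaves room for a larger batch size; whether that keeps the GPU busy still depends on the data loading pipeline.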
mats