When training s2ef tasks with otf_graph=True, I observe a memory leak that eventually leads to an OOM error:
```
slurmstepd: error: Detected 1 oom_kill event in StepId=5242886.0. Some of the step tasks have been OOM Killed.
srun: error: slurm-las1-h100-reserved-134-001: task 0: Out Of Memory
```
To reproduce: the only changes I've made to the config are

```yaml
model:
  # keep the rest as is
  otf_graph: true # necessary because training on `data/s2ef/all/train`, which is large
slurm:
  partition: priority
  constraint: h100-reserved
  mem: 200GB
```
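For context, `otf_graph: true` means the neighbor graph is rebuilt on the fly for every batch instead of being read from the preprocessed LMDB. The snippet below is only illustrative of that per-batch radius search (it uses torch_geometric's `radius_graph`; the actual code path in the repo goes through the models' graph-generation helpers and handles periodic boundary conditions, and the sizes/cutoff here are made up):

```python
# Illustrative only: a per-batch neighbor search similar to what otf_graph=true
# triggers. The real implementation handles periodic boundaries and lives in the
# repo's graph-generation utilities; tensor sizes and the cutoff are made up.
import torch
from torch_geometric.nn import radius_graph

pos = torch.rand(200, 3) * 10.0             # dummy atom positions for one batch
batch = torch.zeros(200, dtype=torch.long)  # all atoms assigned to one structure
edge_index = radius_graph(pos, r=6.0, batch=batch, max_num_neighbors=50)
print(edge_index.shape)                     # [2, num_edges], recomputed every step
```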
This problem is not specific to schnet but seems to be common across all configs. I've chosen schnet to illustrate it because its config uses a large batch size, which increases the rate of the memory leak.
In the plot below you can see:
purple: configs/s2ef/all/equiformer_v2/equiformer_v2_N@20_L@6_M@3_153M.yml with batch size 3 and otf_graph=true
red: configs/s2ef/all/schnet/schnet.yml with batch size 20 and otf_graph=true
green: a dummy model with batch size 50 and otf_graph=true
yellow: a dummy model with batch size 50 and otf_graph=false => the issue seems to be due to otf_graph=true
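Curves like these can be collected by sampling the host RSS of the training process at a fixed interval. A minimal sketch using psutil follows; the function name, CSV path, and interval are arbitrary choices of mine, not something from the repo:

```python
# Minimal RSS logger: samples the resident set size of the current process on a
# background thread and appends it to a CSV that can be plotted afterwards.
import csv
import os
import threading
import time

import psutil


def log_rss(path="rss_log.csv", interval_s=30.0):
    proc = psutil.Process(os.getpid())

    def _loop():
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["time_s", "rss_gb"])
            start = time.time()
            while True:
                writer.writerow([round(time.time() - start, 1),
                                 proc.memory_info().rss / 1e9])
                f.flush()
                time.sleep(interval_s)

    threading.Thread(target=_loop, daemon=True).start()
```

Calling `log_rss()` once near the top of the training script is enough; the daemon thread exits with the process.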
Environment:
In case it's useful, here is the dummy model I used to make sure the leak is in data loading rather than in the model code:
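A minimal sketch of such a dummy model (the class name and exact forward signature are assumptions; in practice it would be registered with the repo's model registry like any other model):

```python
# Sketch of a dummy model: ignores the inputs and predicts constant energies and
# forces, so the training loop still runs but no real model code executes. Any
# memory growth then has to come from the data-loading / graph-construction path.
# The `natoms` / `pos` attribute names follow the OC20 Batch objects.
import torch
import torch.nn as nn


class DummyS2EFModel(nn.Module):
    def __init__(self):
        super().__init__()
        # One trainable scalar so the optimizer has a parameter to update.
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, data):
        num_systems = data.natoms.shape[0]   # one energy per structure
        num_atoms = data.pos.shape[0]        # one force vector per atom
        energy = self.bias.expand(num_systems, 1)
        forces = self.bias.expand(num_atoms, 3)
        return energy, forces
```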