AI-Hypercomputer / maxtext

A simple, performant and scalable Jax LLM!
Apache License 2.0

Standalone checkpoint write seems to have memory leak #831

Open bernardhan33 opened 1 month ago

bernardhan33 commented 1 month ago

I'm running a standalone checkpointing workload with a 1T model (9.96 TiB checkpoint size) on 512 n2-standard-32 nodes. Memory usage slowly increases over time, and the job eventually reported an OOM after ~60 writes. Here's the memory consumption chart:

[Image: memory consumption chart]
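For anyone trying to reproduce this at a smaller scale, one way to confirm growth across repeated writes is to sample memory after each write and compare the steady-state samples. The sketch below is not MaxText's checkpointing code: `measure_growth`, `looks_like_leak`, and the two toy write functions are hypothetical names made up for illustration. It uses Python's standard `tracemalloc`, which only sees allocations made through the Python allocator, so a leak in native buffers would need a process-level RSS measurement instead.

```python
import gc
import tracemalloc

MIB = 1 << 20


def measure_growth(write_fn, num_writes):
    """Call write_fn(step) num_writes times and return the traced
    Python heap size (in bytes) sampled after each call."""
    samples = []
    tracemalloc.start()
    try:
        for step in range(num_writes):
            write_fn(step)
            gc.collect()  # drop collectable garbage so only retained memory counts
            current, _peak = tracemalloc.get_traced_memory()
            samples.append(current)
    finally:
        tracemalloc.stop()
    return samples


def looks_like_leak(samples, warmup=2, min_growth_bytes=MIB):
    """Heuristic: after skipping warmup samples, report a leak if the last
    sample exceeds the first steady-state sample by min_growth_bytes."""
    steady = samples[warmup:]
    if len(steady) < 2:
        return False
    return steady[-1] - steady[0] >= min_growth_bytes


# Toy stand-ins for a checkpoint write (hypothetical, for illustration only).
_retained = []


def leaky_write(step):
    # Simulates a write that keeps a reference to its 1 MiB buffer.
    _retained.append(bytearray(MIB))


def clean_write(step):
    # Simulates a write whose buffer is released when the call returns.
    buffer = bytearray(MIB)
    del buffer
```

In a real run, `write_fn` would wrap a single checkpoint save, and the per-write samples could be logged alongside the step number to see whether memory plateaus or keeps climbing.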

gobbleturk commented 1 week ago

Have you been working on this? I think your team is well placed to own and investigate this one.