waywardpidgeon opened this issue 1 week ago
The simulation referred to above did not complete: after 200 iterations a NaN was found in u_ocean. The screen messages are given in
@waywardpidgeon changing Nz will reduce the number of vertical grid points, but not the depth. However, you could consider a shallower simulation. Another possibility is to try a one-degree simulation instead:
I'm not 100% sure about the status of this simulation, but PR https://github.com/CliMA/ClimaOcean.jl/pull/260 is open to continue working on it and improve it. You might follow along and report issues there, which will help push that effort forward.
You can also simply reduce the resolution of the simulation you are working with by changing Nx, Ny here (see the sketch below):
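For orientation, here is a minimal sketch of how those resolution parameters typically appear in an Oceananigans/ClimaOcean-style setup. The constructor arguments, extents, and the particular values of Nx, Ny, Nz below are illustrative assumptions, not the actual lines from the example:

```julia
using Oceananigans

arch = GPU()

# Halving Nx and Ny cuts the number of grid points (and hence memory use)
# roughly fourfold. Changing Nz only changes the number of vertical levels;
# the total depth is set by z below.
Nx, Ny, Nz = 720, 300, 40            # illustrative values, not the example's

grid = LatitudeLongitudeGrid(arch;
                             size = (Nx, Ny, Nz),
                             longitude = (0, 360),
                             latitude = (-75, 75),
                             z = (-6000, 0),       # depth in meters, unchanged by Nz
                             halo = (7, 7, 7))
```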
To diagnose the causes of a model blow-up, I suggest starting by printing output more frequently. For example, you could change this line:
to
simulation.callbacks[:progress] = Callback(progress, IterationInterval(1))
to print output every iteration. This will show you more precisely at which iteration the model blows up. I would also try reducing the time-step systematically to see if you can stabilize the simulation. The time-step is set here:
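As a rough illustration of what that might look like (this is a sketch, not the example's actual code: the starting value of Δt is a guess, and `simulation` and `progress` are assumed to come from the example script):

```julia
using Oceananigans
using Oceananigans.Units: minutes

# Halve the time-step and rerun until the simulation no longer produces NaNs;
# the starting value here is only a guess.
simulation.Δt = 10minutes

# Print the progress message on every iteration to pinpoint where the NaN
# first appears.
simulation.callbacks[:progress] = Callback(progress, IterationInterval(1))
```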
While running near_global_ocean_simulation.jl a few days ago I got a GPU out-of-memory error quite soon after starting. Now I have tried again (with a fresh version of "main") and it has been running for about one hour without that error, completing the initial time step in 476.6 ms. The current state of the GPU is in the attached screen-shot.
I have an Nvidia RTX A2000 with 6 GB of GPU memory. The MIT simulations appear to have been run on an H100 with over 70 GB.
Query: to complete this example, should I reduce the depth resolution from Nz=40 to, say, Nz=10, making the grid size similar to the ClimaOcean documentation example (which did complete), or has the ClimaOcean or Oceananigans code been adapted to run within this rather small 6 GB of GPU memory?
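To get a feel for whether 6 GB can hold the grid at all, a back-of-envelope estimate like the one below may help. The grid dimensions and the number of 3D fields are illustrative assumptions, not the example's actual configuration, and real memory use (halos, temporaries, the CUDA context) will be higher:

```julia
# Rough lower bound on GPU memory needed for nfields three-dimensional fields
# stored in double precision (8 bytes per value), ignoring halos, temporaries,
# and the CUDA context.
function estimate_field_memory_GB(Nx, Ny, Nz; nfields = 20, bytes_per_value = 8)
    return Nx * Ny * Nz * nfields * bytes_per_value / 1e9
end

estimate_field_memory_GB(1440, 600, 40)   # ≈ 5.5 GB for a quarter-degree-like grid
estimate_field_memory_GB(1440, 600, 10)   # ≈ 1.4 GB with Nz = 10
```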