CliMA / Oceananigans.jl

🌊 Julia software for fast, friendly, flexible, ocean-flavored fluid dynamics on CPUs and GPUs
https://clima.github.io/OceananigansDocumentation/stable
MIT License

Out of memory error with Docs tests #3779

Open · simone-silvestri opened this issue 2 days ago

simone-silvestri commented 2 days ago

For example:

https://buildkite.com/clima/oceananigans/builds/17468#0191fc2c-b421-4cf7-80b5-0429336b1d7f
https://buildkite.com/clima/oceananigans/builds/17473#0191fd88-d8b9-48d5-9c7f-18efc6747ea7

I believe this happens because we are launching the docs from many different branches on a relatively small GPU. I think it would be best to move this test to the Caltech cluster. Since the Caltech cluster works with a Slurm scheduler, this error would never happen there. (I can open a PR to make this change.)

glwagner commented 1 day ago

No, that's not the reason, and I don't think we should move the docs to the Caltech cluster. The reason is that other users of tartarus are using GPU 0.

glwagner commented 1 day ago

I mean, we can move the docs to the Caltech cluster, but I think they will slow down a lot. The docs build is a bottleneck for us right now, so I don't think we can afford to move them...

Notice that the out of memory error isn't caused by our own GPU usage: we only use the GPU for the one quick start example, and for nothing else.

If we want to "solve" this, we can just get rid of the quick start example and then return to the previous behavior where we set CUDA_VISIBLE_DEVICES=-1 for the docs build.
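
For reference, here is a minimal sketch (not our actual pipeline configuration) of how the docs build could fall back to the CPU when `CUDA_VISIBLE_DEVICES=-1` hides the GPU; it assumes CUDA.jl is loaded alongside Oceananigans, and the grid below is only illustrative:

```julia
# Sketch: when CUDA_VISIBLE_DEVICES=-1 is exported before Julia starts,
# CUDA.jl sees no devices and CUDA.functional() returns false, so the
# docs build can pick the CPU architecture automatically.
using CUDA
using Oceananigans

arch = CUDA.functional() ? GPU() : CPU()
@info "Building docs examples with architecture $arch"

# Tiny illustrative grid: the rest of the example code is architecture-agnostic
# once `arch` is chosen.
grid = RectilinearGrid(arch, size=(16, 16, 16), extent=(1, 1, 1))
```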

Another solution is to hide GPU 0 from other tartarus users, or otherwise prevent them from using it.
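
On the "prevent users" side, a minimal user-side sketch with CUDA.jl (just an illustration, not an existing tartarus policy): anyone working interactively could bind their session to a device other than 0, which keeps GPU 0 free for the docs build. The device index 1 below is only an example.

```julia
# Sketch: bind this Julia session to a GPU other than device 0, leaving
# GPU 0 free for the docs build. CUDA.devices() lists what is available;
# the index 1 here is just an example.
using CUDA

CUDA.device!(1)
@info "Running on device $(CUDA.device())"
```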

simone-silvestri commented 1 day ago

I see it like this: tartarus is a shared system with very few restrictions, so it is quite difficult to prevent people from running on GPU 0 (I do not think we have the ability to implement a scheduler), which means there is a higher chance of incurring downtime because of users running on GPU 0. The Caltech cluster might be slower, but it is much more reliable because it has a professionally maintained Slurm scheduler that prevents these types of problems. I tend to prefer reliability over a modest speedup in cases like this, but I am ok with other solutions.

One solution would be to routinely kill the jobs running on GPU 0 on tartarus without warning, though that would only be possible for people with access to tartarus. I am ok following that route (I just killed a couple of jobs now 😅). It would be nice to find a more permanent solution, though.

glwagner commented 1 day ago

I think we can hide GPU 0. I suspect the tartarus CPU is much faster than the Caltech cluster's anyway.

This problem only arose because I exposed GPU 0 to the docs build.

The Slurm stuff is just maintained by us, so if we are all professionals, then we are professionals...

glwagner commented 1 day ago

I could see us eventually moving towards using tartarus only for docs. We could do some bigger GPU stuff then.

glwagner commented 1 day ago

Let's just take a step back: we had a working system until we exposed the GPU. I did that as an experiment when I added the quick start example.

Now, if the experiment isn't working, let's revisit it. Moving the docs to Caltech is a nuclear option. If it gives us a speedup, great; that's a good reason. But if it's just for the GPU issue, it makes no sense. It's like we experimented with a new vegetable in our pasta sauce, didn't like the vegetable, and decided to stop eating dinner altogether as a result. It's not logical.

simone-silvestri commented 1 day ago

Ok, I am convinced. I will close the PR, and maybe we can monitor GPU 0 more closely for the moment, until we find a stable solution.

glwagner commented 22 hours ago

Should we set CUDA_VISIBLE_DEVICES=-1 again for the docs? We just have to change the GPU quick start, which is not a big deal (changing it from a doctest to a static example).
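
To make that concrete, here is a sketch of what a static (non-doctest) GPU quick start could look like; in the docs source it would sit in a plain `julia` fence rather than a `jldoctest` fence, so Documenter renders it without executing it on the build machine. The grid and model settings below are illustrative, not the current quick start.

```julia
# Illustrative static example: rendered in the docs but not executed during
# the docs build, so it allocates no GPU memory on the build machine.
using Oceananigans

grid = RectilinearGrid(GPU(), size=(128, 128), x=(0, 2π), y=(0, 2π),
                       topology=(Periodic, Periodic, Flat))

model = NonhydrostaticModel(; grid, advection=WENO())

simulation = Simulation(model; Δt=0.01, stop_time=1)
run!(simulation)
```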