etmc / tmLQCD

tmLQCD is a freely available software suite providing a set of tools to be used in lattice QCD simulations. It is mainly an HMC implementation (including PHMC and RHMC) for Wilson, Wilson clover and Wilson twisted mass fermions, together with inverters for different versions of the Dirac operator. The code is fully parallelised and ships with optimisations for various modern architectures, such as commodity PC clusters and the Blue Gene family.
http://www.itkp.uni-bonn.de/~urbach/software.html
GNU General Public License v3.0

possible device memory leak in HMC interface to QUDA #507

Closed: kostrzewa closed this issue 2 years ago

kostrzewa commented 2 years ago

In test runs of cD211.054.96, the job aborts after three trajectories because it exceeds the device memory limit:

[image: out-of-memory abort message]

Note that this was a synthetic test with zero acceptance rate.

kostrzewa commented 2 years ago

Note that there is no reason for memory usage to increase, so we must have an instance of memory not being freed.

kostrzewa commented 2 years ago

The strange thing is that I cannot reproduce this in runs on my machine...

urbach commented 2 years ago

> The strange thing is that I cannot reproduce this in runs on my machine...

compiler bug...?

kostrzewa commented 2 years ago

It can be any number of things. The test on my machine is of course not very representative because I'm using a (slightly) different CUDA version and, more importantly, don't have communication between multiple nodes (so there's not really any comms, just P2P memory accesses in my case).

If we had CUDA 11.x, GCC 9.3.0 and OpenMPI 4.0.x on QBIG, that would be a viable path to a reproduction test on a different machine... Another possibility is Piz Daint, however.

kostrzewa commented 2 years ago

I think I've narrowed this down to destroyMultigridQuda and the subsequent reallocation. It seems that the memory isn't really freed. The reason I suspect this is that this particular memory leak occurs when a trajectory is rejected, which is basically the only time that the setup is regenerated.

As noted by @sbacchio, it might make sense to regenerate the setup once per trajectory in any case. Refreshing (rather than evolving) the test / null vectors once per trajectory is probably a good idea and the cost is quite small.
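As a rough illustration of this idea (a sketch only, assuming QUDA's updateMultigridQuda entry point is used for the refresh; the function name and call site below are hypothetical and not the tmLQCD implementation):

```c
/* Hypothetical per-trajectory hook: refresh the existing MG setup on the
 * current gauge field instead of only regenerating it after a rejection.
 * mg_instance is the pointer previously returned by newMultigridQuda(). */
#include <quda.h>

void refresh_mg_setup(void *mg_instance, QudaMultigridParam *mg_param)
{
  /* updateMultigridQuda() re-runs the setup phase for the current gauge
   * field; whether the null vectors are regenerated or reused is
   * controlled by the settings in mg_param. */
  updateMultigridQuda(mg_instance, mg_param);
}
```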

pittlerf commented 2 years ago

I have seen this kind of issue in the run I did yesterday on Marconi A100; at some point I simply ran out of memory.

# TM_QUDA: Time for reorder_gauge_toQuda 2.427405e-02 s level: 3 proc_id: 0 /HMC/GAUGE:gauge_derivative/compute_gauge_derivative_quda/reorder_gauge_toQuda
ERROR: Failed to allocate device memory of size 445906944 (/m100_work/INF21_lqcd123_0/fpittler/quda_lattice/lib/cuda_gauge_field.cpp:46 in cudaGaugeField())
 (rank 10, host r219n15, malloc.cpp:227 in device_malloc_())

Together with Simone, we are trying to catch this error.

pittlerf commented 2 years ago

Dear all, we have investigated this issue with @sbacchio. We ran two test runs with L=28, nf=2+1+1. We found that setting the QUDA environment variables export QUDA_ENABLE_DEVICE_MEMORY_POOL=0; export QUDA_ENABLE_PINNED_MEMORY_POOL=0 is essential to avoid running out of memory. As an illustration, we show here the memory usage with the two flags set to zero and with their default values: setting them to zero leads to essentially constant memory usage.

[image: device memory usage with and without the memory pools]

kostrzewa commented 2 years ago

@pittlerf Thanks, this is a good workaround.

Does setting just one of the two already resolve the problem? Using a pool for pinned (host) memory should not lead to a memory leak on the device.

Using a device memory pool might cause problems if QUDA's allocator keeps allocating new chunks instead of reusing the pool memory...

kostrzewa commented 2 years ago

@maddyscientist Have you perhaps observed a situation where setting QUDA_ENABLE_DEVICE_MEMORY_POOL=1 and calling destroyMultigridQuda and subsequently newMultigridQuda leads to a device memory leak (perhaps because pool memory is not reused)? It could well be that I simply did something stupid in our interface (https://github.com/etmc/tmLQCD/blob/ce9753848039a9a2f63eeff6d244fd1b4c62718d/quda_interface.c#L1479), but I don't really see why what I implemented should be a problem.

By contrast, when QUDA_ENABLE_DEVICE_MEMORY_POOL=0, it seems that memory is properly freed when the setup is reset (which I do by destroying and then recreating it).
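For reference, a minimal sketch of the destroy-and-recreate reset pattern being discussed (illustrative names only, not the code behind the link above):

```c
/* Sketch of the MG setup reset: destroy the old setup and build a new one.
 * newMultigridQuda() and destroyMultigridQuda() are QUDA API calls; the
 * static pointer and function name here are illustrative. */
#include <stddef.h>
#include <quda.h>

static void *mg_preconditioner = NULL;

void reset_mg_setup(QudaMultigridParam *mg_param, QudaInvertParam *inv_param)
{
  if (mg_preconditioner != NULL) {
    /* With QUDA_ENABLE_DEVICE_MEMORY_POOL=1 the device memory backing the
     * old setup may be retained in the pool rather than returned here. */
    destroyMultigridQuda(mg_preconditioner);
    mg_preconditioner = NULL;
  }
  /* Regenerate the setup (null vectors, coarse operators) from scratch. */
  mg_preconditioner = newMultigridQuda(mg_param);
  inv_param->preconditioner = mg_preconditioner;
}
```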

kostrzewa commented 2 years ago

@pittlerf Just for completeness: in your test, did you have the same acceptance rate in both cases? The setup is only ever really destroyed when a trajectory is not accepted.

pittlerf commented 2 years ago

> @pittlerf Just for completeness: in your test, did you have the same acceptance rate in both cases? The setup is only ever really destroyed when a trajectory is not accepted.

Hi Bartek, I just checked: no, I do not have the same acceptance rate. For the run without QUDA_ENABLE_DEVICE_MEMORY_POOL=0 (i.e. with the pool enabled), I actually have a zero acceptance rate (only 1 trajectory was completed before crashing):

hmctest-227520_4294967294_227520.out:# Trajectory is not accepted.

However, with the device memory pool set to zero I have a 100% acceptance rate over 6 trajectories:

# Computed plaquette value: 0.568414752442.
# Acceptance rate was 100.00 percent, 6 out of 6 trajectories accepted.

kostrzewa commented 2 years ago

Thanks. But that indicates that we're not talking about the same problem, I think. The two runs should be identical up to differences coming from the random numbers in the MG setup (until you run out of memory). In principle, there should not be any memory leak unless a trajectory is rejected (at least as far as I've observed, but I might have concluded incorrectly). I guess we need to investigate more...

kostrzewa commented 2 years ago

So just to confirm, I don't see a memory leak (as long as no trajectories are rejected) even with both memory pools enabled (note that this is a dual GPU, single node run, so it might not be comparable to your case).

kostrzewa commented 2 years ago

Note that with the device memory pool enabled, maximum device memory usage is quite a bit higher (for my specific test, 2.2 GB vs 1.4 GB). I suspect that your job crashed because you simply ran out of memory when running with the device pool enabled, not because there was a memory leak as such.

pittlerf commented 2 years ago

Thanks, yeah, we need to investigate this further. I was running on 2 nodes with 2 P100 GPUs each and did 3 different runs with the device memory pool enabled; all of them crashed.

kostrzewa commented 2 years ago

I can confirm that setting QUDA_ENABLE_DEVICE_MEMORY_POOL=0 prevents the memory leak that I've observed. To test this explicitly, this is a run with 0% acceptance such that the MG setup is destroyed and recreated at the start of every trajectory.

Device memory usage follows an expected pattern over many trajectories (with some variation at the start, as you can see; I think this run spent time tuning because I switched from 2 GPUs to 1 to be able to use my system while the test was running):

[image: device memory usage over many trajectories with the device memory pool disabled]

kostrzewa commented 2 years ago

Interesting behaviour when I set QUDA_ENABLE_DEVICE_MEMORY_POOL=1 in a similar run with zero acceptance, and thus repeated destruction and recreation of the MG setup: after a couple of trajectories, the pool allocator seems to be satisfied and stops allocating new memory. Memory usage then remains constant (although more than a factor of two higher than with the device memory pool disabled).

[image: device memory usage with the device memory pool enabled]

The memory leak thus appears not to be a real leak but simply a consequence of the higher memory usage when the pool allocator is used. Still, it would be better if the pool allocator correctly re-used the already allocated pool...

kostrzewa commented 2 years ago

And just to make sure I repeated the exercise running on 2 GPUs with the same result.

[image: device memory usage for the 2-GPU run]

Since we have so many other overheads, I think disabling the device memory pool should not hurt us too much.

sbacchio commented 2 years ago

@kostrzewa do you see any speed-up / advantage from using the memory pool? I was aware of this behaviour and it was one of the flags we would often disable to save memory... I've never seen a significant speed-up between on and off.

kostrzewa commented 2 years ago

I think if you have any other overhead to worry about then the pool can be safely disabled without incurring any noticeable penalty. I was also aware of the fact that it uses more memory, but I'm very surprised that it's this extreme...

sbacchio commented 2 years ago

Do you think setting this variable to 0 from tmLQCD before initializing QUDA would work? And do we want to hard-code it, or hope that everyone remembers to set it?

kostrzewa commented 2 years ago

> Do you think setting this variable to 0 from tmLQCD before initializing QUDA would work?

This is a good idea. Done: https://github.com/etmc/tmLQCD/pull/518
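
For illustration, a minimal sketch of what such a guard might look like (the actual change is the one in the PR above; the function name here is hypothetical):

```c
/* Disable QUDA's device memory pool before QUDA is initialized, unless the
 * user has already set the variable explicitly: the final argument 0 tells
 * setenv() not to overwrite an existing value. */
#include <stdlib.h>
#include <quda.h>

void init_quda_with_defaults(int device)
{
  setenv("QUDA_ENABLE_DEVICE_MEMORY_POOL", "0", 0);
  initQuda(device); /* QUDA reads its environment variables during init */
}
```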