OrderN / CONQUEST-release

Full public release of the large-scale and linear-scaling DFT code CONQUEST
http://www.order-n.org/
MIT License

Memory leak for linear scaling ompGemm_m kernel on atom movement #280

Closed. davidbowler closed this issue 4 months ago.

davidbowler commented 7 months ago

I have identified a significant memory leak in O(N) geometry optimisation, though not the source. The system (which I will attach input files for when I have a better internet connection) is 4,096 atoms of silicon (8x8x8 fundamental cells) running on 64 processes (though this does not seem to matter) with a SZ basis and L range of 16 bohr. With the develop branch, at some point after the initial movement of atoms the memory use spikes (I've seen at least 16GB in one process) and even with 1.5TB of RAM the job fails. The same behaviour is not seen for v1.2.

davidbowler commented 7 months ago

Input files attached. MemoryError.zip

davidbowler commented 7 months ago

The problem seems to lie only in the ompGemm_m version of multiply_kernel. After atoms are moved, the maxima found at the start of m_kern_max and m_kern_min become very large (and in some cases overflow) leading to the memory problem. I'm not yet sure why these maxima fail...

tkoskela commented 5 months ago

I created a new test in the testsuite test_004 using the files in MemoryError.zip. It seems like all the multiply kernels fail on it.

See https://github.com/OrderN/CONQUEST-release/actions/runs/7035615252/job/19146415009

Interestingly enough, ompGemm_m also fails in test_003, so maybe there is a different problem there.

tkoskela commented 5 months ago

Regarding ompGemm_m: running test_003 on my laptop, it looks like there can be uninitialized values in bndim2 and nbnab. These come from pointers to the prefetched array in multiply_module.

The problem is that we assume a size of mx_part for the integer arrays nbnab, ibaddr, etc., but this is not always correct. Because we are using a pointer, we don't explicitly allocate a size for the array (although it looks like it might be possible to figure it out from the code in the kernel). The maxval function then looks at all elements from 1 to mx_part and, for some of the pointers, finds uninitialized memory.
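
To make the failure mode concrete, here is a minimal, self-contained sketch (illustrative values and names, not the CONQUEST code): the dummy argument is declared with extent mx_part, but only the first n_cont entries have been filled, so maxval over the declared extent picks up whatever happens to sit in the rest of the buffer.

```fortran
! Sketch only: an explicit-shape dummy of extent mx_part scanned past
! the n_cont entries that were actually filled. In CONQUEST the stray
! entries are uninitialised memory; here they are simulated by a large
! sentinel value.
program assumed_size_maxval
  implicit none
  integer, parameter :: mx_part = 8
  integer :: part_array(mx_part)

  part_array = 999999          ! stand-in for uninitialised memory
  part_array(1:3) = [2, 5, 1]  ! only the first n_cont = 3 entries are meaningful

  call kernel_like(part_array, 3)

contains

  subroutine kernel_like(nbnab, n_cont)
    integer, intent(in) :: n_cont
    integer, intent(in) :: nbnab(mx_part)   ! explicit shape: assumes mx_part entries
    print *, 'maxval over mx_part:', maxval(nbnab)           ! 999999
    print *, 'maxval over n_cont :', maxval(nbnab(1:n_cont)) ! 5
  end subroutine kernel_like

end program assumed_size_maxval
```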

Personally I would be strongly in favour of getting rid of the pointers. One possible solution would be to just pass the integer indices to each sub-array instead, if the underlying data structure is too difficult to change.

tkoskela commented 5 months ago

A simple solution to this particular issue seems to be using assumed-shape arrays in the multiply kernel, i.e. integer(integ) :: ibaddr(:) instead of integer(integ) :: ibaddr(mx_part). How exactly the correct shape gets propagated to the kernel is anyone's guess; perhaps, because the pointer points to a chunk of memory of a specified size, that information is carried along with the pointer.
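
For what it's worth, a quick self-contained check (illustrative names, not the CONQUEST ones) shows the standard behaviour: a data pointer carries its own bounds, and an assumed-shape dummy takes its extent from the actual argument, so size and maxval only ever see the elements the pointer actually covers.

```fortran
! Sketch only: a pointer carries its own bounds, and an assumed-shape
! dummy argument inherits them from the actual argument.
program assumed_shape_demo
  implicit none
  integer, target  :: part_array(32)
  integer, pointer :: ibaddr(:)
  integer :: i

  part_array = [(i, i = 1, 32)]
  ibaddr => part_array(5:12)    ! 8 elements out of 32

  call kernel_like(ibaddr)

contains

  subroutine kernel_like(ibaddr)
    integer, intent(in) :: ibaddr(:)   ! assumed shape: extent travels with the argument
    print *, 'size inside kernel:', size(ibaddr)    ! 8
    print *, 'maxval            :', maxval(ibaddr)  ! 12, no stray memory read
  end subroutine kernel_like

end program assumed_shape_demo
```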

tsuyoshi38 commented 5 months ago

Let me ask you a question related to this problem. With the changes in tk-fix-multiply-kernel-memory-leak, has the memory leak actually been solved?

We are also suffering from a memory leak in MD simulations (though not O(N) calculations). I think the changes in the branch make sense. Since the pointers are passed as ordinary array arguments to subroutines like end_part_comms, the compiler may prepare temporary arrays, copied from part_array, for use inside those subroutines. We would have memory leaks if these temporaries were not deallocated properly. (But then we should also see this problem in long SCF calculations...) Do you think this happens with the present CONQUEST?

tkoskela commented 5 months ago

The problem that #293 fixes is that the receiving subroutines in multiply_kernel assume an array of size mx_part, but the pointer that is passed in may be smaller. As far as I could tell, this was not causing errors anywhere except in multiply_kernel_ompGemm_m, where the maxval function was used to find the maximum value of the whole array, and the return value of maxval was being used as the size of a new allocatable array. Other accesses to the array were done by index, where presumably the index was calculated to fit the bounds of the actual array. In end_part_comms the pointers were already being passed into assumed-shape arrays, so I didn't see the same problem there.

Which multiply kernel are you using in your simulations? I suppose the easiest thing to do would be to test whether your issues are resolved by the changes in #293, if not then we need to debug further.

Another thing to test would be #294, which avoids the use of pointers completely, to see if the pointers are causing memory leaks.

In general, I think arrays allocated in subroutines should always get deallocated when the subroutine goes out of scope, so I don't immediately see how that would lead to memory leaks. I would first look for allocates going awry somewhere.
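
For reference, the distinction I have in mind (a self-contained sketch with made-up names, not CONQUEST code): a local allocatable is freed automatically when the subroutine returns, whereas memory reached only through a pointer leaks unless it is deallocated explicitly.

```fortran
! Sketch only: local allocatables are deallocated automatically on
! return; pointer allocations are not, and leak if never deallocated.
program scope_demo
  implicit none
  integer :: step

  do step = 1, 3
     call work_alloc()    ! no leak: buf is freed on return
     call work_pointer()  ! leaks one array per call
  end do

contains

  subroutine work_alloc()
    real, allocatable :: buf(:)
    allocate(buf(1000))
    buf = 0.0
    ! buf is deallocated automatically at the end of the subroutine
  end subroutine work_alloc

  subroutine work_pointer()
    real, pointer :: buf(:)
    allocate(buf(1000))
    buf = 0.0
    ! without an explicit deallocate(buf), this allocation is lost on return
  end subroutine work_pointer

end program scope_demo
```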

davidbowler commented 5 months ago

The changes in #293 don't seem to fix this issue, frustratingly.

davidbowler commented 5 months ago

Nor do the changes in #294. I have found a simpler example, however, which runs happily on 16 processes but crashes immediately when run with ompGemm_m; it is attached below.

SimplerExample.zip

negf commented 5 months ago

Hi,

Sorry to butt into the conversation. I think the problem is that, for

maxnd2 = maxval(bndim2)

bndim2 is not properly initialized. It can have random values, which would also explain the random memory usage. Initializing

part_array = 0

just before

      call prefetch(kpart,a_b_c%ahalo,a_b_c%comms,a_b_c%bmat,icall,&
           n_cont,part_array,a_b_c%bindex,b_rem,lenb_rem,b,myid,ilen2,&
           mx_msg_per_part,a_b_c%parts,a_b_c%prim,a_b_c%gcs,(recv_part(nnode)-1)*2)

in multiply_module.f90 (line 243) solves the problem for me.
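
A self-contained illustration of why the zeroing matters (made-up names and sizes, not the CONQUEST ones): when a work buffer is reused between rounds and only partially refilled, maxval picks up stale entries from the previous round; in the real code those stray entries are simply uninitialised memory.

```fortran
! Sketch only: a reused buffer that is only partially refilled each
! round. Without zeroing it first, maxval in round 2 still reports the
! larger value written in round 1.
program init_before_fill
  implicit none
  integer :: part_array(16)
  integer :: round, n_filled, maxnd2

  part_array = 0
  do round = 1, 2
     n_filled = merge(12, 4, round == 1)   ! round 2 fills fewer entries

     ! part_array = 0   ! the proposed fix: uncomment and round 2 reports 4, not 12

     part_array(1:n_filled) = n_filled     ! stand-in for what prefetch writes
     maxnd2 = maxval(part_array)
     print *, 'round', round, 'maxnd2 =', maxnd2
  end do
end program init_before_fill
```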

tsuyoshi38 commented 5 months ago

Sorry for my late reply. We are using the default multiply kernels, and have found that the changes in #294 do not solve the memory leak (using the branch tk-refactor-multiply-module-pointers).
I may have forgotten to mention this, but in our case we are seeing the memory leak in MD calculations (after a long run) with diagonalisation. I had thought that replacing the pointers with allocatable arrays would help, but unfortunately it did not.

davidbowler commented 5 months ago

Thank you @negf for your contribution: this fixes the problem in most cases. It inspired me to check for other occurrences of maxval where variables might be uninitialised, and I found another two. Fixing these (i.e. initialising them to zero) seems to fix the crash in this case. I will open a new branch and issue a pull request with the fix (though it may not help with the MD leak that @tsuyoshi38 is reporting - I don't know).

davidbowler commented 5 months ago

I have added my changes to #293 which should now fix this problem.

davidbowler commented 4 months ago

Closing now because #293 has sorted this issue