Closed: davidbowler closed this issue 4 months ago
Input files attached: MemoryError.zip
The problem seems to lie only in the ompGemm_m version of the multiply kernel. After atoms are moved, the maxima found at the start of `m_kern_max` and `m_kern_min` become very large (and in some cases overflow), leading to the memory problem. I'm not yet sure why these maxima fail...
I created a new test in the testsuite, test_004, using the files in MemoryError.zip. It seems like all the multiply kernels fail on it. See https://github.com/OrderN/CONQUEST-release/actions/runs/7035615252/job/19146415009
Interestingly enough, ompGemm_m fails in test_003, so maybe there is a different problem there.
Regarding ompGemm_m: running test_003 on my laptop, it looks like there can be uninitialized values in `bndim2` and `nbnab`. These come from pointers to the prefetched array in `multiply_module`.
The problem is that we assume a size of `mx_part` for the integer arrays `nbnab`, `ibaddr`, etc., but this is not always correct. Because we are using a pointer, we don't explicitly allocate a size for the array (although it looks like it might be possible to work it out from the code in the kernel). The `maxval` function then looks at all elements from 1 to `mx_part` and, for some of the pointers, finds uninitialized memory.
Personally I would be strongly in favour of getting rid of the pointers. One possible solution would be to just pass the integer indices to each sub-array instead, if the underlying data structure is too difficult to change.
A simple solution to this particular issue seems to be using assumed-shape arrays in the multiply kernel, i.e. `integer(integ) :: ibaddr(:)` instead of `integer(integ) :: ibaddr(mx_part)`. The correct shape gets propagated to the kernel because a Fortran array pointer carries its bounds in its descriptor, so an assumed-shape dummy argument picks up the extent of the actual pointer target rather than assuming `mx_part`.
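To illustrate the difference, here is a minimal sketch (not the actual CONQUEST interfaces; the module, function names and sizes are invented for illustration):

```fortran
! Sketch: explicit-shape vs assumed-shape dummy arguments.
! With an explicit-shape dummy of size mx_part, maxval scans all
! mx_part elements even when the actual argument is smaller; with an
! assumed-shape dummy, the bounds come from the actual argument
! (here, from the pointer's descriptor).
module kernel_sketch
  implicit none
  integer, parameter :: mx_part = 100
contains
  ! Dangerous: assumes mx_part elements regardless of the actual size
  function max_explicit(ibaddr) result(m)
    integer, intent(in) :: ibaddr(mx_part)
    integer :: m
    m = maxval(ibaddr)   ! may read past the real data
  end function max_explicit

  ! Safe: the extent of ibaddr is taken from the actual argument
  function max_assumed(ibaddr) result(m)
    integer, intent(in) :: ibaddr(:)
    integer :: m
    m = maxval(ibaddr)   ! scans only the elements that exist
  end function max_assumed
end module kernel_sketch
```

Calling `max_assumed` with a pointer that targets, say, 10 elements scans exactly those 10 elements, whereas `max_explicit` would scan 100 and pick up whatever happens to sit beyond the real data.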
Let me ask a question related to this problem. With the changes in tk-fix-multiply-kernel-memory-leak, has the memory-leak problem actually been solved? We are also suffering from a memory leak in MD simulations (though not in O(N) calculations).
I think the changes in the branch make sense. Since the pointers are passed as ordinary array arguments to subroutines like `end_part_comms`, the compiler may prepare temporary arrays copied from `part_array` for use in the subroutines. We would have memory leaks if these arrays were not deallocated properly. (But then we should also see this problem in long SCF calculations...) Do you think this happens with the present CONQUEST?
The problem that #293 fixes is that the receiving subroutines in `multiply_kernel` assume an array of size `mx_part`, but the pointer that is passed in may be smaller. As far as I could tell, this was not causing errors anywhere except in `multiply_kernel_ompGemm_m`, where the `maxval` function was used to find the maximum value of the whole array, and the return value of `maxval` was being used as the size of a new allocatable array. Other accesses to the array were done by index, where presumably the index was calculated to fit the bounds of the actual array. In `end_part_comms` the pointers were already being passed into assumed-shape arrays, so I didn't see the same problem there.
Which multiply kernel are you using in your simulations? I suppose the easiest thing to do would be to test whether your issues are resolved by the changes in #293, if not then we need to debug further.
Another thing to test would be #294, which avoids the use of pointers completely, to see if the pointers are causing memory leaks.
In general, I think arrays used in subroutines should always get deallocated when the subroutine goes out of scope, so I don't immediately see how that would lead to memory leaks. I would first look for allocates going awry somewhere.
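For reference, this automatic clean-up applies to unsaved allocatable locals, which the Fortran standard requires to be deallocated when the subroutine returns, but not to pointer allocations, which is one reason pointers are the more likely leak source. A minimal sketch (names invented):

```fortran
! Sketch: allocatables are deallocated automatically at end of scope;
! pointer allocations persist and leak unless deallocated explicitly.
subroutine no_leak(n)
  integer, intent(in) :: n
  real, allocatable :: work(:)
  allocate(work(n))
  ! ... use work ...
end subroutine no_leak   ! work is deallocated automatically here

subroutine leaks(n)
  integer, intent(in) :: n
  real, pointer :: work(:)
  allocate(work(n))
  ! ... use work ...
end subroutine leaks     ! the allocation persists: a leak on every
                         ! call unless deallocate(work) is added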
The changes in #293 don't seem to fix this issue, frustratingly.
Nor do the changes in #294. I have found a simpler example, however, which runs happily on 16 processes and crashes immediately when run with ompGemm_m; it is attached below.
Hi,

Sorry to butt in on the conversation. I think the problem is that in

```fortran
maxnd2 = maxval(bndim2)
```

`bndim2` is not properly initialized: it can contain random values, which would also explain the random memory usage. Initializing

```fortran
part_array = 0
```

just before

```fortran
call prefetch(kpart,a_b_c%ahalo,a_b_c%comms,a_b_c%bmat,icall,&
     n_cont,part_array,a_b_c%bindex,b_rem,lenb_rem,b,myid,ilen2,&
     mx_msg_per_part,a_b_c%parts,a_b_c%prim,a_b_c%gcs,(recv_part(nnode)-1)*2)
```

in multiply_module.f90 (line 243) solves the problem for me.
Sorry for my late reply. We are using the default multiply kernels, and have found that the changes in #294 do not solve the memory-leak problem (using the branch tk-refactor-multiply-module-pointers).

I may have forgotten to mention it, but in our case we are suffering from the memory leak in MD calculations (after a long run) with diagonalisation. I still thought that replacing the pointers with allocatable arrays would help, but unfortunately it did not...
Thank you @negf for your contribution: this fixes the problem in most cases. It inspired me to check for other occurrences of `maxval` where variables might be uninitialised, and I found another two. Fixing these (i.e. setting them to zero) seems to fix the crash in this case. I will open a new branch and issue a pull request with the fix (though it may not help with the MD leak that @tsuyoshi38 is reporting - I don't know).
I have added my changes to #293 which should now fix this problem.
Closing now because #293 has sorted this issue.
I have identified a significant memory leak in O(N) geometry optimisation, though not the source. The system (which I will attach input files for when I have a better internet connection) is 4,096 atoms of silicon (8x8x8 fundamental cells) running on 64 processes (though this does not seem to matter) with a SZ basis and L range of 16 bohr. With the develop branch, at some point after the initial movement of atoms the memory use spikes (I've seen at least 16GB in one process) and even with 1.5TB of RAM the job fails. The same behaviour is not seen for v1.2.