NyanCatTW1 opened this issue 2 years ago
I just reverted the commit that I think is likely causing the issue. Did that fix it?
That failed. The memory usage started going up again after multiplying 2425308 relations
➜ /home/nyancat/Tools/msieve_nfsathome git:(75ff92a) ✗ git bisect bad
75ff92a5c72f572c03ae3b7e201e2b92ec121837 is the first bad commit
commit 75ff92a5c72f572c03ae3b7e201e2b92ec121837
Author: Greg Childers <childers@dt-login02.delta.internal.ncsa.edu>
Date: Sun Aug 14 12:52:27 2022 -0500
Add OpenMP support for sqrt
Makefile | 4 ++++
common/driver.c | 10 ++++++++++
gnfs/relation.c | 4 +++-
gnfs/sqrt/sqrt.c | 1 +
gnfs/sqrt/sqrt_a.c | 24 +++++++++++++++++++-----
include/msieve.h | 4 ++++
6 files changed, 41 insertions(+), 6 deletions(-)
Seems like the leak goes all the way back to the day sqrt became multithreaded
I'll mess with the code a bit to see what might be causing the leak. (Everything below is tested on commit 75ff92a5c72f572c03ae3b7e201e2b92ec121837.)
OMP=0: Good
OMP=1 with every #pragma in sqrt_a.c removed: Good
	else {
		/* recursively compute the product of the first
		   half and the last half of the relations */
		uint32 mid = (index1 + index2) / 2;
		#pragma omp parallel
		#pragma omp single
		{
			#pragma omp task
			multiply_relations(prodinfo, index1, mid, &prod1);
			#pragma omp task
			multiply_relations(prodinfo, mid + 1, index2, &prod2);
			#pragma omp taskwait
		}
	}
Simply having these #pragma directives present causes the leak.
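For reference, here is a minimal standalone sketch of the same structure (a new parallel region opened inside every recursive task) that one could build with -fopenmp to check whether the OpenMP runtime alone accounts for the growth. This is an invented reproducer, not msieve code, and whether it actually shows growing memory will depend on the runtime (libgomp vs. libomp):

/* Invented reproducer, not msieve code: mirrors the recursion above by
 * opening a parallel region inside each task. Build with
 * `gcc -fopenmp repro.c` and watch resident memory while it runs. */
#include <stdio.h>

static void recurse(int depth)
{
	if (depth == 0)
		return;

	#pragma omp parallel
	#pragma omp single
	{
		#pragma omp task
		recurse(depth - 1);
		#pragma omp task
		recurse(depth - 1);
		#pragma omp taskwait
	}
}

int main(void)
{
	int i;

	/* many shallow recursions, so any per-region bookkeeping that is
	 * never returned to the OS has a chance to accumulate */
	for (i = 0; i < 200; i++)
		recurse(8);

	printf("done\n");
	return 0;
}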
GMP calls do reallocations as needed internally. Looks like there's still an issue with malloc and OpenMP as described in this thread from over a decade ago:
https://gmplib.org/list-archives/gmp-discuss/2009-July/003852.html
If you are so inclined, you could try using Hoard to see if it resolves the issue.
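If it would help to separate what GMP itself requests from what the allocator or OpenMP runtime retains, GMP allows installing custom allocation functions via mp_set_memory_functions. A rough sketch with invented wrapper names, assuming it is called once at startup before any mpz_t is initialized:

/* Rough sketch, invented names: count the bytes GMP requests so they can
 * be compared against the process's resident memory. Must be installed
 * before any GMP integer is initialized. */
#include <gmp.h>
#include <stdio.h>
#include <stdlib.h>

static size_t gmp_bytes_requested;	/* total requested, not net live */

static void *counting_alloc(size_t n)
{
	#pragma omp atomic
	gmp_bytes_requested += n;
	return malloc(n);
}

static void *counting_realloc(void *p, size_t old_n, size_t new_n)
{
	(void)old_n;
	#pragma omp atomic
	gmp_bytes_requested += new_n;
	return realloc(p, new_n);
}

static void counting_free(void *p, size_t n)
{
	(void)n;	/* GMP passes the old size; libc free() doesn't need it */
	free(p);
}

void install_gmp_counters(void)
{
	mp_set_memory_functions(counting_alloc, counting_realloc, counting_free);
}

void report_gmp_usage(void)
{
	printf("GMP has requested %zu bytes so far\n", gmp_bytes_requested);
}

Comparing the counter against peak resident memory at the end of the sqrt stage would show whether the growth is memory GMP still holds or overhead kept by the allocator/runtime.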
On the other hand, the real issue is that it took 11 dependencies to find the factors. That should (almost) never happen and is surely a bug. I'll look at this more tomorrow, but you can try redoing both LA and sqrt with the latest commit to see if it still requires so many dependencies.
Thanks for the info. I'll rerun everything starting from filtering on latest and see how that goes
Ran Valgrind on latest. Seems like there are two main kinds of leaks:
1.
==21437== 84,948,160 bytes in 482,660 blocks are possibly lost in loss record 264 of 264
==21437== at 0x4838751: malloc (vg_replace_malloc.c:307)
==21437== by 0x63557C7: ___kmp_allocate (in /usr/lib64/libomp.so)
==21437== by 0x63DD825: __ompt_lw_taskteam_link(ompt_lw_taskteam_s*, kmp_info*, int, bool) (in /usr/lib64/libomp.so)
==21437== by 0x636DF07: __kmp_fork_call (in /usr/lib64/libomp.so)
==21437== by 0x63C8DD3: __kmp_GOMP_fork_call(ident*, int, unsigned int, unsigned int, void (*)(void*), void (*)(int*, int*, ...), int, ...) (in /usr/lib64/libomp.so)
==21437== by 0x63CE089: GOMP_parallel@@VERSION (in /usr/lib64/libomp.so)
==21437== by 0x430C84: mpz_poly_mul.constprop.0 (sqrt_a.c:170)
==21437== by 0x4314C0: multiply_relations (sqrt_a.c:356)
==21437== by 0x638F946: __kmp_invoke_task(int, kmp_task*, kmp_taskdata*) (in /usr/lib64/libomp.so)
==21437== by 0x638FB95: __kmpc_omp_task (in /usr/lib64/libomp.so)
==21437== by 0x63CDA94: GOMP_task@@VERSION (in /usr/lib64/libomp.so)
==21437== by 0x4303C9: multiply_relations._omp_fn.0 (sqrt_a.c:347)
2.
==21437== 30,783,104 bytes in 174,904 blocks are possibly lost in loss record 257 of 264
==21437== at 0x4838751: malloc (vg_replace_malloc.c:307)
==21437== by 0x63557C7: ___kmp_allocate (in /usr/lib64/libomp.so)
==21437== by 0x63DD825: __ompt_lw_taskteam_link(ompt_lw_taskteam_s*, kmp_info*, int, bool) (in /usr/lib64/libomp.so)
==21437== by 0x636DF07: __kmp_fork_call (in /usr/lib64/libomp.so)
==21437== by 0x63C8DD3: __kmp_GOMP_fork_call(ident*, int, unsigned int, unsigned int, void (*)(void*), void (*)(int*, int*, ...), int, ...) (in /usr/lib64/libomp.so)
==21437== by 0x63CE089: GOMP_parallel@@VERSION (in /usr/lib64/libomp.so)
==21437== by 0x4314A7: multiply_relations (sqrt_a.c:344)
==21437== by 0x638F946: __kmp_invoke_task(int, kmp_task*, kmp_taskdata*) (in /usr/lib64/libomp.so)
==21437== by 0x638FB95: __kmpc_omp_task (in /usr/lib64/libomp.so)
==21437== by 0x63CDA94: GOMP_task@@VERSION (in /usr/lib64/libomp.so)
==21437== by 0x4303C9: multiply_relations._omp_fn.0 (sqrt_a.c:347)
==21437== by 0x63CE0AD: GOMP_parallel@@VERSION (in /usr/lib64/libomp.so)
I'll be honest, that did not give me a clue as to where the leak came from...
LD_PRELOAD=/home/nyancat/bin/libhoard.so msieve -nc3 -i N.txt -v -t 6
didn't help. In fact, the memory usage became about 1.5x higher, which is odd.
I'll try linking it in during compilation after LA finishes
That didn't work either. Looks like Hoard can't save me this time.
I had much better luck on the new run, and it finished on dependency 1. This led to an... interesting discovery: the square root took 65 seconds on try_openmp and 80 seconds on master. Multithreading only won by about 20%, which makes me wonder whether we should just remove OMP from Square Root.
Okay... what? Removing the #pragma directives actually sped it up by another 10%? That's new.
I reran your postprocessing twice and another larger number once and got the factors on the first or second dependency each time, so reverting that commit fixed the corruption that was causing the need for many dependencies. Now I just have to figure out what is wrong with that commit...
Even with the cycle issue fixed, and assuming that each dependency has a 2/3 chance of yielding a non-trivial factor, there is still a (1/3)^5 = 1/243 chance that five dependencies in a row fail and the run exhausts 16 GB of memory. I guess a compromise could be to add a build/runtime option to enable/disable OpenMP specifically during Square Root; otherwise I risk triggering an OOM every time I run square root.
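For the runtime side of such an option, OpenMP's if clause might be enough. A sketch applied to the block quoted earlier, with an invented flag sqrt_use_omp (not an existing msieve option):

	/* Sketch only: sqrt_use_omp is an invented flag that could be set from
	 * a new command-line switch or environment variable. With if(0) the
	 * region runs on the encountering thread and the tasks are undeferred,
	 * so the recursion effectively runs serially; whether that also avoids
	 * the runtime's bookkeeping would still need to be measured. */
	else {
		uint32 mid = (index1 + index2) / 2;
		#pragma omp parallel if(sqrt_use_omp)
		#pragma omp single
		{
			#pragma omp task if(sqrt_use_omp)
			multiply_relations(prodinfo, index1, mid, &prod1);
			#pragma omp task if(sqrt_use_omp)
			multiply_relations(prodinfo, mid + 1, index2, &prod2);
			#pragma omp taskwait
		}
	}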
Another option would be to run msieve separately for each dependency until the number is fully factored.
./msieve -nc3 1,1 -t 4
If the number isn't fully factored, ./msieve -nc3 2,2 -t 4, and so on until the number is fully factored.
There is a consistent memory leak during the Square Root process, making the memory usage considerably higher than in the old version (160 MB on master vs. >8 GB on try_openmp). Reproduction can be as simple as
msieve -nc3 -i N.txt -v
Files for reproduction (2.0 GB): https://mega.nz/file/TkhGRbbL#2sotY_6x93Owxwpckoihdy0XuaYmcomkmD5YispwVNc
I will be running a git bisect at 9 AM on 2022-08-29 (GMT+8) if it isn't fixed by then.