Closed yurivict closed 5 years ago
Well, you know, he may be right. But it would certainly help if we knew the version of OpenBLAS that you are currently using, and a bit more about what erkale does. I guess erkale itself is multithreading and thus calling into OpenBLAS from multiple threads, which causes problems that I am currently trying to fix.
openblas-0.2.20_3,1 on FreeBSD. erkale experiences this problem when run with multithreading using OpenMP.
Could you try with the current develop branch ("soon" to be 0.3.4), please? Apart from a number of fixes, that one has a new compile-time parameter NUM_PARALLEL to reduce the risk of running out of unique thread pointers in what looks like your use case.
Alternatively, rebuild the package with OpenMP support, which should behave more moderately inside an OpenMP program and not try to spawn n^2 threads.
erkale is built with both OpenMP and OpenBLAS.
I'm trying to fix test failures in its parallel version.
The message "Because you tried to allocate too many memory regions." begs the question "How many regions were allocated?"
Also, with ever-increasing computing power, what does "too many" really mean? Why is this limitation imposed?
It is a fixed table of regions. 1-2 are consumed per parallel thread, and the table size built into the system package, derived from NUM_THREADS at build time, got exceeded. You can put in any value you prefer; a recent improvement was made in https://github.com/xianyi/OpenBLAS/pull/1858
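To make the "fixed table" idea concrete, here is a minimal Python sketch. It is only a toy model, not OpenBLAS code: the class name RegionTable, the function demo and the error text are hypothetical, and the factor of 2 slots per thread is simply the upper end of the "1-2 per thread" figure quoted above.

```python
class RegionTable:
    """Toy model of a fixed-size table of memory regions (not OpenBLAS code)."""

    def __init__(self, num_threads, slots_per_thread=2):
        # Capacity is frozen when the table is created, much like a value
        # derived from NUM_THREADS at build time.
        self.slots = [None] * (num_threads * slots_per_thread)

    def alloc(self, owner):
        # Linear scan for the first free slot.
        for i, used_by in enumerate(self.slots):
            if used_by is None:
                self.slots[i] = owner
                return i
        raise MemoryError("too many memory regions")

    def free(self, index):
        self.slots[index] = None


def demo():
    table = RegionTable(num_threads=8)            # 8 threads -> 16 slots
    held = [table.alloc(f"t{i}") for i in range(16)]
    try:
        table.alloc("t16")                        # 17th request has no slot left
        overflowed = False
    except MemoryError:
        overflowed = True
    table.free(held[0])
    reused = table.alloc("t16")                   # works again after a release
    return overflowed, reused
```

The point of the sketch is that the failure depends only on how many regions are held at once, not on how much memory the machine has.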
Why can't you reallocate it dynamically when exceeded instead of fixing it once and for good during build?
This limit is directly related to the NUM_THREADS parameter set at build time (which defaults to the number of cores detected on the build host). There has been a recent attempt to rewrite the memory allocation logic (that dates back to K. Goto's original libGotoBLAS of 10+ years ago) using thread-local storage. Unfortunately the reimplementation met a number of unexpected corner cases and it is unclear if it is safe to use as the default in its current state. See option USE_TLS in current develop.
Indeed, the OpenBLAS FreeBSD port would be built with only 16 malloc slots. https://svnweb.freebsd.org/ports/head/math/openblas/Makefile?view=markup#l61 The PR above is meant to plainly mask such cases so that the ages-old limitation does not hurt every other user. I'd recommend changing the package makefile to something like 64 threads (128 slots), and using OpenMP, since you wrap the library in OpenMP and an OMP-built OpenBLAS reduces its threading if called in an OMP parallel section.
I'll change the limit in the port for now.
But no matter what value the limit is set to, this problem will come back, because the number of threads shouldn't, even in theory, be tied to the number of CPUs in general (threads can be half-idle, for example). This needs to be solved.
@yurivict I think you got the message to the right ears.
You are wrong about the number of threads. The constraining resource here is CPU cache: OpenBLAS (or MKL, for that matter) operates on a limited amount of data that fits in the L1d/L2/L3/L4 caches. Obviously, if 2 threads of this kind meet on the same core, they get 10-20x slower accesses from main memory and performance goes down 10x. What is wrong here is that the number of memory buffers is compiled in and bound to the build host's CPUs, which hurts people oversubscribing CPU cores (I mean caches).
You assume that all threads are CPU-intense. But some threads might be idle. Some might work on separate data sets while using only 10% of CPU each. Some people create threads per connection, etc. All sorts of use models can take place.
That 10% would break the computation kernels' assumption that the cache is for their exclusive use, and both compute kernels on the same core will slow down N times, more than just by half as with normal compiler-emitted code. The aim is plainly to get results out faster, not to have 100% CPU usage in "top" or to maximize CPU temperatures.
Changing NUM_THREADS to 64 didn't fix all of erkale's test failures. Some of them still fail with the same message.
@yurivict while at it, can you push #1785, which reduces the swarm of unproductive locks (syscalls) per BLAS call that hurts a lot on systems with a high core count? (It is old code, but it must be rebased for the old version because of recent renumberings in that particular file.)
@yurivict do they (tests) pass with OPENBLAS_NUM_THREADS=1 and/or with OpenMP OpenBLAS? EDIT: yeah, I know it may still hurt casual users, but there is no easy way out with the current code. How many cores does the build system have? It will spin up that many squared threads if the program uses OMP and OpenBLAS then uses pthreads inside.
do they (tests) pass with OPENBLAS_NUM_THREADS=1 and/or with OpenMP OpenBLAS?
They still fail.
Summary:
The change to NUM_THREADS=64 in the port didn't help, OPENBLAS_NUM_THREADS=1 also doesn't help, gotoblas fails the same way when used instead of OpenBLAS.
What helped: change to liblapack.so/libblas.so/libcblas.so. Tests pass with this implementation.
Testcase: The Erkale quantum chemistry project (https://github.com/susilehtola/erkale) built with -DUSE_OPENMP=ON. ctest tests fail when linked with OpenBLAS.
You mean openblas.so fails completely? Or did you have to direct .BLAS, .cblas and .lapack all to OpenBLAS at once? Do you have any log of the failure, so I can repeat it at "home"?
openblas.so fails completely. Replacing it with the .BLAS/.cblas/.lapack combination allows the process to succeed.
It triggers the error shown above, and the processes crash.
The log?
2: Test command: /usr/ports/science/erkale/work-parallel/.build/src/test/basictests_omp
2: Test timeout computed to be: 10000000
2: Indices OK.
2: Solid harmonics OK.
2: Checkpointing OK.
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! : 2 0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Program is
2/2 Test #2: basictests .......................***Exception: SegFault 0.96 sec
The following tests passed:
build_basictests
50% tests passed, 1 tests failed out of 2
Total Test time (real) = 1.82 sec
The following tests FAILED:
2 - basictests (SEGFAULT)
Errors while running CTest
Is it the log from: 1/ OPENBLAS_NUM_THREADS=1 (or 2) 2/ USE_OPENMP=1? Or do you just knowingly run N threads on each of N CPUs?
HOW MANY CPU CORES ARE THERE IN THE BUILD MACHINE?
HOW MANY CPU CORES ARE THERE IN THE BUILD MACHINE?
4 cores, 8 virtual CPUs.
USE_OPENMP=1 is used in the erkale project.
I watched how it runs, tests run with 8 threads. It might be that it runs more threads for a short period of time.
But again, this should be up to project authors how to allocate threads.
I will try to get something out of Linux and erkale.
If all tests run in the same program continuously: some uninitialized values have been fixed, so it may be worth waiting for 0.3.4 instead of rushing to 0.3.3. Messages about bad deallocation also mean that some alloc/free was not paired properly, i.e. a memory leak, which is also worth checking against a later or the latest version.
Which project is to blame for allocating threads? So far I see just slight misconfiguration, and probably old version.
@yurivict did not observe anything weird on CentOS7, using all system libraries and all builtin libraries.
128 buffers stemming from 64 threads should be able to house 8 times 8 threads for the supposed worst case where you get the pthread library in place of the others. What OpenBLAS says in the test logs is that something went wrong with memory allocations. These are alloc/free workalikes, but using stack, shmget etc., and storing pointers in a fixed-size array.
I do not recall anything specifically being done in this regard recently; maybe @martin-frbg can recall if something as serious as leading to a memory leak was fixed. I suggest trying the OMP build of 0.3.3. Maybe the issue is fixed there, but I am not certain.
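The arithmetic behind that statement can be spelled out as a small Python sketch. It is a back-of-the-envelope model only: the function names are hypothetical, the factor of 2 buffers per build-time thread is taken from the "64 threads (128 slots)" figure in this thread, and per_thread stands for the "1-2 per thread" consumption mentioned earlier.

```python
def buffers_available(num_threads_at_build, slots_per_thread=2):
    # 64 build-time threads -> 128 slots, per the figures quoted in this thread.
    return num_threads_at_build * slots_per_thread


def buffers_needed(omp_threads, blas_threads, per_thread=1):
    # Worst case: every OpenMP thread calls into a pthreads-based OpenBLAS,
    # which then runs its own set of worker threads.
    return omp_threads * blas_threads * per_thread


def fits(omp_threads, blas_threads, num_threads_at_build, per_thread=1):
    needed = buffers_needed(omp_threads, blas_threads, per_thread)
    return needed <= buffers_available(num_threads_at_build)
```

Under these assumptions, fits(8, 8, 64) holds even with two buffers per thread, while fits(8, 8, 8), the 16-slot case, does not. That is why failures after raising the limit to 64 point at something other than plain oversubscription.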
It will be very hard to narrow down, as the memory leak (in form of forgetting pointer) happened before the errors about no more pointer slots were thrown.
Another possibility is that something under #ifdef FREEBSD is wrong.
I do not recall anything specifically being done in this regard recently
This isn't known to be a regression. Erkale is a fairly new port, and parallel tests weren't run before on FreeBSD.
I have not gotten around to building Erkale yet, but it appears to me that the unallocation errors occur only after the "too many memory regions" error messages, at which point all bets are probably off already. In any case it will probably not make sense to try to fix this retroactively in 0.2.20 - either try 0.3.3 or (better) current develop to hopefully get this fixed in time for 0.3.4
@yurivict can you try to build 0.3.3? The following changes are needed in the FreeBSD makefile: 1/ new tarball with new checksums (sort of obvious) 2/ take out USE_TLS=1 from Makefile.rule 3/ add DYNAMIC_OLDER=1 next to DYNAMIC_ARCH=1 for amd64/x86_64. Though it is better to wait for 0.3.4, which does not need 2/ and has some significant bugfixes too.
Got it built now. No errors seen in basictest.omp with current develop on i8700k (6 cores/12 threads) under Linux, no complaints from valgrind either.
I got none from 0.2.20 fetched by the erkale distribution or from the 0.3.3 openSUSE 15.0 packages, so I think we can totally exclude Linux from the picture for now. I think I am able to try the FreeBSD port update locally. Actually, deallocating 0x0 looks like another bug: it should back off when the preceding allocation returned 0x0.
Freeing memory is not wrapped with locks in memory.c, unlike alloc? ... In principle it can be raced so that all memory slots appear full when there is no real problem, e.g. with small unfused functions, whose numbers have already been reduced over time (I mean in 0.3.4), so maybe the problem does not show up in the current version.
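The two ideas raised here (take the same lock in free as in alloc, and back off instead of "freeing" a failed allocation) can be illustrated with a short Python sketch. This is a hypothetical model, not the code in memory.c: LockedRegionTable and stress are invented names, and Python threads only approximate what happens with pthreads in C.

```python
import threading


class LockedRegionTable:
    """Toy slot table where alloc and free share one lock (not OpenBLAS code)."""

    def __init__(self, size):
        self.slots = [False] * size
        self.lock = threading.Lock()

    def alloc(self):
        with self.lock:
            for i, used in enumerate(self.slots):
                if not used:
                    self.slots[i] = True
                    return i
        return None  # table full: report failure instead of terminating

    def free(self, index):
        if index is None:
            return   # back off: nothing was allocated, so nothing to release
        with self.lock:
            self.slots[index] = False


def stress(threads=8, rounds=1000, size=8):
    table = LockedRegionTable(size)
    failures = []

    def worker():
        for _ in range(rounds):
            idx = table.alloc()
            if idx is None:
                failures.append(1)
            table.free(idx)

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    # Number of failed allocations, and number of slots still marked used.
    return len(failures), sum(table.slots)
```

With as many slots as workers and each worker holding at most one slot, no allocation should fail and no slot should stay occupied at the end. Whether a single lock/unlock pair is enough in the real allocator is exactly what would need testing.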
[FreeBSD] gcc is not required for OpenBLAS, it could be built with system clang + port flang (just to blame somebody else for waiting for gcc build to complete), at least clang is used by android and iOS/OSX, thus quite well tested.
@martin-frbg I will get gprof output from that particular test suite. Maybe some fuse is needed in addition to wrapping the race before 0.3.4, but I will first test whether one lock/unlock pair is enough to fix the affected 0.2.20, then come up with a PR. Linux could probably be raced similarly and repeatably, but you need scarcer buffers first.
@yurivict very strange - does not fail in a VM, and 4 threads observable like 8 by you.
I see a problem: 2 OMP libraries get linked into erkale. One is gomp, in OpenBLAS via gcc; the other is libomp, via clang. This probably leads to N*N threads in tests. Could you try to compile erkale forcing g++ instead?
Tests SEGV for some reason when built with gcc, with both OpenBlas and Netlib lapack implementations.
Arrrghhhh. Should not be THAT bad.
I suggest the following improvements to the port:
1/ Try this, it has a chance to fix the crash: make gfortran-emitted code thread safe, #1857 (check netlib too, it needs the same in order to be called from threads). E.g. erkale with OMP could use a single-threaded version built this way. Otherwise there is a problem in gfortran-emitted code, somewhere debuggers do not have a grip.
2/ Please push to the port: add MAKE_NB_JOBS=-1, which basically obeys the parent make -jX, described in Makefile.rule in more detail. That's not related to the current issue.
btw the flang port is broken: it shows its version, but gives "error: unable to execute command: Executable flang1 doesn't exist!" when compiling anything.
Make gfortran-emitted code thread safe : #1857
Does this simply mean that I need to add FFLAGS=-frecursive?
btw flang port is broken , it shows version, but error: unable to execute command: Executable flang1 doesn't exist! when compiling anything.
I created the PR for this, thank you for reporting.
-frecursive has to be applied to gfortran only; flang luckily lets it through with a warning. So at first sight yup, just add it, but don't forget to take it out later.
FCOMMON_OPT=-frecursive: the OpenBLAS build system combines FFLAGS later for lapack, including this along with -O2.
EDIT: I think without spaces if in the same command line.
OK, thanks, I'll add this to the OpenBLAS port.
Flang wasn't widely adopted because it fails to compile a lot of projects, and also it is amd64-only, doesn't work on any other platforms.
I'll apply this patch to the port:
@@ -59,7 +59,7 @@
.endif
MAXTHREADS?= 64
-BUILDFLAGS_THREAD+= NUM_THREADS=${MAXTHREADS} USE_THREAD=1
+BUILDFLAGS_THREAD+= NUM_THREADS=${MAXTHREADS} FCOMMON_OPT=-frecursive MAKE_NB_JOBS=-1 USE_THREAD=1
.if ${ARCH:M*64} == ""
BUILDFLAGS+= BINARY32=1
Looks OK
FYI Build doesn't utilize all CPUs, it runs a lot of small compilation jobs on one CPU, sequentially. (This isn't related to the current issue.)
Parts of the build process are serialized to avoid races - GNU make is not very sophisticated in this regard.
BLAS part does not have inter-dependencies, so you can get 100+ cores utilized for few seconds for each CPU generation, but serialized parts (ar) in between.
The patch has been committed to the FreeBSD port (math/openblas).
Both options should go to the non-threaded version too. -fopenmp would imply -frecursive, but the single-threaded version will have unsafe Fortran function representations that cannot work from C/C++ pthreads or OpenMP: local temporary arrays of sufficient size (above roughly 32-64k) would be allocated in the global heap shared between threads without any arbitration whatsoever, leading at the very least to numeric failures. I have got the same crash with g++, backtracing to something like main -> read_config -> assert, not yet involving any BLAS.
Ok BLAS imports (probably some of L1 is masked by gsl cblas macros)
All BLAS have thread limits, it is a performance issue for particular functions for small inputs, not crasher or something
There are some dangerous LAPACK functions getting imported, mandating -frecursive
BLAS L1
ddot_
BLAS L2
dgemv_ zgemv_
BLAS L3
dgemm_ zgemm_
dsyrk_
zherk_
LAPACK THREADSAFE
ilaenv_
dgetrf_
dgetri_
LAPACK needing -frecursive
dgesv_
dgelsd_
dgels_
dgesvx_
Let me summarize: 1/ GOMP and clang OMP are not very friendly to each other (probably they emit different pthread lock IDs, but from the same toplevel functions, leading to locks not working right at all) 2/ g++ leads to early crashes 3/ lapack functions that were not thread-safe before -frecursive are present
I think for now it is best to import the pthread version in all circumstances in serial programs, and the single-threaded version, safeguarded with -frecursive, in threaded ones, and to keep the GOMP version in the basement for programs that do not crash when built with the GCC world (as an option disabled by default, for example).
The only dangers are performance-related, i.e. an OMP program imports the threaded version and gets N^2 threads, which can be brought under control with variables, or a single-threaded program imports the single-threaded version, still faster than netlib, but with big space for improvement.
Improvements gained towards 0.3.4:
I see now that OPENMP isn't a default option in the port, so the changes that I made only apply to the non-default case. I'll move them to apply to all versions.
@martin-frbg dgetrf, zherk and dsyrk are not guarded against early threading.
Ignore me if I don't produce PRs today.
@yurivict
The following changes will need to go into the port for 0.3.4, on top of what we did here:
DYNAMIC_OLDER=1 should supplement amd64/x86_64 DYNAMIC_ARCH=1
No more need for -frecursive, it is now in the right place.
There will be AVX-512 (Skylake-X) support; both FreeBSD clang and gcc can compile it, so a new option for that(?).
In principle it builds with clang+flang too (once the latter works), if you want to experiment on the other side of the OMP world, but that is not required at all.
I think we cannot improve anything here, but feel free to report if you stumble on anything similarly weird.
I used openblas for blas/lapack functions in the erkale project, and it fails. erkale's author says that openblas is broken, see https://github.com/susilehtola/erkale/issues/29#issuecomment-441006738