OpenMathLib / OpenBLAS

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
http://www.openblas.net
BSD 3-Clause "New" or "Revised" License

BLAS : Program is Terminated. Because you tried to allocate too many memory regions. #1882

Closed yurivict closed 5 years ago

yurivict commented 5 years ago

I used openblas for blas/lapack functions in the erkale project, and it fails. erkale's author says that openblas is broken, see https://github.com/susilehtola/erkale/issues/29#issuecomment-441006738

martin-frbg commented 5 years ago

Well, you know, he may be right. But it would certainly help if we knew the version of OpenBLAS that you are currently using, and a bit more about what erkale does. I guess erkale itself is multithreading and thus calling into OpenBLAS from multiple threads, which causes problems that I am currently trying to fix.

yurivict commented 5 years ago

openblas-0.2.20_3,1 on FreeBSD. erkale experiences this problem when run with multithreading using OpenMP.

martin-frbg commented 5 years ago

Could you try with the current develop branch ("soon" to be 0.3.4), please? Apart from a number of fixes, it has a new compile-time parameter, NUM_PARALLEL, to reduce the risk of running out of unique thread pointers in what looks like your use case.

brada4 commented 5 years ago

Alternatively, rebuild the package with OpenMP; that build should be more moderate inside an OpenMP program and not try to spawn n^2 threads.

yurivict commented 5 years ago

erkale is built with both OpenMP and OpenBLAS. I'm trying to fix test failures in its parallel version.

yurivict commented 5 years ago

The message

Because you tried to allocate too many memory regions.

begs the question "How many regions were allocated?"

Also with the ever increasing computing power, what does "too many" really mean? Why is this limitation imposed?

brada4 commented 5 years ago

It is a fixed table of regions; 1-2 are consumed per parallel thread. The table size built into the system package, derived from NUM_THREADS at build time, got exceeded. A recent improvement lets you throw in any value you prefer: https://github.com/xianyi/OpenBLAS/pull/1858

yurivict commented 5 years ago

Why can't you reallocate it dynamically when exceeded instead of fixing it once and for good during build?

martin-frbg commented 5 years ago

This limit is directly related to the NUM_THREADS parameter set at build time (which defaults to the number of cores detected on the build host ). There has been a recent attempt to rewrite the memory allocation logic (that dates back to K. Goto's original libGotoBLAS of 10+ years ago) using thread-local storage. Unfortunately the reimplementation met a number of unexpected corner cases and it is unclear if it is safe to use as the default in its current state. See option USE_TLS in current develop.

brada4 commented 5 years ago

Indeed, the OpenBLAS FreeBSD port is built with only 16 malloc slots: https://svnweb.freebsd.org/ports/head/math/openblas/Makefile?view=markup#l61 The PR above is meant to simply mask such cases, so that the age-old limitation does not hurt every other user. I'd recommend changing the package Makefile to something like 64 threads (128 slots) and using the OpenMP build, since you wrap the library in OpenMP and an OpenMP build of OpenBLAS reduces its own threading when called from inside an OMP parallel section.
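
The fixed-table behaviour discussed in the comments above can be modelled in a few lines. This is only an illustrative Python sketch, not OpenBLAS code: the names `SlotTable` and `slots_exhausted` are hypothetical, and the factor of two slots per configured thread is an assumption taken from the "64 threads (128 slots)" figure quoted in this thread.

```python
class SlotTable:
    """Toy model of a buffer table whose size is fixed when the library is built."""

    def __init__(self, num_threads, slots_per_thread=2):
        # Assumption from the discussion: two slots per configured thread.
        self.capacity = num_threads * slots_per_thread
        self.in_use = set()

    def allocate(self):
        # Hand out the lowest free slot, or fail when the table is full.
        for slot in range(self.capacity):
            if slot not in self.in_use:
                self.in_use.add(slot)
                return slot
        raise MemoryError("too many memory regions")

    def release(self, slot):
        self.in_use.remove(slot)


def slots_exhausted(num_threads_at_build, buffers_requested):
    """Return True if requesting that many buffers at once overflows the table."""
    table = SlotTable(num_threads_at_build)
    try:
        for _ in range(buffers_requested):
            table.allocate()
    except MemoryError:
        return True
    return False


if __name__ == "__main__":
    print(slots_exhausted(8, 64))    # table of 16 slots, 64 buffers requested
    print(slots_exhausted(64, 64))   # table of 128 slots, 64 buffers requested
```

In this model the table never grows, so the only remedies are the ones named in the thread: build with a larger NUM_THREADS or keep fewer buffers in flight at once.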

yurivict commented 5 years ago

I'll change the limit in the port for now.

But no matter what value the limit is set to, this problem will come back, because the number of threads shouldn't, even in theory, be tied to the number of CPUs in general (threads can be half-idle, for example). This needs to be solved.

brada4 commented 5 years ago

@yurivict I think you got the message to the right ears.

You are wrong about the number of threads. The constraining resource here is CPU cache: OpenBLAS (or MKL, for that matter) operates on a limited amount of data that fits in the L1d/L2/L3/L4 caches. Obviously, if two such threads meet on the same core, they end up with 10-20x slower accesses to main memory and performance drops 10x. What is wrong here is that the number of memory buffers is compiled in and bound to the build machine's CPU count, which hurts people oversubscribing CPU cores (I mean caches).

yurivict commented 5 years ago

You assume that all threads are CPU-intense. But some threads might be idle. Some might work on separate data sets while using only 10% of CPU each. Some people create threads per connection, etc. All sorts of use models can take place.

brada4 commented 5 years ago

That 10% would break the compute kernels' assumption that the cache is for their exclusive use, and two compute kernels on the same core will slow down N times more than the factor of two you would see with normal compiler-emitted code. The aim is simply to get results out faster, not to show 100% CPU usage in "top" or to maximize CPU temperatures.

yurivict commented 5 years ago

Change of NUM_THREADS to 64 didn't fix all erkale's test failures. Some of them still fail with the same message.

brada4 commented 5 years ago

@yurivict while at it, can you push #1785, which reduces the swarm of unproductive locks (syscalls) on each BLAS call that hurts a lot on systems with many cores? (It is old code, but it must be rebased for the old version because of recent renumberings in that particular file.)

brada4 commented 5 years ago

@yurivict do they (the tests) pass with OPENBLAS_NUM_THREADS=1 and/or with an OpenMP build of OpenBLAS? EDIT: yeah, I know it may still hurt casual users, but there is no easy fix with the current code. How many cores does the build system have? It will spin up that many threads squared if the program uses OMP and OpenBLAS then uses pthreads inside.
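
One way to run such an experiment reproducibly is to launch the test command from a small wrapper that sets the thread caps in the child's environment. The sketch below assumes only that the environment variables OPENBLAS_NUM_THREADS and OMP_NUM_THREADS are honoured by the libraries involved; the helper name `run_with_thread_caps` is hypothetical, and the child process here is a stand-in that merely echoes the variable rather than the real test binary.

```python
import os
import subprocess
import sys


def run_with_thread_caps(command, blas_threads=1, omp_threads=None):
    """Run `command` with OPENBLAS_NUM_THREADS (and optionally OMP_NUM_THREADS) set."""
    env = dict(os.environ)
    env["OPENBLAS_NUM_THREADS"] = str(blas_threads)
    if omp_threads is not None:
        env["OMP_NUM_THREADS"] = str(omp_threads)
    return subprocess.run(command, env=env, capture_output=True, text=True)


if __name__ == "__main__":
    # Stand-in child that only echoes the variable it received; in practice
    # `command` would be the test binary or a ctest invocation.
    child = [sys.executable, "-c",
             "import os; print(os.environ['OPENBLAS_NUM_THREADS'])"]
    result = run_with_thread_caps(child, blas_threads=1, omp_threads=8)
    print(result.stdout.strip())
```

Setting the variables only for the child keeps the experiment isolated from the rest of the shell session.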

yurivict commented 5 years ago

do they (tests) pass with OPENBLAS_NUM_THREADS=1 and/or with OpenMP OpenBLAS?

They still fail.

yurivict commented 5 years ago

Summary:

The change to NUM_THREADS=64 in the port didn't help, OPENBLAS_NUM_THREADS=1 also doesn't help, gotoblas fails the same way when used instead of OpenBlas.

What helped: change to liblapack.so/libblas.so/libcblas.so. Tests pass with this implementation.

Testcase: The Erkale quantum chemistry project (https://github.com/susilehtola/erkale) built with -DUSE_OPENMP=ON. ctest tests fail when linked with OpenBlas.

brada4 commented 5 years ago

You mean openblas.so fails completely? Or did you have to direct .BLAS, .cblas and .lapack all to OpenBLAS at once? Do you have a log of the failure, so that I can repeat it at "home"?

yurivict commented 5 years ago

openblas.so fails completely. Replacing it with .BLAS/.cblas/.lapack combination allows the process to succeed.

It triggers the error shown above, and the processes crash.

brada4 commented 5 years ago

The log?

yurivict commented 5 years ago

2: Test command: /usr/ports/science/erkale/work-parallel/.build/src/test/basictests_omp
2: Test timeout computed to be: 10000000
2: Indices OK.
2: Solid harmonics OK.
2: Checkpointing OK.
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Bad memory unallocation! :    2  0x0
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
2: BLAS : Program is 
2/2 Test #2: basictests .......................***Exception: SegFault  0.96 sec

The following tests passed:
    build_basictests

50% tests passed, 1 tests failed out of 2

Total Test time (real) =   1.82 sec

The following tests FAILED:
      2 - basictests (SEGFAULT)
Errors while running CTest

brada4 commented 5 years ago

Is the log from:
1/ OPENBLAS_NUM_THREADS=1 (or 2)
2/ USE_OPENMP=1
Or do you just knowingly run N threads on each of N CPUs?

HOW MANY CPU CORES ARE THERE IN THE BUILD MACHINE?

yurivict commented 5 years ago

HOW MANY CPU CORES ARE THERE IN THE BUILD MACHINE?

4 cores, 8 virtual CPUs.

USE_OPENMP=1 is used in the erkale project. I watched how it runs, tests run with 8 threads. It might be that it runs more threads for a short period of time. But again, this should be up to project authors how to allocate threads.

brada4 commented 5 years ago

I will try to get something out of Linux and erkale.

If all tests run continuously in the same program: some uninitialized values have been fixed, so it is probably worth waiting for 0.3.4 instead of rushing to 0.3.3. The messages about bad deallocation also mean that some alloc/free was not paired properly, i.e. a memory leak, which is also worth checking against a later (or the latest) version.

Which project is to blame for allocating threads? So far I see just slight misconfiguration, and probably old version.

brada4 commented 5 years ago

@yurivict did not observe anything weird on CentOS7, using all system libraries and all builtin libraries.

128 buffers stemming from 64 threads should be able to house 8 times 8 threads in the supposedly worst case where you get the pthread library in place of the others. What OpenBLAS is saying in the test logs is that something went wrong with its memory allocations; these are alloc/free workalikes, but they use the stack, shmget etc. and store their pointers in a fixed-size array.
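
The arithmetic behind "8 times 8 threads" can be written out explicitly. This is a back-of-the-envelope sketch under two assumptions that come from this thread rather than from the OpenBLAS source: each thread needs at most two buffers, and the table holds two slots per thread configured at build time. All function names are hypothetical.

```python
def worst_case_threads(outer_threads, inner_threads):
    """Threads alive if every outer thread starts a full inner pool."""
    return outer_threads * inner_threads


def buffers_in_table(num_threads_at_build, slots_per_thread=2):
    """Table size under the 'two slots per configured thread' assumption."""
    return num_threads_at_build * slots_per_thread


def fits(outer_threads, inner_threads, num_threads_at_build, buffers_per_thread=2):
    """True if the worst case still fits into the compiled-in table."""
    needed = worst_case_threads(outer_threads, inner_threads) * buffers_per_thread
    return needed <= buffers_in_table(num_threads_at_build)


if __name__ == "__main__":
    print(worst_case_threads(8, 8))  # 8 OpenMP threads, each with 8 BLAS threads
    print(buffers_in_table(64))      # table built with NUM_THREADS=64
    print(fits(8, 8, 64))            # does the nested case fit with 64?
    print(fits(8, 8, 8))             # and with a build configured for 8?
```

Under these assumptions the nested 8x8 case only just fits a NUM_THREADS=64 build, which is consistent with the suspicion that the failures seen after raising the limit point to a leak rather than to plain oversubscription.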

I do not recall anything specifically being done in this regard recently, maybe @martin-frbg can recall if something as serious as leading to mem leak was fixed. I suggest to try OMP 0.3.3 , maybe issue is fixed there, but not certain.

It will be very hard to narrow down, as the memory leak (in form of forgetting pointer) happened before the errors about no more pointer slots were thrown.

Another possibility is that something under #ifdef FREEBSD is wrong.

yurivict commented 5 years ago

I do not recall anything specifically being done in this regard recently

This isn't known to be a regression. Erkale is a fairly new port, and parallel tests weren't run before on FreeBSD.

martin-frbg commented 5 years ago

I have not gotten around to building Erkale yet, but it appears to me that the unallocation errors occur only after the "too many memory regions" error messages, at which point all bets are probably off already. In any case it will probably not make sense to try to fix this retroactively in 0.2.20 - either try 0.3.3 or (better) current develop to hopefully get this fixed in time for 0.3.4

brada4 commented 5 years ago

@yurivict can you try to build 0.3.3? The following changes are needed in the FreeBSD Makefile:
1/ new tarball with new checksums (sort of obvious)
2/ take out USE_TLS=1 from Makefile.rule
3/ add DYNAMIC_OLDER=1 next to DYNAMIC_ARCH=1 for amd64/x86_64
Though it is better to wait for 0.3.4, which does not need /2/ and has some significant bugfixes too.

martin-frbg commented 5 years ago

Got it built now. No errors seen in basictest.omp with current develop on i8700k (6cores/12 threads) under Linux, no complaints from valgrind either.

brada4 commented 5 years ago

I got none from the 0.2.20 fetched by the erkale distribution, nor from the 0.3.3 openSUSE 15.0 packages, so I think we can exclude Linux from the picture for now. I think I am able to try the FreeBSD port update locally. Actually, deallocating 0x0 looks like a separate bug: it should back off when the preceding allocation returned 0x0.

brada4 commented 5 years ago

Freeing memory is not wrapped with locks in memory.c, unlike alloc? ... In principle it can be raced into seeing all memory slots as full when there is no real problem, e.g. with small unfused functions, whose numbers have already been reduced over time (I mean in 0.3.4), so maybe the problem does not show up in the current version.
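
The locking pattern suggested here, taking the same lock on both the allocate and the release path, looks roughly like the following. This is a generic Python sketch and does not reproduce memory.c; `LockedSlotTable` and `stress` are hypothetical names, and because of Python's global interpreter lock the example shows the pattern rather than reproducing the race itself.

```python
import threading


class LockedSlotTable:
    """Fixed-size slot table where allocate and release share one lock."""

    def __init__(self, capacity):
        self._lock = threading.Lock()
        self._used = [False] * capacity

    def allocate(self):
        # Scan and mark under the lock so no other thread sees a half-updated table.
        with self._lock:
            for index, used in enumerate(self._used):
                if not used:
                    self._used[index] = True
                    return index
        return None  # table full

    def release(self, slot):
        # Release takes the same lock, so a concurrent scan cannot miss the freed slot.
        with self._lock:
            self._used[slot] = False


def stress(table, workers=8, rounds=1000):
    """Have each worker allocate and release one slot repeatedly; count failures."""
    failures = []

    def worker():
        for _ in range(rounds):
            slot = table.allocate()
            if slot is None:
                failures.append(1)
            else:
                table.release(slot)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return len(failures)


if __name__ == "__main__":
    print(stress(LockedSlotTable(capacity=8), workers=8))
```

With as many slots as workers, and each worker holding at most one slot at a time, a consistently locked table should never report itself full.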

[FreeBSD] gcc is not required for OpenBLAS, it could be built with system clang + port flang (just to blame somebody else for waiting for gcc build to complete), at least clang is used by android and iOS/OSX, thus quite well tested.

@martin-frbg I will get a gprof profile from the particular test suite; maybe some fuse is needed in addition to wrapping the race before 0.3.4, but I will first test whether one lock/unlock pair is enough to fix the affected 0.2.20, then come up with a PR. Probably Linux could be raced similarly and repeatably, but you need scarcer buffers first.

brada4 commented 5 years ago

@yurivict very strange - does not fail in a VM, and 4 threads observable like 8 by you.

I see a problem that 2 OMP libraries get linked in erkale - one is gomp in OpenBLAS via gcc other is libomp via clang , probably leading to N*N threads in tests. Could you try to compile erkale forcing g++ instead?

yurivict commented 5 years ago

Tests SEGV for some reason when built with gcc, with both OpenBlas and Netlib lapack implementations.

brada4 commented 5 years ago

Arrrghhhh. Should not be THAT bad.

I suggest following improvements to the port:

1 try this, it has a chance to fix the crash / Make gfortran-emitted code thread safe: #1857 (check netlib too, it needs the same to be callable from threads), e.g. erkale OMP could use a single-threaded version made this way. Otherwise there is a problem in gfortran-emitted code, somewhere debuggers do not have a grip.

2 please push to the port / add MAKE_NB_JOBS=-1 to basically obey the parent make -jX, described in Makefile.rule in more detail. That's not related to the current issue.

btw flang port is broken , it shows version, but error: unable to execute command: Executable flang1 doesn't exist! when compiling anything.

yurivict commented 5 years ago

Make gfortran-emitted code thread safe : #1857

Does this simply mean that I need to add FFLAGS=-frecursive?

btw flang port is broken , it shows version, but error: unable to execute command: Executable flang1 doesn't exist! when compiling anything.

I created the PR for this, thank you for reporting.

brada4 commented 5 years ago

-frecursive has to be applied to gfortran only; flang luckily lets it through with a warning. So at first sight yup, just add it, but don't forget to take it out later.

brada4 commented 5 years ago

FCOMMON_OPT=-frecursive , OpenBLAS build system combines FFLAGS later for lapack, including this along with -O2 EDIT I think without spaces if in same command line

yurivict commented 5 years ago

OK, thanks, I'll add this to the OpenBLAS port.


Flang wasn't widely adopted because it fails to compile a lot of projects, and also it is amd64-only, doesn't work on any other platforms.

yurivict commented 5 years ago

I'll apply this patch to the port:

@@ -59,7 +59,7 @@
 .endif

 MAXTHREADS?=   64
-BUILDFLAGS_THREAD+=    NUM_THREADS=${MAXTHREADS} USE_THREAD=1
+BUILDFLAGS_THREAD+=    NUM_THREADS=${MAXTHREADS} FCOMMON_OPT=-frecursive MAKE_NB_JOBS=-1 USE_THREAD=1

 .if ${ARCH:M*64} == ""
 BUILDFLAGS+=   BINARY32=1

brada4 commented 5 years ago

Looks OK

yurivict commented 5 years ago

FYI Build doesn't utilize all CPUs, it runs a lot of small compilation jobs on one CPU, sequentially. (This isn't related to the current issue.)

martin-frbg commented 5 years ago

Parts of the build process are serialized to avoid races - GNU make is not very sophisticated in this regard.

brada4 commented 5 years ago

BLAS part does not have inter-dependencies, so you can get 100+ cores utilized for few seconds for each CPU generation, but serialized parts (ar) in between.

yurivict commented 5 years ago

The patch has been committed to the FreeBSD port (math/openblas).

brada4 commented 5 years ago

Both options should go into the non-threaded version too. -fopenmp implies -frecursive, but the single-threaded version will otherwise contain unsafe Fortran functions that cannot work when called from C/C++ pthreads or OpenMP: local temporary arrays of sufficient size (like >32-64k) would be allocated in global memory shared between threads without any arbitration whatsoever, leading at the very least to numeric failures. I have got the same crash with g++, backtracing to something like main -> read_config -> assert, not yet involving any BLAS.
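
Why a shared work array is dangerous can be shown with a deterministic toy model. The sketch below is an analogy in Python, not Fortran and not LAPACK code: generators stand in for threads, each `yield` marks a point where the other "thread" gets to run, and all names are hypothetical. `_STATIC_WORK` plays the role of a large local array placed in shared storage, while the per-call list plays the role of a stack array as produced with -frecursive.

```python
_STATIC_WORK = [0.0] * 4  # one buffer shared by every call


def sum_with_static_work(values):
    """Copy values into the shared buffer, then sum the buffer."""
    for index, value in enumerate(values):
        _STATIC_WORK[index] = value
        yield  # another caller may run here and overwrite the buffer
    return sum(_STATIC_WORK[:len(values)])


def sum_with_local_work(values):
    """Same computation, but with a fresh buffer for every call."""
    work = [0.0] * len(values)
    for index, value in enumerate(values):
        work[index] = value
        yield
    return sum(work)


def step_alternately(gen_a, gen_b):
    """Advance two generators in lock-step and collect their return values."""
    results = {}
    pending = {"a": gen_a, "b": gen_b}
    while pending:
        for name in list(pending):
            try:
                next(pending[name])
            except StopIteration as done:
                results[name] = done.value
                del pending[name]
    return results["a"], results["b"]


if __name__ == "__main__":
    small, large = [1, 2, 3, 4], [10, 20, 30, 40]
    print(step_alternately(sum_with_static_work(small), sum_with_static_work(large)))
    print(step_alternately(sum_with_local_work(small), sum_with_local_work(large)))
```

In the shared-buffer variant the first caller ends up summing the second caller's data, which is the kind of silent numeric failure described above; with per-call buffers each caller gets its own result.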

brada4 commented 5 years ago

Ok BLAS imports (probably some of L1 is masked by gsl cblas macros)

All BLAS have thread limits, it is a performance issue for particular functions for small inputs, not crasher or something

There are some dangerous LAPACK functions getting imported mandating frecursive

BLAS L1
ddot_
BLAS L2
dgemv_ zgemv_
BLAS L3
dgemm_ zgemm_
dsyrk_
zherk_
LAPACK THREADSAFE
ilaenv_
dgetrf_
dgetri_
LAPACK needing -frecursive
dgesv_
dgelsd_
dgels_
dgesvx_

Let me summarize:
1/ GOMP and CLANG OMP are not very friendly to each other (probably they emit different pthread lock IDs from the same top-level functions, so the locks do not work right at all)
2/ G++ leads to early crashes
3/ LAPACK functions that were not thread-safe before -frecursive are present

I think for now it is best to import the pthread version in all circumstances in serial programs, and the single-threaded version, safeguarded with -frecursive, in threaded ones, and to keep the GOMP version in the basement for programs that do not crash when built with the GCC world (as a disabled-by-default option, for example).

The only dangers are performance-related, i.e. an OMP program imports the threaded version and gets N^2 threads, which can be brought under control with environment variables, or a single-threaded program imports the single-threaded version, which is still faster than netlib but leaves a lot of room for improvement.

Improvements gained towards 0.3.4:

yurivict commented 5 years ago

I see now that OPENMP isn't a default option in the port, changes that I made only apply to the non-default case. I'll move them to be for all versions.

brada4 commented 5 years ago

@martin-frbg dgetrf, zherk and dsyrk are not guarded against early threading. Ignore me if I don't produce PRs today.

@yurivict The following changes will need to go into 0.3.4 on top of what we did here: DYNAMIC_OLDER=1 should supplement DYNAMIC_ARCH=1 on amd64/x86_64.

No more need for -frecursive, it is now in the right place.

There will be AVX-512 (Skylake-X) support; both FreeBSD clang and gcc can compile it, so a new option for that(?).

In principle it builds with clang+flang too (once the latter works), if you want to experiment on the other side of the OMP world, but that is not required at all.

I think we cannot improve anything here, but feel free to report if you stumble on anything similarly weird.