No idea. If anything, thread memory requirements of 0.3.12 should be less than before, and I thought NumPy was satisfied with the state of OpenBLAS after application of the fmod workarounds. Are you bundling the actual 0.3.12 release or some snapshot from around that time ? And the older OpenBLAS you went back to is what - 0.3.10 or even older ?
@martin-frbg Our CI testing showed no problems and the wheels builds succeeded, so 0.3.12 looked good prior to release.
EDIT: I also expect a fair number of people tried 1.19.3 because of the Python 3.9 support. If the problem was universal there would probably have been more bug reports.
Well, the only change in 0.3.11/12 that I would expect to have any effect at library/thread initialization time is the reduction of the BLAS3_MEM_ALLOC_THRESHOLD variable, and its previous default is still available in Makefile.rule (or as a parameter to make). Reducing this actually made crashes in other people's code go away, however. If your "working" OpenBLAS is older than 0.3.10 then we are looking at a scary number of changes, and probably need a git bisect to get anywhere.
Unfortunately, it's 0.3.9
The original issues got closed without a code change: https://github.com/numpy/numpy/issues/17674#issuecomment-720637865 https://github.com/numpy/numpy/issues/17684#issuecomment-720303128
@brada4 We switched back to the earlier library for 1.19.4.
Hm thanks. 0.3.10 (or at least the changes that looked important enough to label with the milestone) was mostly CMAKE build improvements and a few thread race fixes, nothing that I would expect to blow up as soon as the library gets loaded.
The build tag (a32f1dca) on the OpenBLAS .so file is not related to any tags here. Where does it come from, and how do we trace back to the OpenBLAS source code used to build the library?
The backtrace in those issues says the last OpenBLAS thread jumped to 0x0 - that should not happen, and without a full backtrace it is hard to tell what made it do so.
Also, given the threads mentioned - would it be possible to run the same crashing docker with OPENBLAS_NUM_THREADS=1 added?
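For reference, a minimal sketch of trying that from Python rather than the docker command line; the key point is that OPENBLAS_NUM_THREADS is only honoured if it is set before the library is loaded, so it has to come before the NumPy import.

```python
# Minimal sketch: force single-threaded OpenBLAS for a sanity check.
# The variable must be set before NumPy (and thus OpenBLAS) is imported.
import os
os.environ["OPENBLAS_NUM_THREADS"] = "1"

import numpy as np

a = np.random.rand(1000, 1000)
print((a @ a).trace())  # dgemm should now run on a single thread
```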
The descriptive string v0.3.7-527-g79fd006c is the result of git describe --tags --abbrev=8. The OpenBLAS commit is 79fd006c and is somewhere between 0.3.9 and 0.3.10.
According to the comment in the issue there is information at https://drive.google.com/drive/folders/18mcYQu4GGPzwCRj14Pzy8LvfC9bcLid8?usp=sharing on the docker environment where the segfault occurs.
It would take ages to reconstruct the failing environment 1:1. The question is whether it is possible to clearly attribute the problem to "OpenBLAS threading" by confirming there is no problem whatsoever when threading is turned off via the environment variable.
The issue mentions /virtualenv/python3.6/lib/python3.6/site-packages/numpy.libs/libopenblasp-r0-a32f1dca.3.12.so - a GitHub search points it to this thread (and a full search to the same tag mentioned in the exact numpy binary build) but not to the code - could you help trace it to an OpenBLAS tag?
I am not picky or anything, it would just be easier for everyone if we knew the numerology behind the file tagging and could admit guilt right away ;-)
@brada4 I am not sure what you are asking. A static build of OpenBLAS is downloaded and built into the .so you see, together with some other things, as part of the NumPy build process. The reason it has a hash is to uniquely bind that .so to the NumPy build; it does not directly reflect a version of OpenBLAS. The exact version of OpenBLAS used in the build of the wheel is the code tagged with the 1.19.3 tag, and is here - you can see the tag is 0.3.12. You can download the wheel and check the result of openblas_get_config(), as we do later in that file, to verify that the OpenBLAS version is the one we thought it was.
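As a rough sketch of that check done by hand, one can load the bundled library and call openblas_get_config() directly. The numpy.libs path layout is an assumption (it is what manylinux wheels use; other platforms differ).

```python
# Sketch: ask the OpenBLAS bundled in a manylinux NumPy wheel which version it is.
# The numpy.libs location and filename pattern are assumptions about the wheel layout.
import ctypes
import glob
import os
import numpy as np

libs_dir = os.path.join(os.path.dirname(np.__file__), os.pardir, "numpy.libs")
candidates = glob.glob(os.path.join(libs_dir, "libopenblas*.so"))

openblas = ctypes.CDLL(candidates[0])
openblas.openblas_get_config.restype = ctypes.c_char_p
print(openblas.openblas_get_config().decode())  # e.g. "OpenBLAS 0.3.12 ..."
```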
@mattip thanks, that fills the gap in my understanding.
One thing that was changed in 0.3.12 is x86_64: clobber all xmm registers after vzeroupper. See also Omit clobbers from vzeroupper until final [PR92190]. The Microsoft x64 calling convention says that user-written assembly language routines must preserve XMM6 and XMM7. Is this patch really necessary? And is it valid for the Windows 64 ABI?
Another bit can be found here: Software consequences of extending XMM to YMM (Agner Fog)
However, this scenario is only relevant if the legacy function saves and restores the XMM register, and this happens only in 64-bit Windows. The ABI for 64-bit Windows specifies that register XMM6 - XMM15 have callee-save status, i.e. these registers must be saved and restored if they are used. All other x86 operating systems (32-bit Windows, 32- and 64-bit Linux, BSD and Mac) have no XMM registers with callee-save status. So this discussion is relevant only to 64-bit Windows. There can be no problem in any other operating system because there are no legacy functions that save these registers anyway.
Both the users who reported this issue are using dockers on linux, right?
Callee-saves is implemented in the assembly PROLOGUE/EPILOGUE for Windows, so this particular change should play no role here.
OpenBLAS 0.3.12 is making large memory requests on initialization that depend heavily on the number of threads. With 24 threads, I am seeing:
This scales pretty linearly with the number of threads when I set $env:OPENBLAS_NUM_THREADS.
When I roll back to 0.3.9 the memory ask is about 1/4:
I checked 0.3.10 and it also shows high memory usage, so whatever changed seems to have come in between 0.3.9 and 0.3.10.
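For anyone wanting to reproduce that scaling, here is a rough sketch of the measurement (Linux-only, since it reads VmHWM from /proc; on Windows one would read peak working set instead, as in the PowerShell runs above).

```python
# Sketch: observe how the footprint after importing NumPy scales with
# OPENBLAS_NUM_THREADS. Each child imports NumPy, does one tiny matmul,
# and reports its peak resident memory (Linux-only via /proc).
import os
import subprocess
import sys

probe = (
    "import numpy as np; "
    "np.ones((8, 8)) @ np.ones((8, 8)); "
    "print([l for l in open('/proc/self/status') if l.startswith('VmHWM')][0].strip())"
)

for n in (1, 4, 12, 24):
    env = dict(os.environ, OPENBLAS_NUM_THREADS=str(n))
    out = subprocess.run([sys.executable, "-c", probe],
                         env=env, capture_output=True, text=True)
    print(f"threads={n:>2}  {out.stdout.strip()}")
```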
That then is probably the GEMM buffer, configurable as BUFFERSIZE at compile time (in Makefile.rule). The default was increased a few times recently to fix overflows at huge matrix sizes - there is certainly a trade-off, and possibly a fundamental design flaw in OpenBLAS, involved. 0.3.12 built with make BUFFERSIZE=20 should have about the same footprint as 0.3.9. (See #1698 for the original issue, #2538 for the motivation to actually change the defaults.)
If we shrink the BUFFERSIZE, would we run the risk of running out of room with 24 threads and large matrices? Should NumPy cap the number of threads with openblas_set_num_threads?
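For reference, a sketch of what capping the thread count at runtime could look like from the Python side. This uses the third-party threadpoolctl package rather than anything NumPy ships (an assumption about tooling, not a statement of what NumPy should do), and it only limits the threads used for computation; it does not necessarily release buffers already reserved when the library was loaded.

```python
# Sketch: cap the number of BLAS threads for a block of work via threadpoolctl
# (third-party package). This limits compute threads, not the memory already
# reserved at library load time.
import numpy as np
from threadpoolctl import threadpool_limits

a = np.random.rand(2000, 2000)
with threadpool_limits(limits=4, user_api="blas"):
    b = a @ a  # dgemm runs with at most 4 BLAS threads inside this block
```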
Bumping from 0.3.10 to 0.3.12 in nixpkgs has caused numpy to fail its tests: https://github.com/NixOS/nixpkgs/pull/101780#issuecomment-717440602
stacktrace:
(gdb) backtrace
#0 0x00007fffe33a327a in ?? () from /nix/store/931l4l3ab5fg1x4cf0wx8pqg1prqgdmj-gfortran-9.3.0-lib/lib/libgomp.so.1
#1 0x00007fffe33a1e09 in ?? () from /nix/store/931l4l3ab5fg1x4cf0wx8pqg1prqgdmj-gfortran-9.3.0-lib/lib/libgomp.so.1
#2 0x00007fffe88fc11f in exec_blas () from /nix/store/zk5s85i72gnzfrz95r0n783zvgh11ndm-lapack-3/lib/liblapack.so.3
#3 0x00007fffe87c9d93 in gemm_driver () from /nix/store/zk5s85i72gnzfrz95r0n783zvgh11ndm-lapack-3/lib/liblapack.so.3
#4 0x00007fffe87c9f67 in dgemm_thread_nn () from /nix/store/zk5s85i72gnzfrz95r0n783zvgh11ndm-lapack-3/lib/liblapack.so.3
#5 0x00007fffe86dedc7 in cblas_dgemm () from /nix/store/zk5s85i72gnzfrz95r0n783zvgh11ndm-lapack-3/lib/liblapack.so.3
#6 0x00007fffea2e87c1 in cblas_matrixproduct ()
from /nix/store/1dynhpbdks2lzpjbkz1i5p0rkm5klfn4-python3.8-numpy-1.19.4/lib/python3.8/site-packages/numpy/core/_multiarray_umath.cpython-38-x86_64-linux-gnu.so
@jonringer this looks like something else, not the BUFFERSIZE thing. Wonder why the backtrace shows the OpenMP library and a liblapack but nothing that calls itself OpenBLAS - do you have libopenblas aliased to liblapack, or could this be a mixup of library builds ? (So far I have not seen mention of numpy itself failing tests with 0.3.12, cf charris' comment above https://github.com/xianyi/OpenBLAS/issues/2970#issuecomment-720604628 ) (Accidentally closed this due to touchpad malfunction...)
I am a little hesitant about overriding the OpenBLAS default BUFFERSIZE constant in NumPy's build process. OpenBLAS changed it for good reasons, and it could cause incompatibilities. Is refactoring this on the roadmap?
And we are not even sure that will solve the problem
On the long-term roadmap, but I keep bumping the milestone - this is at the very core of OpenBLAS, and unless it can be traced to some silly typo from the last couple of years, an attempted fix could have worse consequences than what I experienced with the ill-fated 0.3.11. The TLS code would probably help, but that needs reviewing too. I do not think reducing the BUFFERSIZE would cause more potential incompatibilities than the rollback to 0.3.9 with its smaller default, but I have not even gotten around to trying the docker reproducer. Perhaps the fundamental problem is that GotoBLAS was designed to be compiled for a specific purpose, with parameters adjusted as needed for that particular user (and at a time when typical problem sizes and thread counts were both smaller). Now with various distributors we have a "one size fits all" thing where some want it to have a minimal footprint while others want to diagonalize huge matrices.
Another question is for the Windows world: do we really need 24 threads for "one size fits all" applications? I presume 12 threads is more than enough for most Windows applications on typically equipped machines. If you really need more, you would be asked to compile OpenBLAS yourself with more threads and a larger buffer size.
ThreadRipper Pro owners would tend to disagree with such over-generalisation.
@brada4, that's right. But owners of older CPUs or machines with little RAM may agree. This is, however, much more a question concerning numpy deployment, so sorry for the noise.
ThreadRipper Pro owners would tend to disagree with such over-generalisation.
I think most desktop users are unlikely to have a problem. I believe the problem was experienced in containers, where the true CPU count of the system is reported even though the container has limits on its resource use (e.g., memory allocation). Launching NumPy + OpenBLAS on a system with 96 vCPUs could request ~24GiB of memory.
I think many desktop users will have ~2GiB+/vCPU, and so would not directly see this issue. AFAICT all of the reports of Windows problems were about containers.
The Linux issues may be something else.
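To illustrate the container point above, here is a small sketch of how the host CPU count and the limits a container actually has can diverge. The cgroup file paths are an assumption (cgroup v1 shown; cgroup v2 uses cpu.max instead), and sched_getaffinity is Linux-only.

```python
# Sketch: host CPU count vs. what the container/process can actually use.
import os

print("os.cpu_count()    :", os.cpu_count())                # host CPUs visible
print("sched_getaffinity :", len(os.sched_getaffinity(0)))  # CPUs usable here

# cgroup v1 CPU quota, if present (path is an assumption; v2 differs)
try:
    quota = int(open("/sys/fs/cgroup/cpu/cpu.cfs_quota_us").read())
    period = int(open("/sys/fs/cgroup/cpu/cpu.cfs_period_us").read())
    if quota > 0:
        print("cgroup CPU limit  :", quota / period)
except (FileNotFoundError, ValueError):
    pass
```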
Edit: That said, I did experience an issue when I ran a test suite with 12 workers without setting OPENBLAS_NUM_THREADS, where each worker was requesting 6GiB. It didn't crash, but the test suite had strange errors such as unexplained segfaults.
@carlkl - wait a sec - you mean the huge allocation is per configured maximum and not per actual CPU cores visible? That's certainly excessive... Typical systems have a gigabyte to ten per core, it is just not all meant for a single task.
do you have libopenblas aliased to liblapack, or could this be a mixup of library builds ?
Yes, Nixpkgs has them aliased.
$ ls -l ./result/lib/lib*
lrwxrwxrwx 23 root 31 Dec 1969 ./result/lib/libblas.so -> libopenblasp-r0.3.12.so
lrwxrwxrwx 23 root 31 Dec 1969 ./result/lib/libblas.so.3 -> libopenblasp-r0.3.12.so
lrwxrwxrwx 23 root 31 Dec 1969 ./result/lib/libcblas.so -> libopenblasp-r0.3.12.so
lrwxrwxrwx 23 root 31 Dec 1969 ./result/lib/libcblas.so.3 -> libopenblasp-r0.3.12.so
lrwxrwxrwx 23 root 31 Dec 1969 ./result/lib/liblapack.so -> libopenblasp-r0.3.12.so
lrwxrwxrwx 23 root 31 Dec 1969 ./result/lib/liblapack.so.3 -> libopenblasp-r0.3.12.so
lrwxrwxrwx 23 root 31 Dec 1969 ./result/lib/liblapacke.so -> libopenblasp-r0.3.12.so
lrwxrwxrwx 23 root 31 Dec 1969 ./result/lib/liblapacke.so.3 -> libopenblasp-r0.3.12.so
lrwxrwxrwx 23 root 31 Dec 1969 ./result/lib/libopenblas.so -> libopenblasp-r0.3.12.so
lrwxrwxrwx 23 root 31 Dec 1969 ./result/lib/libopenblas.so.0 -> libopenblasp-r0.3.12.so
.r-xr-xr-x 27M root 31 Dec 1969 ./result/lib/libopenblasp-r0.3.12.so
In the case of numpy, it was linked against the lapack package, which took the shared libraries from libopenblas.
blas libraries are "normalized" in their outputs, so that users can choose their blas implementation and rebuild all dependent packages; by default, openblas will be chosen.
@brada4, this is from Makefile.rule:
Due to the way some internal structures are allocated, using a large NUM_THREADS value has a RAM footprint penalty, even if users reduce the actual number of threads at runtime.
However, I'm not sure whether the number of actual CPU cores limits this footprint at startup to <= max(NUM_THREADS, no. of CPU cores). @martin-frbg should know.
I remember strictly adding one page in every few-processor build once (#1858), but that is non-TLS stuff. Probably more footprint was added in the meantime.
It didn't crash but the test suite had strange errors such as unexplained segfaults.
Could it be that not all malloc calls are checked for errors, like in common.h blas_memory_alloc, like the ones for PPC440?
Obviously... OTOH it is a bit unclear how OpenBLAS should react - "historically" some things are simply expected to work. (Printing an error message would probably be nice, if memory management hasn't been trashed too badly already at that point.)
(And before you ask - it used to be worse...)
Hmm. Had understood the original issue ticket to have a docker image for reproducing the problem, but the google drive seems to contain only screenshots of the docker version and computer hardware involved.
"prod with error" is not expected to work. Ubuntu 16 kernel has no provision of AVX512 XSAVE.
@brada4, do you mean some individual SIMD instruction set extensions are disabled (using XCR0 register) by the virtual OS within Docker, and this is the reason for the segfaults?
AVX512 extensions are visible in the KVM guest, but using them leads to numeric eurekas, which likely signify register corruption. The HWE kernel (4.15.xxxx-generic) on both sides functions properly. Did not try 18 LTS usermode over a 16 LTS kernel though. Proper tests with significance here would be trying OPENBLAS_NUM_THREADS=1 and OPENBLAS_CORETYPE=(HASWELL|SANDYBRIDGE) and seeing whether the problem goes or stays.
What are numeric eurekas? If I understand correctly, you are saying that anyone with a cpu that supports avx512 extensions should be able to crash OpenBLAS by running OpenBLAS/NumPy tests in a KVM guest?
Seems quite unlikely to me - there was code using AVX512 extensions in 0.3.9 already and the DYNAMIC_ARCH builds are supposed to do a runtime check for actual availability. I'd be much more interested in results from a build with the smaller 0.3.9 BUFFERSIZE, or a simple self-contained reproducer. If anything I would expect AVX512 mishandling to result in SIGILL or SIGFPE (or silent garbage results), not SIGSEGV
Had understood the original issue ticket to have a docker image for reproducing the problem
Sorry, I corrected that in the description. FWIW, the information provided indicates the host machine has 128GB and 31 cores. I can't replicate that locally, unfortunately. How much memory does the default BUFFERSIZE allocate per thread?
128MB; with 0.3.9 it was only 32MB (written as a bit shift in common_x86_64.h, it is "32<<22" vs "32<<20", hence the suggestion to specify "20" in https://github.com/xianyi/OpenBLAS/issues/2970#issuecomment-721281546).
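Spelling out that arithmetic (shift values taken from the comment above; this is only the GEMM buffer, the real per-thread allocation also includes some additional structures):

```python
# Per-thread GEMM buffer sizes implied by the bit shifts quoted above.
MiB = 1 << 20

buf_0312 = 32 << 22   # 0.3.12 default: 134217728 bytes = 128 MiB per thread
buf_039  = 32 << 20   # 0.3.9 default:   33554432 bytes =  32 MiB per thread

print(buf_0312 // MiB, buf_039 // MiB)              # 128 32
print(24 * buf_0312 // MiB, 24 * buf_039 // MiB)    # 3072 768 (MiB for 24 threads)
```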
What are numeric eurekas? If I understand correctly, you are saying that anyone with a cpu that supports avx512 extensions should be able to crash OpenBLAS by running OpenBLAS/NumPy tests in a KVM guest?
Completely corrupt numeric results - basically all OpenBLAS tests fail, ssh drops out often, etc. Anyone running Ubuntu 16 on both sides of KVM and rigging the compiler to have AVX-512 supported - then yes, it stops working.
This may be unrelated to the other discussion items, but I was successful in finding the offending commit. I also verified that reverting the commit allowed me to build numpy and run the numpy tests on a 3990X. Offending commit: 3094fc6c83c7a623f9a7e7846eb711a8a99ddfff
Testing steps:
Nix allows you to pin certain packages, and substitute it in all downstream packages. Builds are hermetic, and usually byte reproducible.
Started git bisect on the 0.3.10 release up until the current develop branch: git bisect start HEAD 63b03ef
bisect command:
git bisect run nix-build default.nix -A python3Packages.numpy --no-build-output --cores 128
gist of run: https://gist.github.com/jonringer/3d9351b8f1e153f5a7275975880b4319
This also seems to align with my previous comment https://github.com/xianyi/OpenBLAS/issues/2970#issuecomment-722735324, in which the segfault occurred inside libgomp.
libgomp segfault not reproduced on Ryzen5-4600H with current develop (with and without interface64 and symbol suffixing) and gcc 7.5. Rerunning the builds with gcc 9.3 now but no segfault so far.
@jonringer can you try out the NumPy wheels from https://anaconda.org/scipy-wheels-nightly/numpy ? We built the last round of wheels with BUFFERSIZE=20, so it would be nice to know if that will work as a stop-gap for the upcoming NumPy release (assuming the wheels installed via pip install numpy==1.19.3, with stock OpenBLAS 0.3.12, crash for you).
Thx @mattip - obviously my 4600 is a poor substitute for a Threadripper, but at least this does not appear to be some separate issue with OpenMP and forks after all. (Wonder how much memory that TR has; obviously a fork() would be a perfect place to run out of it.)
Just a heads up that NumPy is seeing some problems with OpenBLAS 0.3.12. We released NumPy 1.19.3 with OpenBLAS 0.3.12 in order to fix the Windows fmod problems. We had to back it out and are releasing 1.19.4 with the previous OpenBLAS and code to error out if the fmod bug is detected instead. We got reports that 1.19.3 crashes, xref numpy/numpy#17674 and numpy/numpy#17684. In the first issue, the reporter provided ~a docker image and code to reproduce~ output of some system diagnostics. The second issue is a lack of memory in hstack, which is difficult to attribute to OpenBLAS, so there may be something else going on. Both issues were "fixed" (or covered up) by the 1.19.4 release candidate with an older OpenBLAS. Does any of this make sense to you?
Edit: no docker image, just diagnostics