OpenMathLib / OpenBLAS

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
http://www.openblas.net
BSD 3-Clause "New" or "Revised" License

maybe a problem with 0.3.12 and NumPy? #2970

Closed - mattip closed this issue 3 years ago

mattip commented 4 years ago

Just a heads up, that NumPy is seeing some problems with OpenBLAS 0.3.12.

We released NumPy 1.19.3 with OpenBLAS 0.3.12 in order to fix the Windows fmod problems. We had to back it out and are releasing a 1.19.4 with the previous OpenBLAS, plus code that errors out if the fmod bug is detected. We got reports that 1.19.3 crashes, xref numpy/numpy#17674 and numpy/numpy#17684. In the first issue, the reporter provided ~a docker image and code to reproduce~ output of some system diagnostics. The second issue is a lack of memory in hstack, which is difficult to attribute to OpenBLAS, so there may be something else going on. Both issues were "fixed" (or covered up) by the 1.19.4 release candidate with an older OpenBLAS. Does any of this make sense to you?

Edit: no docker image, just diagnostics

martin-frbg commented 4 years ago

No idea. If anything, the thread memory requirements of 0.3.12 should be less than before, and I thought NumPy was satisfied with the state of OpenBLAS after application of the fmod workarounds. Are you bundling the actual 0.3.12 release or some snapshot from around that time? And the older OpenBLAS you went back to is what - 0.3.10 or even older?

charris commented 4 years ago

@martin-frbg Our CI testing showed no problems and the wheel builds succeeded, so 0.3.12 looked good prior to release.

EDIT: I also expect a fair number of people tried 1.19.3 because of the Python 3.9 support. If the problem was universal there would probably have been more bug reports.

martin-frbg commented 4 years ago

Well, the only change in 0.3.11/0.3.12 that I would expect to have any effect at library/thread initialization time is the reduction of the BLAS3_MEM_ALLOC_THRESHOLD variable, and its previous default is still available in Makefile.rule (or as a parameter to make). Reducing this actually made crashes in other people's code go away, however. If your "working" OpenBLAS is older than 0.3.10 then we are looking at a scary number of changes, and probably need a git bisect to get anywhere.

h-vetinari commented 4 years ago

Unfortunately, it's 0.3.9

brada4 commented 4 years ago

The original issues were closed without a code change: https://github.com/numpy/numpy/issues/17674#issuecomment-720637865 https://github.com/numpy/numpy/issues/17684#issuecomment-720303128

charris commented 4 years ago

@brada4 We switched back to the earlier library for 1.19.4.

martin-frbg commented 4 years ago

Hm, thanks. 0.3.10 (or at least the changes that looked important enough to label with the milestone) was mostly CMake build improvements and a few thread-race fixes - nothing that I would expect to blow up as soon as the library gets loaded.

brada4 commented 4 years ago

The build tag (a32f1dca) on the OpenBLAS .so file is not related to any tags here. Where does it come from, and how do we trace back to the OpenBLAS source code used to build the library? The backtrace in those issues says the last OpenBLAS thread jumped to 0x0, which should not happen, and without a backtrace it is hard to tell what made it do so. Also, given the threads mentioned - would it be possible to run the same crashing docker with OPENBLAS_NUM_THREADS=1 added?

mattip commented 4 years ago

The descriptive string v0.3.7-527-g79fd006c is the result of git describe --tags --abbrev=8. The OpenBLAS commit is 79fd006c and is somewhere between 0.3.9 and 0.3.10.
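
For reference, the string is exactly what git produces in an OpenBLAS checkout at that commit:

```
$ git describe --tags --abbrev=8
v0.3.7-527-g79fd006c
```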

mattip commented 4 years ago

According to the comment in the issue there is information at https://drive.google.com/drive/folders/18mcYQu4GGPzwCRj14Pzy8LvfC9bcLid8?usp=sharing on the docker environment where the segfault occurs.

brada4 commented 4 years ago

It would take ages to reconstruct the failing environment 1:1. The question is whether the problem can be clearly attributed to "OpenBLAS threading" by confirming there is no problem whatsoever when threading is turned off via the environment variable - for example, something like the command sketched below.
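
A sketch of what that test could look like, assuming the reporter's container is at hand (the image name and reproducer script are placeholders, not anything from the original reports):

```
# re-run the failing container with OpenBLAS threading disabled
docker run -e OPENBLAS_NUM_THREADS=1 <failing-image> python repro.py
```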

The issue mentions /virtualenv/python3.6/lib/python3.6/site-packages/numpy.libs/libopenblasp-r0-a32f1dca.3.12.so - a github search points it to this thread (and a full search to the same tag mentioned in the exact numpy binary build), but not to the code. Could you help dig it down to an OpenBLAS tag? I am not picky or anything; it would be easier for all if we knew the numerology behind the file tagging and could admit guilt right away ;-)

mattip commented 4 years ago

@brada4 I am not sure what you are asking. A static build of OpenBLAS is downloaded and built into the .so you see, together with some other things, as part of the NumPy build process. The reason it has a hash is to uniquely bind that .so to the NumPy build; it does not directly reflect a version of OpenBLAS. The exact version of OpenBLAS used in the build of the wheel is the code tagged with the 1.19.3 tag, and is here - you can see the tag is 0.3.12. You can download the wheel and check the result of openblas_get_config() as we do later in that file to verify that the OpenBLAS version is the one we thought it was.
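
For anyone wanting to run that check themselves, a minimal sketch (the numpy.libs location is an assumption about how manylinux wheels vendor the library; adjust the glob for your platform):

```python
import ctypes, glob, os
import numpy as np

# manylinux wheels place the hashed OpenBLAS .so next to the numpy package
libdir = os.path.join(os.path.dirname(np.__file__), os.pardir, "numpy.libs")
sofile = glob.glob(os.path.join(libdir, "libopenblas*.so"))[0]

openblas = ctypes.CDLL(sofile)
openblas.openblas_get_config.restype = ctypes.c_char_p
print(openblas.openblas_get_config().decode())  # e.g. "OpenBLAS 0.3.12 ..."
```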

brada4 commented 4 years ago

@mattip thanks, that fills the gap in my understanding.

carlkl commented 4 years ago

One thing that changed in 0.3.12 is x86_64: clobber all xmm registers after vzeroupper. See also Omit clobbers from vzeroupper until final [PR92190]. And the Microsoft x64 calling convention says that user-written assembly language routines must preserve XMM6 and XMM7. Is this patch really necessary? And is it valid for the Windows 64 ABI?

carlkl commented 4 years ago

Another bit can be found here: Software consequences of extending XMM to YMM (Agner Fog)

However, this scenario is only relevant if the legacy function saves and restores the XMM register, and this happens only in 64-bit Windows. The ABI for 64-bit Windows specifies that register XMM6 - XMM15 have callee-save status, i.e. these registers must be saved and restored if they are used. All other x86 operating systems (32-bit Windows, 32- and 64-bit Linux, BSD and Mac) have no XMM registers with callee-save status. So this discussion is relevant only to 64-bit Windows. There can be no problem in any other operating system because there are no legacy functions that save these registers anyway.

mattip commented 4 years ago

Both of the users who reported this issue are using Docker on Linux, right?

martin-frbg commented 4 years ago

Callee-saves is implemented in the assembly PROLOGUE/EPILOGUE for Windows, so this particular change should play no role here.

bashtage commented 4 years ago

OpenBLAS 0.3.12 is making large memory requests on initialization that depend heavily on the number of threads. With 24 threads, I am seeing:

[screenshot of memory usage]

This scales pretty linearly with the number of threads when I set $env:OPENBLAS_NUM_THREADS.
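
The load-time equivalent from Python, for what it's worth (a sketch - the variable has to be set before numpy, and hence OpenBLAS, is imported, since the library reads it once during initialization):

```python
import os

# must happen before the import below; OpenBLAS reads it at load time
os.environ["OPENBLAS_NUM_THREADS"] = "4"

import numpy as np
```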

bashtage commented 4 years ago

When I roll back to 0.3.9, the memory ask is about 1/4 of that:

[screenshot of memory usage]

I checked 0.3.10 and it also shows high memory usage, so it seems like something between these two releases.

martin-frbg commented 4 years ago

That then is probably the GEMM buffer, configurable as BUFFERSIZE at compile time (in Makefile.rule). The default was increased a few times recently to fix overflows at huge matrix sizes - there is certainly a trade-off, and possibly a fundamental design flaw in OpenBLAS, involved. 0.3.12 built with make BUFFERSIZE=20 should have about the same footprint as 0.3.9. (See #1698 for the original issue, #2538 for the motivation to actually change the defaults.)

mattip commented 4 years ago

If we shrink the BUFFERSIZE, would we run the risk of running out of room with 24 threads and large matrices? Should NumPy cap the number of threads with openblas_set_num_threads?
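
(For illustration, a rough ctypes sketch of calling that API from Python - the soname here is a placeholder, since the wheels bundle a hashed filename:)

```python
import ctypes

# library name varies by platform/build; wheels use a hashed filename
openblas = ctypes.CDLL("libopenblas.so.0")
openblas.openblas_set_num_threads(4)  # cap the worker pool at 4 threads
```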

martin-frbg commented 4 years ago
1. Yes, in the same way as 0.3.9 and earlier, but not worse.
2. Probably yes, though that seems to depend on the use case (method and context of loading numpy/openblas).

All this assumes that BUFFERSIZE is actually the culprit - I have not gotten around to trying the dockerfile from the numpy ticket, and that ticket did not mention hardware constraints.

jonringer commented 4 years ago

Bumping from 0.3.10 to 0.3.12 in nixpkgs has caused numpy tests to fail: https://github.com/NixOS/nixpkgs/pull/101780#issuecomment-717440602

stacktrace:

(gdb) backtrace
#0  0x00007fffe33a327a in ?? () from /nix/store/931l4l3ab5fg1x4cf0wx8pqg1prqgdmj-gfortran-9.3.0-lib/lib/libgomp.so.1
#1  0x00007fffe33a1e09 in ?? () from /nix/store/931l4l3ab5fg1x4cf0wx8pqg1prqgdmj-gfortran-9.3.0-lib/lib/libgomp.so.1
#2  0x00007fffe88fc11f in exec_blas () from /nix/store/zk5s85i72gnzfrz95r0n783zvgh11ndm-lapack-3/lib/liblapack.so.3
#3  0x00007fffe87c9d93 in gemm_driver () from /nix/store/zk5s85i72gnzfrz95r0n783zvgh11ndm-lapack-3/lib/liblapack.so.3
#4  0x00007fffe87c9f67 in dgemm_thread_nn () from /nix/store/zk5s85i72gnzfrz95r0n783zvgh11ndm-lapack-3/lib/liblapack.so.3
#5  0x00007fffe86dedc7 in cblas_dgemm () from /nix/store/zk5s85i72gnzfrz95r0n783zvgh11ndm-lapack-3/lib/liblapack.so.3
#6  0x00007fffea2e87c1 in cblas_matrixproduct ()
   from /nix/store/1dynhpbdks2lzpjbkz1i5p0rkm5klfn4-python3.8-numpy-1.19.4/lib/python3.8/site-packages/numpy/core/_multiarray_umath.cpython-38-x86_64-linux-gnu.so

lscpu info:

```
$ lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   43 bits physical, 48 bits virtual
CPU(s):                          128
On-line CPU(s) list:             0-127
Thread(s) per core:              2
Core(s) per socket:              64
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       AuthenticAMD
CPU family:                      23
Model:                           49
Model name:                      AMD Ryzen Threadripper 3990X 64-Core Processor
Stepping:                        0
Frequency boost:                 enabled
CPU MHz:                         3776.522
CPU max MHz:                     2900.0000
CPU min MHz:                     2200.0000
BogoMIPS:                        5800.49
Virtualization:                  AMD-V
L1d cache:                       2 MiB
L1i cache:                       2 MiB
L2 cache:                        32 MiB
L3 cache:                        256 MiB
NUMA node0 CPU(s):               0-127
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full AMD retpoline, IBPB conditional, STIBP conditional, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate sme ssbd mba sev ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip rdpid overflow_recov succor smca
```

martin-frbg commented 4 years ago

@jonringer this looks like something else, not the BUFFERSIZE thing. Wonder why the backtrace shows the OpenMP library and a liblapack but nothing that calls itself OpenBLAS - do you have libopenblas aliased to liblapack, or could this be a mixup of library builds? (So far I have not seen mention of numpy itself failing tests with 0.3.12, cf. charris' comment above https://github.com/xianyi/OpenBLAS/issues/2970#issuecomment-720604628.) (Accidentally closed this due to touchpad malfunction...)

mattip commented 4 years ago

I am a little hesitant about overriding the OpenBLAS default BUFFERSIZE constant in NumPy's build process. OpenBLAS changed it for good reasons, and it could cause incompatibilities. Is refactoring this on the roadmap?

mattip commented 4 years ago

And we are not even sure that will solve the problem

martin-frbg commented 4 years ago

It is on the long-term roadmap, but I keep bumping the milestone - this is at the very core of OpenBLAS, and unless it can be traced to some silly typo from the last couple of years, an attempted fix could have worse consequences than what I experienced with the ill-fated 0.3.11. The TLS code would probably help, but that needs reviewing too. I do not think reducing the BUFFERSIZE would cause more potential incompatibilities than the rollback to 0.3.9 with its smaller default, but I have not even gotten around to trying the docker reproducer. Perhaps the fundamental problem is that GotoBLAS was designed to be compiled for a specific purpose, with parameters adjusted as needed for that particular user (and at a time when typical problem sizes and thread counts were both smaller). Now, with various distributors, we have a "one size fits all" thing where some want a minimal footprint while others want to diagonalize huge matrices.

carlkl commented 4 years ago

Another question, for the Windows world: do we really need 24 threads for "one size fits all" applications? I presume 12 threads is more than enough for most Windows applications, given typically equipped machines. If you really need more, you are then asked to compile OpenBLAS yourself with a larger thread count and buffer size.

brada4 commented 4 years ago

ThreadRipper Pro owners would tend to disagree with such over-generalisation.

carlkl commented 4 years ago

@brada4, that's right. But owners of older CPUs, or of machines with little RAM, may agree. This is, however, much more a question of numpy deployment, so sorry for the noise.

bashtage commented 4 years ago

ThreadRipper Pro owners would tend to disagree with such over-generalisation.

I think most desktop users are unlikely to have a problem. I believe the problem was experienced in containers, where the true CPU count of the host is reported even though the container has limits on its resource use (e.g., memory allocation). Launching NumPy + OpenBLAS on a system with 96 vCPUs could request ~24GiB of memory.

I think many desktop users will have ~2GiB+/vCPU, and so would not directly see this issue. AFAICT all of the reports of Windows problems were about containers.

The Linux issues may be something else.

Edit: That said, I did experience an issue when I ran a test suite with 12 workers without setting OPENBLAS_NUM_THREADS, where each worker was requesting 6GiB. It didn't crash but the test suite had strange errors such as unexplained segfaults.

brada4 commented 4 years ago

@carlkl - wait a second - you mean the huge allocation is per configured maximum and not per actual CPU cores visible? That's certainly in excess... Typical systems have a gigabyte to ten per core, just that it is not all meant for a single task.

jonringer commented 4 years ago

do you have libopenblas aliased to liblapack, or could this be a mixup of library builds?

Yes, Nixpkgs has them aliased.

$ ls -l ./result/lib/lib*
lrwxrwxrwx  23 root 31 Dec  1969 ./result/lib/libblas.so -> libopenblasp-r0.3.12.so
lrwxrwxrwx  23 root 31 Dec  1969 ./result/lib/libblas.so.3 -> libopenblasp-r0.3.12.so
lrwxrwxrwx  23 root 31 Dec  1969 ./result/lib/libcblas.so -> libopenblasp-r0.3.12.so
lrwxrwxrwx  23 root 31 Dec  1969 ./result/lib/libcblas.so.3 -> libopenblasp-r0.3.12.so
lrwxrwxrwx  23 root 31 Dec  1969 ./result/lib/liblapack.so -> libopenblasp-r0.3.12.so
lrwxrwxrwx  23 root 31 Dec  1969 ./result/lib/liblapack.so.3 -> libopenblasp-r0.3.12.so
lrwxrwxrwx  23 root 31 Dec  1969 ./result/lib/liblapacke.so -> libopenblasp-r0.3.12.so
lrwxrwxrwx  23 root 31 Dec  1969 ./result/lib/liblapacke.so.3 -> libopenblasp-r0.3.12.so
lrwxrwxrwx  23 root 31 Dec  1969 ./result/lib/libopenblas.so -> libopenblasp-r0.3.12.so
lrwxrwxrwx  23 root 31 Dec  1969 ./result/lib/libopenblas.so.0 -> libopenblasp-r0.3.12.so
.r-xr-xr-x 27M root 31 Dec  1969 ./result/lib/libopenblasp-r0.3.12.so

In the case of numpy, it was linked against the lapack package which took the shared libraries from libopenblas.

blas libraries are "normalized" in their outputs, so that users can choose their blas implementation and rebuild all dependent packages; by default, openblas will be chosen.

carlkl commented 4 years ago

@brada4, this is from Makefile.rule: "Due to the way some internal structures are allocated, using a large NUM_THREADS value has a RAM footprint penalty, even if users reduce the actual number of threads at runtime." However, I'm not sure whether the number of actual CPU cores limits this footprint at startup to <= max(NUM_THREADS, no. of CPU cores). @martin-frbg should know.

brada4 commented 4 years ago

I remember strictly adding one page in every few-processor build once (#1858), but that is non-TLS stuff. Probably more feet were added in the meantime.

mattip commented 4 years ago

It didn't crash but the test suite had strange errors such as unexplained segfaults.

Could it be that not all malloc calls are checked for errors, like in

martin-frbg commented 4 years ago

Obviously... OTOH it is a bit unclear how OpenBLAS should react, "historically" some things are simply expected to work. (Printing an error message would probably be nice, if memory management hasn't been trashed too badly already at that point)

(And before you ask - it used to be worse...)

martin-frbg commented 4 years ago

Hmm. Had understood the original issue ticket to have a docker image for reproducing the problem, but the google drive seems to contain only screenshots of the docker version and computer hardware involved.

brada4 commented 4 years ago

"prod with error" is not expected to work. Ubuntu 16 kernel has no provision of AVX512 XSAVE.

carlkl commented 4 years ago

@brada4, do you mean some individual SIMD instruction-set extensions are disabled (via the XCR0 register) by the virtualized OS within Docker, and this is the reason for the segfaults?

brada4 commented 4 years ago

AVX512 extensions are visible in a KVM guest, but using them leads to numeric eurekas, which likely signify register corruption. The HWE kernel (4.15.xxxx-generic) on both sides functions properly. I did not try 18LTS usermode over a 16LTS kernel, though. Proper tests with significance here would be trying OPENBLAS_NUM_THREADS=1 and OPENBLAS_CORETYPE=(haswell|sandybridge) and seeing whether the problem goes or stays.
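
Concretely, that test could look like the line below (a sketch; OPENBLAS_CORETYPE only takes effect in DYNAMIC_ARCH builds such as the wheels, and the test command is illustrative):

```
# single-threaded, forced onto a pre-AVX512 kernel set
OPENBLAS_NUM_THREADS=1 OPENBLAS_CORETYPE=HASWELL python -c "import numpy; numpy.test()"
```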

mattip commented 4 years ago

What are "numeric eurekas"? If I understand correctly, you are saying that anyone with a CPU that supports AVX512 extensions should be able to crash OpenBLAS by running OpenBLAS/NumPy tests in a KVM guest?

martin-frbg commented 4 years ago

Seems quite unlikely to me - there was code using AVX512 extensions in 0.3.9 already, and the DYNAMIC_ARCH builds are supposed to do a runtime check for actual availability. I'd be much more interested in results from a build with the smaller 0.3.9 BUFFERSIZE, or a simple self-contained reproducer. If anything, I would expect AVX512 mishandling to result in SIGILL or SIGFPE (or silent garbage results), not SIGSEGV.

mattip commented 4 years ago

Had understood the original issue ticket to have a docker image for reproducing the problem

Sorry, I corrected that in the description. FWIW, the information provided indicates the host machine has 128GB and 31 cores. I can't replicate that locally unfortunately. How much memory does the default BUFFER_SIZE allocate per thread?

martin-frbg commented 4 years ago

128MB; with 0.3.9 it was only 32MB (written as a bitshift in common_x86_64.h, it is "32<<22" vs "32<<20", hence the suggestion to specify "20" in https://github.com/xianyi/OpenBLAS/issues/2970#issuecomment-721281546).
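
Spelled out (just evaluating those shifts; the macro values are as quoted above):

```python
# BUFFER_SIZE per thread, as the bitshifts in common_x86_64.h evaluate
print((32 << 20) >> 20)  # 0.3.9 default:  32 (MiB)
print((32 << 22) >> 20)  # 0.3.12 default: 128 (MiB)
# hence "make BUFFERSIZE=20" restores the pre-0.3.11 footprint
```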

brada4 commented 4 years ago

What are "numeric eurekas"? If I understand correctly, you are saying that anyone with a CPU that supports AVX512 extensions should be able to crash OpenBLAS by running OpenBLAS/NumPy tests in a KVM guest?

Completely corrupt numeric results - basically all openblas tests corrupt, ssh drops out often, etc. Anyone running Ubuntu 16 on both sides of KVM and rigging the compiler to have AVX-512 supported - then yes, it stops working.

jonringer commented 4 years ago

This may be unrelated to the other discussion items, but I succeeded in finding the offending commit. I also verified that reverting the commit allowed me to build numpy and run the numpy tests on a 3990X. Offending commit: 3094fc6c83c7a623f9a7e7846eb711a8a99ddfff

Testing steps: Nix allows you to pin certain packages and substitute them in all downstream packages. Builds are hermetic, and usually byte-reproducible. I started a git bisect from the 0.3.10 release up to the current develop branch: git bisect start HEAD 63b03ef

default.nix:

```nix
let
  openblas_overlay = self: super: {
    # replace all references of openblas in nixpkgs with local version
    openblas = super.openblas.overrideAttrs (oldAttrs: {
      src = super.lib.cleanSource ./.;
    });
  };
in
# create new package set, referencing local version of openblas
import ../nixpkgs { overlays = [ openblas_overlay ]; }
```

bisect command:

git bisect run nix-build default.nix -A python3Packages.numpy --no-build-output --cores 128

gist of run: https://gist.github.com/jonringer/3d9351b8f1e153f5a7275975880b4319

This also seems to align with my previous comment https://github.com/xianyi/OpenBLAS/issues/2970#issuecomment-722735324, in which the segfault occurred inside libgomp.

martin-frbg commented 3 years ago

The libgomp segfault is not reproduced on a Ryzen5-4600H with current develop (with and without INTERFACE64 and symbol suffixing) and gcc 7.5. Rerunning the builds with gcc 9.3 now, but no segfault so far.

mattip commented 3 years ago

@jonringer can you try out the NumPy wheels from https://anaconda.org/scipy-wheels-nightly/numpy ? We built the last round of wheels with BUFFERSIZE=20, so it would be nice to know if that will work as a stop-gap for the upcoming NumPy release (assuming the wheels installed via pip install numpy==1.19.3, with stock OpenBLAS 0.3.12, crash for you).

martin-frbg commented 3 years ago

Thx @mattip - obviously my 4600 is a poor substitute for a Threadripper, but at least his does appear to be some separate issue with OpenMP and forks after all. (Wonder how much memory that TR has - obviously a fork() would be a perfect place to run out of it.)