openblas v0.3.14 - Githubissues

regro-cf-autotick-bot commented 3 years ago

It is very likely that the current package version for this feedstock is out of date.

Notes for merging this PR:
- Feel free to push to the bot's branch to update this PR if needed.
- The bot will almost always only open one PR per version.

Checklist before merging this PR:
- [ ] Dependencies have been updated if changed: see upstream
- [ ] Tests have passed
- [ ] Updated license if changed and license_file is packaged

Note that the bot will stop issuing PRs if more than 3 Version bump PRs generated by the bot are open. If you don't want to package a particular version please close the PR.

NEW: If you want these PRs to be merged automatically, make an issue with @conda-forge-admin, please add bot automerge in the title and merge the resulting PR. This command will add our new bot automerge feature to your feedstock!

If this PR was opened in error or needs to be updated please add the bot-rerun label to this PR. The bot will close this PR and schedule another one. If you do not have permissions to add this label, you can use the phrase @conda-forge-admin, please rerun bot in a PR comment to have the conda-forge-admin add it for you.

This PR was created by the regro-cf-autotick-bot.
The regro-cf-autotick-bot is a service to automatically track the dependency graph, migrate packages, and propose package version updates for conda-forge. If you would like a local version of this bot, you might consider using rever. Rever is a tool for automating software releases and forms the backbone of the bot's conda-forge PRing capability. Rever is both conda (conda install -c conda-forge rever) and pip (pip install re-ver) installable.
Finally, feel free to drop us a line if there are any issues!
This PR was generated by https://github.com/regro/autotick-bot/actions/runs/662380783, please use this URL for debugging

Here is a list of all the pending dependencies (and their versions) for this repo. Please double check all dependencies before merging.

Name	Upstream Version	Current Version
openblas	0.3.14

Dependency Analysis

We couldn't run dependency analysis due to an internal error in the bot. :( Help is very welcome!

conda-forge-linter commented 3 years ago

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe) and found it was in an excellent condition.

h-vetinari commented 3 years ago

If I understand https://github.com/xianyi/OpenBLAS/pull/3102 correctly, 0001-Fix-gfortran-detection-for-ctng-based-cross-compiler.patch shouldn't be necessary anymore, so I dropped it. Can of course be re-added if I'm overlooking something.

h-vetinari commented 3 years ago

There's an error for osx-arm64:

In file included from getarch_2nd.c:7:
In file included from ./common.h:457:
./common_arm64.h:69:38: error: unknown register name 'x2' in asm
                         : "memory", "x2" , "x3", "x4"
                                     ^
1 error generated.
make: *** [Makefile.prebuild:74: getarch_2nd] Error 1

The code in question hasn't been touched in 5 years, but I think it might be due to https://github.com/xianyi/OpenBLAS/commit/34753eaebb8b2ddbc256e9e996c1fb315396a2a0. I'll check whether reverting makes a difference.

h-vetinari commented 3 years ago

Some segfaults on osx with openmp:

OMP_NUM_THREADS=2 ./xccblat1
OMP_NUM_THREADS=2 ./xscblat3 < sin3

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
 TESTS OF THE REAL             LEVEL 2 BLAS

h-vetinari commented 3 years ago

@conda-forge/openblas Any ideas how to tackle the openmp segfaults on osx?

(Also, based on conda_build_config, I'd expect 4 jobs for osx_64 like for linux, but we only get 2...)

h-vetinari commented 3 years ago

Ping @conda-forge/openblas

h-vetinari commented 3 years ago

@martin-frbg Do such segfaults on osx ring a bell for you?

martin-frbg commented 3 years ago

Not immediately - but I realize now that OpenBLAS lacks a CI job for osx in combination with OpenMP. (I do not have any Apple hardware locally). UPDATE: error not seen with gcc-10/gfortran-10 on Azure, still trying to get around the apparent lack of OpenMP support in AppleClang

martin-frbg commented 3 years ago

Have now added a build with the homebrew llvm 11.1 and gfortran-10 to the OpenBLAS Azure config, and it passes.

isuruf commented 3 years ago

@martin-frbg, is that with INTERFACE64=1?

martin-frbg commented 3 years ago

No - did not notice that detail. (Do not think anything changed lately with respect to that option though)

isuruf commented 3 years ago

Yeah, it's weird. INTERFACE64=0 doesn't segfault and only INTERFACE64=1 does.

martin-frbg commented 3 years ago

Reproduced on Azure, repeating with a DEBUG=1 build now to try and find out what and where. EDIT: inconclusive - could not redo the previous DYNAMIC_ARCH build as clang with -g ran out of registers in some kernels. Building for the host produced a Sandybridge kernel that did not segfault in any of the tests, while your build appears to have been for Haswell ?

h-vetinari commented 3 years ago

Thanks for looking at this @martin-frbg!

Copying your edit here since those don't produce new pings and are therefore easy to overlook:

@martin-frbg: EDIT: inconclusive - could not redo the previous DYNAMIC_ARCH build as clang with -g ran out of registers in some kernels. Building for the host produced a Sandybridge kernel that did not segfault in any of the tests, while your build appears to have been for Haswell ?

Conda-forge usually goes for the lowest (reasonable) common denominator; in build.sh you can see that the target for osx is TARGET="CORE2".

martin-frbg commented 3 years ago

Retried with TARGET=CORE2 now and did not see any segfault either.

martin-frbg commented 3 years ago

Perhaps if you could add DEBUG=1 to your build, this would provide a hint where to look ? I believe I am using the same base image for the Azure job, but most likely a different build of clang

h-vetinari commented 3 years ago

@martin-frbg I added DEBUG="1" to the osx build, but I'm not sure what I'm looking for...

martin-frbg commented 3 years ago

@h-vetinari it is confusing me as well - I would have expected the backtrace to show function names and ideally even source lines from OpenBLAS now, but perhaps the/your osx environment does not work that way ?

martin-frbg commented 3 years ago

The quotes do not matter on Linux (at least) - basically all the DEBUG=1 does is add -g to the compiler flags. Unfortunately I know too little about the osx/xcode/conda environment to assess whether this is sufficient to create debugging information in a format that the default tools can utilize.

h-vetinari commented 3 years ago

I'm afraid we'll need to wait on this until @isuruf has time for comments / input.

martin-frbg commented 3 years ago

Saw a stackoverflow post that suggests the stacktrace should be symbolized just like on Linux (and the -g is visible in the build log). On the other hand there appears to be a dedicated tool named llvm-symbolizer to produce source line output from an object name and address.
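As a rough sketch of the llvm-symbolizer route mentioned above: the tool takes an object file and one or more addresses and prints the matching function name and source location. The helper below only builds the command line and runs it if the tool is found on PATH. The binary name and the address are placeholders, not values from the failing build, and a runtime address taken from a crash report would generally have to be translated to an address within the file first (for example by accounting for the load address), which this sketch does not attempt.

```python
import shutil
import subprocess


def symbolizer_command(binary, addresses):
    """Build an llvm-symbolizer invocation for one object file."""
    return ["llvm-symbolizer", f"--obj={binary}", *[hex(a) for a in addresses]]


def symbolize(binary, addresses):
    """Run llvm-symbolizer if it is installed; return its output, else None."""
    if shutil.which("llvm-symbolizer") is None:
        return None
    result = subprocess.run(
        symbolizer_command(binary, addresses),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


# Placeholder binary and address, for illustration only.
cmd = symbolizer_command("./xscblat3", [0x1000])
print(" ".join(cmd))
```

Whether the output contains source lines depends on the debug information (from -g) being available to the tool, which is exactly the open question in this thread.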

h-vetinari commented 3 years ago

@isuruf

Could you please help with getting a DEBUG build running here to be able to see the stacktrace etc.? I failed with my approach.

martin-frbg commented 3 years ago

Unfortunately this has now caught on two implicitly defined variables in LAPACK (presumably something else brought in -fimplicit-none), trivial fix is in https://github.com/xianyi/OpenBLAS/pull/3178

h-vetinari commented 3 years ago

@martin-frbg It would be pretty easy to carry this patch, but before doing that - 0.3.14 was downgraded to a pre-release. Does it make sense in your opinion to still pursue this PR (or wait for 0.3.15)? AFAICT, this was due to some AVX512 regression, which shouldn't affect the builds here, as we don't make use of such advanced instructions.

martin-frbg commented 3 years ago

I don't really know - from my perspective, 0.3.15 is held back in part by this issue... but it now looks to me as if we are (maybe) back to the bad situation where we are linking both GNU libgomp and LLVM libomp.

martin-frbg commented 3 years ago

(when we are using gfortran to compile and link, that is - as in the test and ctest, where -fopenmp automatically implies -lgomp and the build system adds -lomp - ISTR that LLVM tries/tried to work around this situation by symlinking their libomp to the libgomp name but perhaps that was discontinued or does not work as expected here)

martin-frbg commented 3 years ago

The CI logs for https://github.com/xianyi/OpenBLAS/pull/3166 show a linker warning about a mismatch between what the libraries from homebrew were built on and the actual OSX version in use (10.15 vs. 10.8), but the side effect is that it clearly shows that both libgomp.dylib from gcc10 and libomp.dylib from llvm end up in the ctest binaries despite having only -lomp expressly on the command line. I suspect this is the case in your failing builds as well, except that you do not get to see the "convenient" warning as your library versions are properly matched.

This is utterly frustrating - seems there is (still) no reliable way to consolidate use of exactly one OpenMP runtime in a mixed-language project that is built with both LLVM and GCC compilers. gfortran has -fopenmp to enable both parsing of OpenMP pragmas and implied linking with libgomp, and clang likewise uses -fopenmp to set up for and link with its own libomp/libiomp. ABI compatibility may be sufficient so that use of either library works at runtime, but only as long as symlink trickery ensures that both names actually lead to the same object. If that is not the case, both libgomp and libomp appear to get loaded, probably overwriting each other's symbol tables and local variables. (Seems there briefly was a concept of supporting -fopenmp=libgomp to ensure actual linking with the GNU implementation but by all accounts it is a no-op).
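One way to check a test binary for the double-runtime situation described here is to look at its linked libraries, for example with otool -L on macOS. The sketch below only parses such output and reports which OpenMP runtime families are named in it. The sample text is made up for illustration (the paths and version numbers are not from any real log), and the set of library names the pattern looks for is an assumption.

```python
import re

# Matches names such as libomp.dylib, libgomp.1.dylib or libiomp5.dylib.
OPENMP_RUNTIME = re.compile(r"lib(gomp|omp|iomp5)(\.\d+)*\.dylib")


def openmp_runtimes(otool_output):
    """Return the set of OpenMP runtime families named in `otool -L` output."""
    found = set()
    for line in otool_output.splitlines():
        match = OPENMP_RUNTIME.search(line)
        if match:
            found.add(match.group(1))
    return found


# Hypothetical `otool -L` output for a test binary.
SAMPLE = """xscblat3:
\t/usr/local/opt/llvm/lib/libomp.dylib (compatibility version 5.0.0, current version 5.0.0)
\t/usr/local/opt/gcc/lib/gcc/10/libgomp.1.dylib (compatibility version 2.0.0, current version 2.0.0)
\t/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1281.100.1)
"""

runtimes = openmp_runtimes(SAMPLE)
if len(runtimes) > 1:
    print("more than one OpenMP runtime linked:", sorted(runtimes))
```

Seeing two names in the output does not by itself prove two runtimes get loaded, since one name may be a symlink to the other; the names would still need to be resolved on disk.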

isuruf commented 3 years ago

> I suspect this is the case in your failing builds as well, except that you do not get to see the "convenient" warning as your library versions are properly matched.

No, we use libomp.dylib exclusively and libgomp.1.dylib is a symlink to libomp.dylib

martin-frbg commented 3 years ago

Sure there is no libgomp.dylib in your CI ? Crashing somewhere in the OpenMP library would seem to be a plausible explanation for the lack of symbols even after building with -g (much as I hate the scenario I laid out above). Maybe I will have to give in and buy a Mac specifically for debugging OpenBLAS, there are just too many unknowns in the OSX ecosystem for me to tackle this with remote CI builds alone.
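The question of whether a stray libgomp.dylib exists next to the symlinked one can be answered by resolving every OpenMP library name in the library directory and checking that they all lead to the same file. A minimal sketch follows; the file names in the demo are created in a temporary directory purely for illustration, and the prefix match on "libgomp"/"libomp" is an assumption about how the libraries are named.

```python
import os
import tempfile


def resolve_openmp_libs(lib_dir):
    """Map each libgomp*/libomp* name in lib_dir to the real file behind it."""
    resolved = {}
    for name in sorted(os.listdir(lib_dir)):
        if name.startswith(("libgomp", "libomp")):
            resolved[name] = os.path.realpath(os.path.join(lib_dir, name))
    return resolved


def single_runtime(lib_dir):
    """True if every OpenMP library name in lib_dir leads to the same file."""
    return len(set(resolve_openmp_libs(lib_dir).values())) <= 1


# Demo with empty stand-in files in a temporary directory.
with tempfile.TemporaryDirectory() as lib_dir:
    open(os.path.join(lib_dir, "libomp.dylib"), "w").close()
    os.symlink("libomp.dylib", os.path.join(lib_dir, "libgomp.1.dylib"))
    symlinked_only = single_runtime(lib_dir)

    # A second, independent file under another libgomp name breaks the setup.
    open(os.path.join(lib_dir, "libgomp.dylib"), "w").close()
    with_extra_copy = single_runtime(lib_dir)

print(symlinked_only, with_extra_copy)
```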

isuruf commented 3 years ago

> Maybe I will have to give in and buy a Mac specifically for debugging OpenBLAS, there are just too many unknowns in the OSX ecosystem for me to tackle this with remote CI builds alone.

Hmm, let me see how I can get you access to a Mac.

isuruf commented 3 years ago

I can't reproduce this on an ivybridge macos.

martin-frbg commented 3 years ago

Would match my experience building for the CI target (Sandybridge as far as OpenBLAS is concerned) instead of DYNAMIC_ARCH. No idea yet why that would make a difference (nor am I aware of any 0.3.14 changes that would affect OpenMP)

isuruf commented 3 years ago

Could reproduce on a friend's sandybridge mac.

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x7)
  * frame #0: 0x0000000000000007
    frame #1: 0x0000000100025c98 xscblat2`cblas_sgemv(order=<unavailable>, TransA=<unavailable>, m=2, n=1, alpha=<unavailable>, a=0x00007ffeefbf32a0, lda=3, x=0x00007ffeefbee850, incx=1, beta=<unavailable>, y=0x00007ffeefbeec70, incy=1) at gemv.c:201 [opt]
    frame #2: 0x000000010001f3e3 xscblat2`csgemv_(order=<unavailable>, transp=<unavailable>, m=0x00007ffeefbedb68, n=0x00007ffeefbedb78, alpha=0x00007ffeefbedb0c, a=0x00007ffeefbf32a0, lda=0x00007ffeefbedb50, x=0x00007ffeefbee850, incx=0x00007ffeefbedb28, beta=0x00007ffeefbedb10, y=0x00007ffeefbeec70, incy=0x00007ffeefbedb30) at c_sblas2.c:0 [opt]
    frame #3: 0x000000010001b412 xscblat2`schk1_ at c_sblat2.f:655
    frame #4: 0x000000010001e615 xscblat2`MAIN__ at c_sblat2.f:319
    frame #5: 0x0000000102d3c688 xscblat2`main at c_sblat2.f:455

isuruf commented 3 years ago

If I remove DYNAMIC_ARCH=1, segfault is gone.

isuruf commented 3 years ago

TARGET=PRESCOTT DYNAMIC_LIST=PRESCOTT DYNAMIC_ARCH=1 fails
TARGET=PRESCOTT DYNAMIC_ARCH=0 succeeds
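Narrowing the failure down this way amounts to building the same source with one option flipped at a time. The sketch below generates the make command lines for such a matrix without running them. TARGET, DYNAMIC_ARCH and INTERFACE64 are the options discussed in this thread; including USE_OPENMP=1 in the base set is an assumption about how the OpenMP build is enabled, and DYNAMIC_LIST is left out for brevity.

```python
from itertools import product


def make_command(options):
    """Turn a dict of build options into a make argument list."""
    return ["make"] + [f"{key}={value}" for key, value in options.items()]


def variants(base, axes):
    """Yield base options combined with every combination of the axis values."""
    names = list(axes)
    for values in product(*(axes[name] for name in names)):
        yield {**base, **dict(zip(names, values))}


base = {"TARGET": "PRESCOTT", "USE_OPENMP": 1}
axes = {"DYNAMIC_ARCH": [0, 1], "INTERFACE64": [0, 1]}

commands = [make_command(options) for options in variants(base, axes)]
for command in commands:
    print(" ".join(command))
```

Each printed line could then be run in a clean checkout, followed by the test programs, to record which combinations segfault.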

isuruf commented 3 years ago

Is it intentional that ctest/c_sblas2.c uses int instead of blasint?

martin-frbg commented 3 years ago

No, but I'm pretty sure these tests were simply borrowed from the reference BLAS at some time in the (probably distant) past where blasint did not exist. (valgrind has turned up some oddities in the meantime, but again nothing from recent changes, maybe this occurring now is just a coincidence of the compilers having become smart enough to do dangerous things to fragile code)
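For readers unfamiliar with why int versus blasint is worth asking about: with INTERFACE64=1, blasint is a 64-bit integer, so code that stores a 32-bit int where a 64-bit value is later read picks up whatever lies in the neighbouring four bytes. The sketch below shows only this general hazard using packed bytes on a little-endian layout; it is an illustration, not a diagnosis of the segfault in this PR, and the values 2 and 1 are arbitrary.

```python
import struct

# Two adjacent 32-bit little-endian integers, e.g. m = 2 followed by n = 1.
memory = struct.pack("<ii", 2, 1)

# Reading the first value with the width it was written with.
as_int32 = struct.unpack_from("<i", memory, 0)[0]

# Reading the same location as a 64-bit integer also swallows the neighbour.
as_int64 = struct.unpack_from("<q", memory, 0)[0]

print(as_int32)  # 2
print(as_int64)  # 4294967298, i.e. 2 + (1 << 32)
```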

martin-frbg commented 3 years ago

I no longer think the int/blasint issue has any bearing on the segfault, but I do not yet know what has. With separate libgomp and libomp on Linux, valgrind tells me that the LLVM libomp is making a call to sched_setaffinity with a NULL mask in response to the omp_get_max_threads() query in num_cpu_avail (common_thread.h) - probably a sign of the two OpenMP implementations colliding, or at least one not being initialized correctly. For TARGET=SANDYBRIDGE, no other error is recorded, while for the older PRESCOTT and CORE2 targets an illegal read of 8 bytes beyond the actual data happens in their SSE ssymv_U kernel. (Neither of these is fatal in normal operations)

martin-frbg commented 3 years ago

Finally got a working valgrind for OSX (via brew tap LouisBrunner/valgrind) but did not get more information than what was in your lldb backtrace:

2021-04-13T14:37:04.7064630Z ==54809== Jump to the invalid address stated on the next line
2021-04-13T14:37:04.7070760Z ==54809==    at 0x6: ???
2021-04-13T14:37:04.7072620Z ==54809==    by 0x100022403: csgemv_ (c_sblas2.c:31)
2021-04-13T14:37:04.7073920Z ==54809==    by 0x10001D969: schk1_ (c_sblat2.f:655)
2021-04-13T14:37:04.7076220Z ==54809==    by 0x1000212C1: MAIN__ (c_sblat2.f:319)
2021-04-13T14:37:04.7081440Z ==54809==    by 0x1000221C4: main (c_sblat2.f:455)
2021-04-13T14:37:04.7083720Z ==54809==  Address 0x6 is not stack'd, malloc'd or (recently) free'd
2021-04-13T14:37:04.7491950Z ==54809== Process terminating with default action of signal 11 (SIGSEGV)
2021-04-13T14:37:04.7493870Z ==54809==    at 0x6: ???
2021-04-13T14:37:04.7495920Z ==54809==    by 0x100022403: csgemv_ (c_sblas2.c:31)
2021-04-13T14:37:04.7498490Z ==54809==    by 0x10001D969: schk1_ (c_sblat2.f:655)
2021-04-13T14:37:04.7500500Z ==54809==    by 0x1000212C1: MAIN__ (c_sblat2.f:319)
2021-04-13T14:37:04.7502170Z ==54809==    by 0x1000221C4: main (c_sblat2.f:455)

which is looking as if the function address of cblas_sgemv is corrupted (and similarly for the CBLAS3 test, it claims that in interface/gemm.c line 437 the respective gemm_XX function pointer is NULL)

h-vetinari commented 3 years ago

Thanks a lot @martin-frbg for working hard to figure this one out!

martin-frbg commented 3 years ago

Actually it appears to be the function pointer to the SSCAL kernel that gets trashed "at some point" in the CBLAS2 test (the SCAL_K in interface/gemv.c that maps to gotoblas->scal_k_SANDYBRIDGE), so it is the first invocation of SGEMV with a beta not equal to 1 that crashes. No idea yet why/where it gets overwritten in xscblat2, but the problem does not occur with gcc+gfortran so could be a clang or interoperability issue.
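The reasoning about beta can be modelled in a few lines: if the scaling kernel is only looked up and called when beta differs from 1, a damaged table entry stays unnoticed until the first such call. The following is a simplified Python model of that control flow, not the actual interface/gemv.c code; the dictionary stands in for the kernel table, the kernel names are hypothetical, and a None entry stands in for a trashed function pointer (which raises TypeError here rather than crashing).

```python
def scal_k(beta, y):
    """Stand-in scaling kernel: y := beta * y."""
    return [beta * value for value in y]


def gemv_n(alpha, a, x, y):
    """Stand-in gemv kernel: y := y + alpha * (a @ x), a as a list of rows."""
    return [
        yi + alpha * sum(aij * xj for aij, xj in zip(row, x))
        for row, yi in zip(a, y)
    ]


def gemv(kernels, alpha, a, x, beta, y):
    """y := alpha * (a @ x) + beta * y, dispatching through a kernel table."""
    if beta != 1.0:
        # The scaling kernel is only touched on this path.
        y = kernels["scal_k"](beta, y)
    if alpha == 0.0:
        return y
    return kernels["gemv_n"](alpha, a, x, y)


healthy = {"scal_k": scal_k, "gemv_n": gemv_n}
corrupted = dict(healthy, scal_k=None)  # models the overwritten entry

a = [[1.0, 2.0], [3.0, 4.0]]
x = [1.0, 1.0]
y = [1.0, 1.0]

print(gemv(healthy, 1.0, a, x, 0.5, y))    # scaling path works
print(gemv(corrupted, 1.0, a, x, 1.0, y))  # beta == 1: damage goes unnoticed
try:
    gemv(corrupted, 1.0, a, x, 0.5, y)     # beta != 1: first use of the entry
except TypeError as error:
    print("call through damaged entry failed:", error)
```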

conda-forge / openblas-feedstock

openblas v0.3.14 #116
