Closed regro-cf-autotick-bot closed 3 years ago
Hi! This is the friendly automated conda-forge-linting service.
I just wanted to let you know that I linted all conda-recipes in your PR (recipe) and found it was in an excellent condition.
If I understand https://github.com/xianyi/OpenBLAS/pull/3102 correctly, 0001-Fix-gfortran-detection-for-ctng-based-cross-compiler.patch
shouldn't be necessary anymore, so I dropped it. Can of course be re-added if I'm overlooking something.
There's an error for osx-arm64:
In file included from getarch_2nd.c:7:
In file included from ./common.h:457:
./common_arm64.h:69:38: error: unknown register name 'x2' in asm
: "memory", "x2" , "x3", "x4"
^
1 error generated.
make: *** [Makefile.prebuild:74: getarch_2nd] Error 1
The code in question hasn't been touched in 5 years, but I think it might be due to https://github.com/xianyi/OpenBLAS/commit/34753eaebb8b2ddbc256e9e996c1fb315396a2a0. I'll check whether reverting makes a difference.
Some segfaults on osx with openmp:
OMP_NUM_THREADS=2 ./xccblat1
OMP_NUM_THREADS=2 ./xscblat3 < sin3
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
TESTS OF THE REAL LEVEL 2 BLAS
@conda-forge/openblas Any ideas how to tackle the openmp segfaults on osx?
(Also, based on conda_build_config, I'd expect 4 jobs for osx_64 like for linux, but we only get 2...)
Ping @conda-forge/openblas
@martin-frbg Do such segfaults on osx ring a bell for you?
Not immediately - but I realize now that OpenBLAS lacks a CI job for osx in combination with OpenMP. (I do not have any Apple hardware locally.) UPDATE: error not seen with gcc-10/gfortran-10 on Azure, still trying to get around the apparent lack of OpenMP support in AppleClang.
Have now added a build with the homebrew llvm 11.1 and gfortran-10 to the OpenBLAS Azure config, and it passes.
@martin-frbg, is that with INTERFACE64=1?
No - did not notice that detail. (Do not think anything changed lately with respect to that option though)
Yeah, it's weird. INTERFACE64=0 doesn't segfault and only INTERFACE64=1 does.
Reproduced on Azure, repeating with a DEBUG=1 build now to try and find out what and where.
EDIT: inconclusive - could not redo the previous DYNAMIC_ARCH build as clang with -g ran out of registers in some kernels.
Building for the host produced a Sandybridge kernel that did not segfault in any of the tests, while your build appears to have been for Haswell?
Thanks for looking at this @martin-frbg!
Copying your edit here since those don't produce new pings and are therefore easy to overlook:
@martin-frbg: EDIT: inconclusive - could not redo the previous DYNAMIC_ARCH build as clang with -g ran out of registers in some kernels. Building for the host produced a Sandybridge kernel that did not segfault in any of the tests, while your build appears to have been for Haswell?
Conda-forge usually goes for the lowest (reasonable) common denominator; in build.sh you can see that the target for osx is TARGET="CORE2".
Retried with TARGET=CORE2 now and did not see any segfault either.
Perhaps you could add DEBUG=1 to your build; this might provide a hint where to look. I believe I am using the same base image for the Azure job, but most likely a different build of clang.
@martin-frbg
I added DEBUG="1" to the osx build, but I'm not sure what I'm looking for...
@h-vetinari it is confusing me as well - I would have expected the backtrace to show function names and ideally even source lines from OpenBLAS now, but perhaps the/your osx environment does not work that way ?
The quotes do not matter on Linux (at least) - basically all the DEBUG=1 does is add -g to the compiler flags. Unfortunately I know too little about the osx/xcode/conda environment to assess whether this is sufficient to create debugging information in a format that the default tools can utilize.
I'm afraid we'll need to wait on this until @isuruf has time for comments / input.
Saw a stackoverflow post that suggests the stacktrace should be symbolized just like on Linux (and the -g is visible in the build log). On the other hand there appears to be a dedicated tool named llvm-symbolizer to produce source line output from an object name and address.
@isuruf
Could you please help with getting a DEBUG build running here to be able to see the stacktrace etc.? I failed with my approach.
Unfortunately this has now caught on two implicitly defined variables in LAPACK (presumably something else brought in -fimplicit-none), trivial fix is in https://github.com/xianyi/OpenBLAS/pull/3178
@martin-frbg It would be pretty easy to carry this patch, but before doing that - 0.3.14 was downgraded to a pre-release. Does it make sense in your opinion to still pursue this PR (or wait for 0.3.15)? AFAICT, this was due to some AVX512 regression, which shouldn't affect the builds here, as we don't make use of such advanced instructions.
I don't really know - from my perspective, 0.3.15 is held back in part by this issue... but it now looks to me as if we are (maybe) back to the bad situation where we are linking both GNU libgomp and LLVM libomp.
(when we are using gfortran to compile and link, that is - as in the test and ctest, where -fopenmp automatically implies -lgomp and the build system adds -lomp - ISTR that LLVM tries/tried to work around this situation by symlinking their libomp to the libgomp name but perhaps that was discontinued or does not work as expected here)
The CI logs for https://github.com/xianyi/OpenBLAS/pull/3166 show a linker warning about a mismatch between what the libraries from homebrew were built on and the actual OSX version in use (10.15 vs. 10.8), but the side effect is that it clearly shows that both libgomp.dylib from gcc10 and libomp.dylib from llvm end up in the ctest binaries despite having only -lomp expressly on the command line. I suspect this is the case in your failing builds as well, except that you do not get to see the "convenient" warning as your library versions are properly matched.
This is utterly frustrating - seems there is (still) no reliable way to consolidate use of exactly one OpenMP runtime in a mixed-language project that is built with both LLVM and GCC compilers. gfortran has -fopenmp to enable both parsing of OpenMP pragmas and implied linking with libgomp, and clang likewise uses -fopenmp to set up for and link with its own libomp/libiomp. ABI compatibility may be sufficient so that use of *either* library works at runtime, but only as long as symlink trickery ensures that both names actually lead to the same object. If that is not the case, both libgomp and libomp appear to get loaded, probably overwriting each other's symbol tables and local variables.
(Seems there briefly was a concept of supporting -fopenmp=libgomp to ensure actual linking with the GNU implementation but by all accounts it is a no-op).
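One way to check the suspicion above is to look at which OpenMP runtime names a test binary records as dependencies, for example from the output of `otool -L` on macOS or `ldd` on Linux. The following is a minimal sketch, not part of OpenBLAS or the feedstock: the helper name `openmp_runtimes` and the sample listing are made up for illustration, and the listing only mimics the general shape of `otool -L` output.

```python
import re

# Hypothetical listing in the general shape of `otool -L xscblat2` output.
# The library versions shown here are invented for illustration.
SAMPLE_LISTING = (
    "xscblat2:\n"
    "\t@rpath/libgomp.1.dylib (compatibility version 2.0.0, current version 2.0.0)\n"
    "\t@rpath/libomp.dylib (compatibility version 5.0.0, current version 5.0.0)\n"
    "\t/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1281.100.1)\n"
)


def openmp_runtimes(listing):
    """Return the OpenMP runtime families ("gomp", "omp", "iomp5") named in a listing."""
    found = set()
    for line in listing.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        # The first token is the library path or name; keep only the file name.
        name = tokens[0].rsplit("/", 1)[-1]
        match = re.match(r"lib(gomp|iomp5|omp)\b", name)
        if match:
            found.add(match.group(1))
    return found


if __name__ == "__main__":
    runtimes = openmp_runtimes(SAMPLE_LISTING)
    if len(runtimes) > 1:
        print("more than one OpenMP runtime name referenced:", sorted(runtimes))
```

Two different names in the listing do not by themselves prove that two different runtimes get loaded: if one name is a symlink to the other, both resolve to the same object, so the paths would still need to be resolved on the build machine.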
I suspect this is the case in your failing builds as well, except that you do not get to see the "convenient" warning as your library versions are properly matched.
No, we use libomp.dylib exclusively and libgomp.1.dylib is a symlink to libomp.dylib.
Sure there is no libgomp.dylib in your CI? Crashing somewhere in the OpenMP library would seem to be a plausible explanation for the lack of symbols even after building with -g (much as I hate the scenario I laid out above). Maybe I will have to give in and buy a Mac specifically for debugging OpenBLAS, there are just too many unknowns in the OSX ecosystem for me to tackle this with remote CI builds alone.
Maybe I will have to give in and buy a Mac specifically for debugging OpenBLAS, there are just too many unknowns in the OSX ecosystem for me to tackle this with remote CI builds alone.
Hmm, let me see how I can get you access to a Mac.
I can't reproduce this on an ivybridge macos.
Would match my experience building for the CI target (Sandybridge as far as OpenBLAS is concerned) instead of DYNAMIC_ARCH. No idea yet why that would make a difference (nor am I aware of any 0.3.14 changes that would affect OpenMP)
Could reproduce on a friend's sandybridge mac.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x7)
* frame #0: 0x0000000000000007
frame #1: 0x0000000100025c98 xscblat2`cblas_sgemv(order=<unavailable>, TransA=<unavailable>, m=2, n=1, alpha=<unavailable>, a=0x00007ffeefbf32a0, lda=3, x=0x00007ffeefbee850, incx=1, beta=<unavailable>, y=0x00007ffeefbeec70, incy=1) at gemv.c:201 [opt]
frame #2: 0x000000010001f3e3 xscblat2`csgemv_(order=<unavailable>, transp=<unavailable>, m=0x00007ffeefbedb68, n=0x00007ffeefbedb78, alpha=0x00007ffeefbedb0c, a=0x00007ffeefbf32a0, lda=0x00007ffeefbedb50, x=0x00007ffeefbee850, incx=0x00007ffeefbedb28, beta=0x00007ffeefbedb10, y=0x00007ffeefbeec70, incy=0x00007ffeefbedb30) at c_sblas2.c:0 [opt]
frame #3: 0x000000010001b412 xscblat2`schk1_ at c_sblat2.f:655
frame #4: 0x000000010001e615 xscblat2`MAIN__ at c_sblat2.f:319
frame #5: 0x0000000102d3c688 xscblat2`main at c_sblat2.f:455
If I remove DYNAMIC_ARCH=1, segfault is gone.
TARGET=PRESCOTT DYNAMIC_LIST=PRESCOTT DYNAMIC_ARCH=1 fails
TARGET=PRESCOTT DYNAMIC_ARCH=0 succeeds
Is it intentional that ctest/c_sblas2.c uses int instead of blasint?
No, but I'm pretty sure these tests were simply borrowed from the reference BLAS at some time in the (probably distant) past where blasint did not exist. (valgrind has turned up some oddities in the meantime, but again nothing from recent changes, maybe this occurring now is just a coincidence of the compilers having become smart enough to do dangerous things to fragile code)
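For readers wondering why int versus blasint is worth asking about at all: with INTERFACE64=1 the library's integer type is 8 bytes wide, while a plain C int is typically 4 bytes. The sketch below only illustrates the general hazard of reading a 4-byte value as an 8-byte one on a little-endian machine. The values are made up, and nothing here claims that this is what happens in c_sblas2.c.

```python
import struct

# Two adjacent 32-bit integers, as a caller using plain `int` might lay them out.
# The values are hypothetical and chosen only to make the effect visible.
m, neighbour = 2, 1
raw = struct.pack("<ii", m, neighbour)

# Read back with the width the caller intended (4 bytes).
as_int32 = struct.unpack_from("<i", raw)[0]

# Read back with the width a 64-bit-integer interface would assume (8 bytes):
# the neighbouring value ends up in the upper half of the result.
as_int64 = struct.unpack_from("<q", raw)[0]

print(as_int32)  # 2
print(as_int64)  # 4294967298, i.e. 2 + (1 << 32)
```

Whether such a mismatch is harmless in practice depends on which side is wider, on the byte order, and on what happens to sit next to the value in memory.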
I no longer think the int/blasint issue has any bearing on the segfault, but I do not yet know what has. With separate libgomp and libomp on Linux, valgrind tells me that the LLVM libomp is making a call to sched_setaffinity with a NULL mask in response to the omp_get_max_threads() query in num_cpu_avail (common_thread.h) - probably a sign of the two OpenMP implementations colliding, or at least one not being initialized correctly. For TARGET=SANDYBRIDGE, no other error is recorded, while for the older PRESCOTT and CORE2 targets an illegal read of 8 bytes beyond the actual data happens in their SSE ssymv_U kernel. (Neither of these is fatal in normal operations)
Finally got a working valgrind for OSX (via brew tap LouisBrunner/valgrind) but did not get more information than what was in your lldb backtrace:
2021-04-13T14:37:04.7064630Z ==54809== Jump to the invalid address stated on the next line
2021-04-13T14:37:04.7070760Z ==54809== at 0x6: ???
2021-04-13T14:37:04.7072620Z ==54809== by 0x100022403: csgemv_ (c_sblas2.c:31)
2021-04-13T14:37:04.7073920Z ==54809== by 0x10001D969: schk1_ (c_sblat2.f:655)
2021-04-13T14:37:04.7076220Z ==54809== by 0x1000212C1: MAIN__ (c_sblat2.f:319)
2021-04-13T14:37:04.7081440Z ==54809== by 0x1000221C4: main (c_sblat2.f:455)
2021-04-13T14:37:04.7083720Z ==54809== Address 0x6 is not stack'd, malloc'd or (recently) free'd
2021-04-13T14:37:04.7491950Z ==54809== Process terminating with default action of signal 11 (SIGSEGV)
2021-04-13T14:37:04.7493870Z ==54809== at 0x6: ???
2021-04-13T14:37:04.7495920Z ==54809== by 0x100022403: csgemv_ (c_sblas2.c:31)
2021-04-13T14:37:04.7498490Z ==54809== by 0x10001D969: schk1_ (c_sblat2.f:655)
2021-04-13T14:37:04.7500500Z ==54809== by 0x1000212C1: MAIN__ (c_sblat2.f:319)
2021-04-13T14:37:04.7502170Z ==54809== by 0x1000221C4: main (c_sblat2.f:455)
which is looking as if the function address of cblas_sgemv is corrupted (and similarly for the CBLAS3 test, it claims that in interface/gemm.c line 437 the respective gemm_XX function pointer is NULL)
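Both traces end in a jump to a tiny address (0x7 in lldb, 0x6 in valgrind), which is the usual signature of a call through a NULL or overwritten function pointer. As a rough aid for scanning long CI logs, the sketch below picks out frame addresses that fall below one page. The helper name `low_address_frames` and the 4096-byte threshold are assumptions made for illustration, and the pattern only covers the two trace formats quoted in this thread.

```python
import re

PAGE_SIZE = 4096  # assumption: nothing executable is mapped in the lowest page

# Matches "at 0x6", "by 0x100022403" (valgrind) and "frame #0: 0x...07" (lldb).
_FRAME_ADDRESS = re.compile(r"(?:\b(?:at|by)|frame #\d+:)\s+0x([0-9A-Fa-f]+)")


def low_address_frames(trace):
    """Return the frame addresses in a trace that are smaller than PAGE_SIZE."""
    addresses = (int(m.group(1), 16) for m in _FRAME_ADDRESS.finditer(trace))
    return [addr for addr in addresses if addr < PAGE_SIZE]


# Excerpts copied from the traces in this thread.
VALGRIND_EXCERPT = (
    "==54809== at 0x6: ???\n"
    "==54809== by 0x100022403: csgemv_ (c_sblas2.c:31)\n"
)
LLDB_EXCERPT = (
    "* frame #0: 0x0000000000000007\n"
    "frame #3: 0x000000010001b412 xscblat2`schk1_ at c_sblat2.f:655\n"
)

if __name__ == "__main__":
    print(low_address_frames(VALGRIND_EXCERPT))  # [6]
    print(low_address_frames(LLDB_EXCERPT))      # [7]
```

A hit only says that control was transferred to an implausible address. It does not say which pointer was corrupted or when.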
Thanks a lot @martin-frbg for working hard to figure this one out!
Actually it appears to be the function pointer to the SSCAL kernel that gets trashed "at some point" in the CBLAS2 test (the SCAL_K in interface/gemv.c that maps to gotoblas->scal_k_SANDYBRIDGE), so it is the first invocation of SGEMV with a beta not equal to 1 that crashes. No idea yet why/where it gets overwritten in xscblat2, but the problem does not occur with gcc+gfortran, so it could be a clang or interoperability issue.
It is very likely that the current package version for this feedstock is out of date. Notes for merging this PR:
license_file is packaged
Note that the bot will stop issuing PRs if more than 3 Version bump PRs generated by the bot are open. If you don't want to package a particular version please close the PR.
NEW: If you want these PRs to be merged automatically, make an issue with @conda-forge-admin, please add bot automerge in the title and merge the resulting PR. This command will add our new bot automerge feature to your feedstock!
If this PR was opened in error or needs to be updated please add the bot-rerun label to this PR. The bot will close this PR and schedule another one. If you do not have permissions to add this label, you can use the phrase @<space/>conda-forge-admin, please rerun bot in a PR comment to have the conda-forge-admin add it for you.
This PR was created by the regro-cf-autotick-bot. The regro-cf-autotick-bot is a service to automatically track the dependency graph, migrate packages, and propose package version updates for conda-forge. If you would like a local version of this bot, you might consider using rever. Rever is a tool for automating software releases and forms the backbone of the bot's conda-forge PRing capability. Rever is both conda (conda install -c conda-forge rever) and pip (pip install re-ver) installable. Finally, feel free to drop us a line if there are any issues!
This PR was generated by https://github.com/regro/autotick-bot/actions/runs/662380783, please use this URL for debugging.
Here is a list of all the pending dependencies (and their versions) for this repo. Please double check all dependencies before merging.
Dependency Analysis
We couldn't run dependency analysis due to an internal error in the bot. :( Help is very welcome!