Open rgommers opened 6 days ago
Hm... Actually, it did fail like this before: https://github.com/OpenMathLib/OpenBLAS/actions/workflows/codspeed-bench.yml. And the single-precision variant was introduced in https://github.com/OpenMathLib/OpenBLAS/pull/4763, so it's relatively recent. Hm. Am taking a look.
A few quick observations:
2) the mistake is almost certainly a failure of thread safety in the LAPACK implementation or in a Julia wrapper. I suspect someone is misusing a static or global variable.
The benchmark is using OPENBLAS_NUM_THREADS=1, so it's not a thread-safety issue, but it's still deep in OpenBLAS.
develop branch. All in all, this looks like a possibly genuine edge case in OpenBLAS (or reference LAPACK? Not sure where the single-precision gesdd kernel comes from).
I'll try to smoke it out on CI without codspeed now. Meanwhile, maybe we should just remove the assertion for the time being. The assertion is not required for the benchmark itself; it's just generally nicer to benchmark correctly working code, they say.
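One way to keep the check without letting it abort the benchmark run is to downgrade it to a warning. The following is a minimal sketch, not the actual benchmark code: the helper name `check_info` and its `strict` flag are hypothetical, and it only assumes the usual LAPACK convention that a return status of 0 means success.

```python
import warnings


def check_info(info: int, strict: bool = False) -> bool:
    """Hypothetical helper: report a non-zero LAPACK-style status.

    Returns True when info == 0. Otherwise raises AssertionError if
    strict is set, or emits a RuntimeWarning and returns False.
    """
    if info == 0:
        return True
    msg = f"sgesdd returned info={info}"
    if strict:
        raise AssertionError(msg)
    # Non-strict mode: keep the run alive so results can still be uploaded.
    warnings.warn(msg, RuntimeWarning)
    return False
```

With `strict=False` the benchmark would still complete and upload its results, while the warning keeps the failure visible in the log.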
Weird error: INFO=4 would mean this input argument (the denominator of the scale factor) is either NaN or zero.
There is one call graph where this input argument is provided by SNRM2...
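The kind of argument check being discussed can be sketched as follows. This is only an illustration, assuming the failing routine validates its inputs the way the reference LAPACK scaling routine xLASCL does, with the denominator `cfrom` as the 4th argument and the numerator `cto` as the 5th. The function name `offending_scale_arg` is hypothetical and not part of OpenBLAS.

```python
import math


def offending_scale_arg(cfrom: float, cto: float) -> int:
    """Hypothetical stand-in for the argument validation of a scaling routine.

    The routine multiplies a matrix by cto / cfrom. Returns 0 if both
    scalars are acceptable, otherwise the 1-based position of the
    offending argument (4 for cfrom, 5 for cto), following the assumed
    xLASCL-style argument order.
    """
    if cfrom == 0.0 or math.isnan(cfrom):
        return 4  # denominator is zero or NaN
    if math.isnan(cto):
        return 5  # numerator is NaN
    return 0
```

Under this assumption, a norm of zero or NaN coming back from SNRM2 and being passed on as the denominator would trip the first branch.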
This is indeed codspeed-specific: https://github.com/OpenMathLib/OpenBLAS/actions/runs/9765188405/job/26955309472?pr=4777
I asked on their discord (https://discord.com/channels/1065233827569598464/1065686090452828251/threads/1257753281342738502 -- might need to join to see, no idea; will repost any insights here anyway).
Meanwhile, do we want to disable the assertion so that the runs are uploaded and we can at least see the flamegraphs?
Guess it would make sense to disable the assertion, especially if the problem cannot be reproduced outside codspeed. This is unlikely to be CPU-specific, as there is only one assembly kernel for SNRM2 in use on all x86_64 targets (except the plain C "GENERIC" one); if it is, it would have to be due to compiler/assembler misbehaviour. Also, the NRM2 code path in OpenBLAS runs single-threaded in any case, and the Reference-LAPACK codebase is not multithreaded except for a handful of routines that use OpenMP parallelization if compiled for it.
okay, I repurposed https://github.com/OpenMathLib/OpenBLAS/pull/4777 to ignore the sgesdd failure.
This failure just showed up in CI, from this log:
This benchmark was introduced in gh-4678 one and a half months ago. I'm not sure it failed before like this. @ev-br you may want to look into this?