OpenMathLib / OpenBLAS

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
http://www.openblas.net
BSD 3-Clause "New" or "Revised" License
6.14k stars 1.46k forks source link

`bench_blas.py::test_gesdd` in Codspeed CI job #4776

Open rgommers opened 6 days ago

rgommers commented 6 days ago

This failure just showed up in CI, from this log:

 benchmarks/bench_blas.py ............................................... [ 75%]
  .........F.....                                                          [100%]
  ================================ 62 benchmarked ================================

  =================================== FAILURES ===================================
  ______________________________ test_gesdd[mn1-s] _______________________________

  benchmark = <pytest_codspeed.plugin.BenchmarkFixture object at 0x152f7980>
  mn = (1000, 222), variant = 's'

      @pytest.mark.parametrize('variant', ['s', 'd'])
      @pytest.mark.parametrize('mn', gesdd_sizes)
      def test_gesdd(benchmark, mn, variant):
          m, n = mn
          rndm = np.random.RandomState(1234)
          dtyp = dtype_map[variant]

          a = np.array(rndm.uniform(size=(m, n)), dtype=dtyp, order='F')

          gesdd_lwork = ow.get_func('gesdd_lwork', variant)

          lwork, info = gesdd_lwork(m, n)
          lwork = int(lwork)
          assert info == 0

          gesdd = ow.get_func('gesdd', variant)
          u, s, vt, info = benchmark(run_gesdd, a, lwork, gesdd)

  >       assert info == 0
  E       assert 12 == 0

  benchmarks/bench_blas.py:237: AssertionError
  =========================== short test summary info ============================
  FAILED benchmarks/bench_blas.py::test_gesdd[mn1-s] - assert 12 == 0
  ================== 1 failed, 61 passed in 2080.74s (0:34:40) ===================
   ** On entry to SLASCL parameter number  4 had an illegal value
   ** On entry to SLASCL parameter number  4 had an illegal value
   ** On entry to SLASCL parameter number  4 had an illegal value
   ** On entry to SLASCL parameter number  4 had an illegal value

This benchmark was introduced in gh-4678 one and a half months ago. I'm not sure it failed before like this. @ev-br you may want to look into this?

ev-br commented 6 days ago

Hm... Actually, it did fail like this before: https://github.com/OpenMathLib/OpenBLAS/actions/workflows/codspeed-bench.yml And the single-precision variant has been introduced in https://github.com/OpenMathLib/OpenBLAS/pull/4763, so it's relatively recent. Hm. Am taking a look.

ev-br commented 6 days ago

A few quick observations:

2) the mistake is almost certainly a failure of thread safety in the LAPACK implementation or in a Julia wrapper. I suspect someone is misusing a static or global variable.

The benchmark is using OPENBLAS_NUM_TREADS=1, so it's not thread safety but it's still deep in OpenBLAS.

All in all, this looks like a possibly genuine edge case in OpenBLAS (or reference LAPACK? not sure where the single-precision gesdd kernel comes from)

I'll try to smoke it out on CI without codspeed now. Meanwhile, maybe we should just remove the assertion for the time being. The assertion is not required for the benchmark itself; it's just generally nicer to benchmark correctly working code, they say.

EDIT: https://github.com/OpenMathLib/OpenBLAS/pull/4777

martin-frbg commented 6 days ago

Weird error, INFO=4 would mean this input argument (the denominator of the scale factor) is either NAN or zero

martin-frbg commented 6 days ago

There is one call graph where this input argument is provided by SNRM2...

ev-br commented 6 days ago

This is indeed codspeed-specific: https://github.com/OpenMathLib/OpenBLAS/actions/runs/9765188405/job/26955309472?pr=4777

I asked on their discord (https://discord.com/channels/1065233827569598464/1065686090452828251/threads/1257753281342738502 -- might need to join to see, no idea; will repost any insights here anyway).

Meanwhile, do we want to disable the assertion so that the runs are uploaded and we can at least see the flamegraphs?

martin-frbg commented 5 days ago

Guess it would make sense to disable the assertion, especially if the problem cannot be reproduced outside codspeed. (This is unlikely to be cpu-specific as there is only one assembly kernel for SNRM2 in use on all x86_64 targets (except the plain C "GENERIC" one). Or if it is, it would have to be due to compiler/assembler misbehaviour) Also the NRM2 code path in OpenBLAS runs single-threaded in any case, and the Reference-LAPACK codebase is not multithreaded except for a handful of routines that use OpenMP parallelization if compiled for it.

ev-br commented 5 days ago

okay, I repurposed https://github.com/OpenMathLib/OpenBLAS/pull/4777 to ignore the sgesdd failure.