Qiskit / qiskit

Qiskit is an open-source SDK for working with quantum computers at the level of extended quantum circuits, operators, and primitives.
https://www.ibm.com/quantum/qiskit
Apache License 2.0

Spurious coverage changes reported on PRs #7235

Open jakelishman opened 3 years ago

jakelishman commented 3 years ago

Information

What is the current behavior?

Coveralls will often indicate spurious changes in coverage, due to randomness or to protective code not needing to run. For example, this comment indicates a change in qiskit/pulse/library/waveform.py, even though the PR didn't change that file or any of its tests.

Steps to reproduce the problem

Make a PR, and look at Coveralls' comments (if you're unlucky).

What is the expected behavior?

Test coverage should be deterministic and repeatable on every test run.

Suggested solutions

Code that is only included to smooth over dodgy test VMs and the like can be added to ignore lists. This includes some of the code in qiskit/test, but not much. Regardless, lines that are frequently dropped due to randomisation should gain more tests, so that the flaky paths are targeted directly.

jakelishman commented 3 years ago

Just collecting other times when I see spurious coverage changes, to try and make sure we're seeing all files:

Mercurial files:

javabster commented 3 years ago

the coveralls GitHub action generates coverage reports by running tox -ecoverage, so we should edit our coverage configs for this. I don't think we have a dedicated coverage config file with any existing ignore lists, so we would probably need to add one in tox.ini. Having a quick look at the coverage.py docs, it seems we could add a regex-based exclude_lines list, and/or add comments (# pragma: no cover) in the code itself to ignore specific lines. Based on your findings, are there any common regexes we could add to an ignore list?
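For illustration, a minimal section of the sort that could go in tox.ini (coverage.py reads sections with a coverage: prefix from tox.ini; the patterns here are just examples, not a vetted list):

[coverage:report]
exclude_lines =
    pragma: no cover
    raise NotImplementedError
    if TYPE_CHECKING:

Note that setting exclude_lines replaces the built-in default, so "pragma: no cover" has to be re-listed explicitly.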

levbishop commented 3 years ago

I think it's more likely that the main problem with the fluctuations is residual randomness in the tests. We've been gradually adding explicit seeds to the worst-offending tests outside of the randomized/ subdirectory, but I'm sure there are plenty that were missed.

Adding exclude_lines and pragmas would, I think, be separately useful for getting to 100% coverage by avoiding "false positives" - for example, there's no point writing a test for a stub method raising NotImplementedError or (as Jake mentioned) type stubs. https://github.com/asottile/covdefaults is a plugin that sets some sensible-looking defaults to ignore this kind of thing.

jakelishman commented 3 years ago

Yeah, the type-hinting stubs thing isn't really the same as the other spurious coverage changes, but the repo Lev linked looks good for it. I don't know if we need the other things in that plugin, though, so we could just copy their defaults over into our pyproject.toml rather than adding another dependency.

It's probably going to be impossible to eliminate all randomness from tests and keep them that way, but we can also write some extra tests that specifically exercise both branches of the flaky bits of code.

1ucian0 commented 2 years ago

test.python.pulse.test_pulse_lib.TestWaveform.test_pulse_limits seems to be nondeterministic. It can be reproduced as:

for i in `seq 10`; do python -m unittest -v test.python.pulse.test_pulse_lib.TestWaveform.test_pulse_limits; done

Some of the executions error with:

qiskit.pulse.exceptions.PulseError: 'Pulse contains sample with norm 1.0000000000000002 greater than 1+epsilon. This can be overruled by setting Pulse.limit_amplitude.'

hristog commented 2 years ago

Hi @1ucian0 - I've been reading the discussion here, but I've been unable to reproduce your test results. I ran the same test 1500 times and it produced no failures.

Initially, I was wondering whether the class variable limit_amplitude could have introduced some flakiness, due to it being modified in different tests, and also several times within the same test (in general, class variables defined and modified in this way are a nightmare to test sanely, due to coupling with execution order, etc.), but then I read the discussion further and realised the failures occurred even in isolation.
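To illustrate the kind of coupling I mean (hypothetical names, not the actual Waveform code):

class Pulse:
    limit_amplitude = True  # class variable shared by every instance and subclass

def test_a():
    Pulse.limit_amplitude = False  # mutates state global to the whole suite...

def test_b():
    # ...so this assertion passes or fails depending on whether test_a ran first
    assert Pulse.limit_amplitude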

Could something have been updated elsewhere since your comment was posted? If you wouldn't mind, could you please try to reproduce your previous results? Another possible explanation might be a difference in NumPy versions (resulting in different floating-point results), due to how the version requirement is expressed.

Update: I've attempted (but been unable) to reproduce the failures with different versions of NumPy (1.17.0, 1.19.0, 1.20.1, 1.21.6), and also by testing the entire test_pulse_lib.py module with both unittest and pytest.

Thanks!

1ucian0 commented 2 years ago

I can still reproduce this issue. Here is my env:

@hristog it does not look like it depends on the NumPy version. I could reproduce with 1.21.6 and a fresh env. Maybe the Python version?

hristog commented 2 years ago

@1ucian0, thanks a lot for the further details!

I confirm I haven't attempted with Python 3.9.12, nor with different versions of SciPy and SymPy. I tried different qiskit-terra commits, including ones from around the time you posted your results, but that didn't result in any test failures either.

I'll try to reproduce with your env versions when I get a chance, and I'll report back here if I have any thoughts that might contribute constructively to the discussion.

hristog commented 2 years ago

Hi again, @1ucian0 - I'm still unable to reproduce (even with the most recent env you've conveniently specified).

Just to clarify - I'm attempting to reproduce this issue via

for i in `seq 10`; do python -m unittest -v test.python.pulse.test_pulse_lib.TestWaveform.test_pulse_limits; done

as per the linked comment. By not having been able to reproduce the issue, I mean I haven't been able to run into any test failures whatsoever.

jakelishman commented 2 years ago

It's possible it's related to the version of BLAS in use by Numpy, or to the particular CPU instructions available (SIMD vs non-SIMD), since it looks like we're 1 ULP away from tolerance. If so, it's possibly not going to be reproducible for you, @hristog.

For what it's worth, I've got a Mac with Python 3.9, Numpy 1.22.3 and the following BLAS/CPU instructions:

In [8]: np.show_config()
openblas64__info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None)]
    runtime_library_dirs = ['/usr/local/lib']
blas_ilp64_opt_info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None)]
    runtime_library_dirs = ['/usr/local/lib']
openblas64__lapack_info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None), ('HAVE_LAPACKE', None)]
    runtime_library_dirs = ['/usr/local/lib']
lapack_ilp64_opt_info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None), ('HAVE_LAPACKE', None)]
    runtime_library_dirs = ['/usr/local/lib']
Supported SIMD extensions in this NumPy install:
    baseline = SSE,SSE2,SSE3
    found = SSSE3,SSE41,POPCNT,SSE42,AVX,F16C,FMA3,AVX2,AVX512F,AVX512CD,AVX512_SKX,AVX512_CLX,AVX512_CNL,AVX512_ICL
    not found = AVX512_KNL

and I can also reproduce the errors occasionally.

1ucian0 commented 2 years ago

In this script, I tried to include the minimum needed to reproduce the nondeterministic behaviour on my system:

for i in `seq 100`; do python bug_7235.py; done
....
True
....
False
...

bug_7235.zip

I have the same openblas config as @jakelishman.

1ucian0 commented 2 years ago

The problem seems to be in np.abs, when dealing with complex numbers.

bug_7235.zip

for i in `seq 100`; do python bug_7235.py; done
60.00000000000001 True
60.0 False
60.00000000000001 True

jakelishman commented 2 years ago

It's hard to say it's necessarily a bug in Numpy without more study. Certainly the exact floating-point bit outputs of Python's built-in abs and Numpy's absolute ufunc differ by up to 3 ULP (for this particular input), which definitely isn't ideal, but there's a good chance they're both accurate to their own specs. I'm not 100% sure how each proposes to handle these types, though - we'd need to look into the difference in semantics between raw Python complex and Numpy's np.complex128.
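For anyone who wants to probe this on their own machine, a minimal sketch (the sample value here is made up; whether the two results differ at all depends on the Numpy build and CPU):

import math
import numpy as np

z = complex(0.6, 0.8)  # made-up amplitude near the unit circle

py_abs = abs(z)                           # Python's built-in abs on complex
np_abs = float(np.abs(np.complex128(z)))  # Numpy's absolute ufunc

# Report the discrepancy in units of 1 ULP at 1.0
print(py_abs, np_abs, (py_abs - np_abs) / math.ulp(1.0))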

hristog commented 2 years ago

Hi again @1ucian0 @jakelishman!

The BLAS direction has been a good one to explore, because it does reveal differences between our envs (also at the hardware level). Prior to this, I was deliberating whether to create a Dockerfile to enable a more reproducible approach (as I suspected something might differ in my environment in terms of how Python or Qiskit Terra had been built). For completeness, my np.show_config() output is as follows (the same on two different machines, one running Ubuntu and the other WSL2):

blas_mkl_info:
  NOT AVAILABLE
blis_info:
  NOT AVAILABLE
openblas_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
lapack_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]

Regarding @1ucian0's example from this comment, as suspected, I'm getting 1000 Falses out of 1000.

jakelishman commented 2 years ago

I can reproduce the failures calling the absolute ufunc on single elements, which means BLAS probably shouldn't be involved, but for complex amplitudes it's quite possible that some SIMD trickery accounts for the minor floating-point differences.

If I were guessing, I'd reckon that either Numpy or Python is specialising the absolute ufunc down to something that uses a fused multiply-add instruction and the other isn't. Those have different rounding behaviours at the FPU level (a fused multiply-add rounds only once, but a separate multiply then add involves two FP operations, so rounds twice), and since the differences seem to involve 1 ULP (before taking the difference between the two values), that looks like it lines up to me.

I know Numpy does some specialised magic based on the available CPU instructions, but we could all be using Pythons compiled for different instruction sets as well, so I don't know which way round it is.

Either way, the most likely solution is probably to relax the norm checker to allow an extra tolerance of 1 ULP. Since we're comparing to a norm of 1.0, 1 ULP is easily accessible as sys.float_info.epsilon, so we could change the test to clip = possible < 1.0 + epsilon + sys.float_info.epsilon or something.
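As a sketch of what I mean (hypothetical helper name - the real check lives inside Waveform):

import sys
import numpy as np

def exceeds_norm_limit(samples, epsilon=0.0):
    """Check whether any sample norm exceeds 1 + epsilon, allowing one
    extra ULP of slack for platform-dependent rounding in np.abs."""
    norms = np.abs(np.asarray(samples, dtype=np.complex128))
    return bool(np.any(norms > 1.0 + epsilon + sys.float_info.epsilon))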

jakelishman commented 2 years ago

There's a logical square root after the real * real + imag * imag operation in an abs calculation, but sqrt(1 + x) == 1 + 0.5 * x if x <= 2 * float_epsilon (there are extra non-zero terms only when x is larger), so I would guess that the sqrt isn't affecting things (if anything, it possibly helps more than hurts).
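A quick numerical check of that identity at the boundary case, in double precision:

import math
import sys

eps = sys.float_info.epsilon
x = 2 * eps
print(math.sqrt(1 + x) == 1 + 0.5 * x)  # True: the sqrt rounds to exactly 1 + eps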

levbishop commented 2 years ago

I suspect some kind of manifestation of GCC bug 323 behavior (FPU excess precision not stored).

hristog commented 2 years ago

For comparison and completeness purposes, my GCC version is gcc (Debian 8.3.0-6) 8.3.0. It seems to me, however, that the referenced "Bug 323 - optimized code gives strange floating point results" dates back as far as 2000. I appreciate that in this case we may well be facing a related/derivative GCC bug, not necessarily 323 per se.

hristog commented 2 years ago

Either way, the most likely solution is probably to relax the norm checker to allow an extra tolerance of 1 ULP. Since we're comparing to a norm of 1.0, 1 ULP is easily accessible as sys.float_info.epsilon, so we could change the test to clip = possible < 1.0 + epsilon + sys.float_info.epsilon or something.

@jakelishman, would the change only be on the unit-tests' side? I could've misunderstood, but it seems to me some changes to the following method would be necessary: https://github.com/Qiskit/qiskit-terra/blob/f960d05ab664665d2fdcbe4060d9f17b09274a16/qiskit/pulse/library/waveform.py#L56

jakelishman commented 2 years ago

It's definitely not the tests that would need changing - it'd be in the method you've highlighted, yeah. I think we need to ensure that if to_clip triggers, then the absolute values of the samples are strictly capped at 1.0, which might mean doing np.clip(np.abs(clipped_samples), 0.0, 1.0).
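For illustration, one way to do that capping while preserving phase (a hypothetical helper, not the actual Waveform code):

import numpy as np

def cap_samples(samples):
    """Rescale any sample whose magnitude exceeds 1.0 back onto the unit
    circle, preserving its phase."""
    samples = np.array(samples, dtype=np.complex128)  # copy; don't mutate the input
    norms = np.abs(samples)
    over = norms > 1.0
    samples[over] /= norms[over]  # magnitude becomes (almost exactly) 1.0
    return samples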

It would be good to open up a separate issue about this - the potential test failure is different to the original issue here (which was just about flaky coverage).

hristog commented 2 years ago

I've created https://github.com/Qiskit/qiskit-terra/issues/8163 for the latter. It's still unclear to me where the non-determinism of the test failures - given repeated executions of the same test - originates (assuming no other tests are run at all as part of each test run).