conda-forge / scipy-feedstock

A conda-smithy repository for scipy.
BSD 3-Clause "New" or "Revised" License
6 stars 36 forks source link

Segfault on pypy on ppc64le #125

Closed isuruf closed 4 years ago

isuruf commented 4 years ago

Not sure if this is a packaging issue or an upstream issue

isuruf commented 4 years ago

For packages using scipy, add the following,

skip: True  # [python_impl=="pypy" and ppc64le]

and do a rerender.

h-vetinari commented 4 years ago

Hey @conda-forge/pypy3-6 @conda-forge/pypy3-5 @conda-forge/scipy.

Since the pypy+ppc64le build was skipped due to the mentioned segfault, we're wondering how this should be handled in libraries downstream of scipy (e.g. conda-forge/osqp-feedstock#22, conda-forge/ecos-feedstock#9, + about a zillion more ;-)). Quoting myself from here at @isuruf's request:

Is it in the interest of the migration to skip all downstream dependencies of scipy on pypy+ppc64le? Wouldn't this then need another migration to catch up? I'm happy to either skip the build or wait a bit to see íf the scipy build can be fixed.

isuruf commented 4 years ago

cc @conda-forge/help-pypy

isuruf commented 4 years ago

Also @conda-forge/bot

beckermr commented 4 years ago

We still do not have a good solution for recipes that cannot be built and block a full subtree of the migration.

mattip commented 4 years ago

I am a bit lost in the PRs and CI runs. Where is a record of the scipy segfault? I have access to a gcc buildfarm ppc64le machine, maybe I can try to reproduce it. Do we know if this is using any BLAS acceleration and what version of gcc is in play?

isuruf commented 4 years ago

Log is at https://travis-ci.com/github/conda-forge/scipy-feedstock/jobs/306968829 This is using the reference lapack implementation (pure fortran). gcc 7.3.0

isuruf commented 4 years ago

Wait, this is gcc 8.2.0

isuruf commented 4 years ago

On second thought that might be the problem. Retrying with -fno-optimize-sibling-calls

mattip commented 4 years ago

Looking at the segfault, it is in _pocketfft. In compilation there is a warning note: the ABI of passing aggregates with 16-byte alignment has changed in GCC 5 for the function 'void pocketfft::detail::PM(T&, T&, T, T) [with T = pocketfft::detail::cmplx<long double>]'. Maybe there is a stack corruption somewhere that PyPy exposes?

mattip commented 4 years ago

Looking up that error, one of the top search results is in the discussion on numpy https://github.com/numpy/numpy/issues/10281. I don't think PyPy knows anything about long double , much less on ppc64le, but maybe there is a problem with _pocketfft somewhere making assumptions about the size of long double?

rgommers commented 4 years ago

@peterbell10 do you know the answer to the above pocketfft question?

peterbell10 commented 4 years ago

pocketfft shouldn't make any assumptions. The SciPy CI on its own covers 8, 12 and 16 byte long double variants without issue. That includes ppcle with CPython.

mattip commented 4 years ago

@isuruf could you add -v to the scipy test run so we can see exactly which test is causing the segfault? On x86 in NumPy we get a nice C stacktrace when there is a segfault, is there something like that on ppc64le? Maybe a -g compiler flag?

mattip commented 4 years ago

That includes ppcle with CPython.

Hmm, maybe we should re-enable the PyPy CI run(s) on scipy/scipy, although they were annoying both because PyPy is slower and they were flaky.

isuruf commented 4 years ago

Here's another run that shows which tests crashed. https://travis-ci.com/github/conda-forge/scipy-feedstock/jobs/302140872#L13359-L13418

mattip commented 4 years ago

Interesting: the crashes seem to mostly be in fft/_pocketfft/tests/test_real_transforms.py, although the error reporting terminated due to too many errors.

mattip commented 4 years ago

I can reproduce a segfault, maybe the same one. It seems to be an actual PyPy bug, since running with --jit off seems to pass the tests. It might be some interaction between pybind11's generated code and the JIT.

isuruf commented 4 years ago

@mattip, is there a way to set --jit off using a env variable so that I can run pytest and see if that helps?

mattip commented 4 years ago

It seems not. You could alias python to pypy --jit off if you really have alot of time, it is very slow that way. We found a failure in PyPy and I am testing the fix now.

mattip commented 4 years ago

Still segfaulting after a fix in PyPy. It is strange the _pocketfft tests are crashing here when the same tests pass in numpy.

isuruf commented 4 years ago

I can confirm that turning off the jit and running the pocketfft test suite fixes the segfault with the conda package. I'll run the full test suite on a OSU OSL machine and report back.

rgommers commented 4 years ago

It is strange the _pocketfft tests are crashing here when the same tests pass in numpy.

It's not the same code - NumPy version is C, SciPy version is more recent C++

mattip commented 4 years ago

PyPy found a problem, the fix is specific to JITted code on ppc. The fix will be part of tomorrow's nightly and also part of 7.3.1

h-vetinari commented 4 years ago

@mattip How far out is 7.3.1?

@conda-forge/help-pypy @conda-forge/bot I'm guessing it makes sense to wait with the migration of scipy-dependent packages then? Or do you want to skip ppc to get pypy (packages) to users faster on other platforms?

mattip commented 4 years ago

hopefully within two weeks

beckermr commented 4 years ago

then we wait I think

isuruf commented 4 years ago

Or we can port the fix to 7.3.0 if anyone is interested.

isuruf commented 4 years ago

Thanks @mattip. I couldn't run the full test suite on travis due to limits (40 mins for cpython, >50 minutes for pypy). I did run the full test suite locally on a power8 machine and all tests passed with reference lapack. (OpenBLAS is another story)