NNPDF / eko

Evolution Kernel Operators
https://eko.readthedocs.io
GNU General Public License v3.0

Segmentation fault when running benchmark after rustification #381

Open tgiani opened 1 month ago

tgiani commented 1 month ago

When running benchmarks after rustification I get a segmentation fault error.

To Reproduce

  1. ./rustify.sh
  2. poe compile
  3. poe lha -m "ffns and lo and not sv"

Expected behavior

On my machine the code crashes with `Fatal Python error: Segmentation fault`.

alecandido commented 1 month ago

This is for sure related to #369, but in these cases it is better to give the exact commit, to make the issue reproducible.

Rust is quite aggressive in preventing memory issues, and Python code on its own cannot cause a segmentation fault (it should come from an underlying compiled extension).

So, most likely the problem is generated in some unsafe code, and it should all be in a single file:

  • https://github.com/NNPDF/eko/blob/b7d4e8d3217f87597984af553825ad7f15d4739e/crates/eko/src/lib.rs#L12
  • https://github.com/NNPDF/eko/blob/b7d4e8d3217f87597984af553825ad7f15d4739e/crates/eko/src/lib.rs#L77
  • https://github.com/NNPDF/eko/blob/b7d4e8d3217f87597984af553825ad7f15d4739e/crates/eko/src/lib.rs#L132
  • https://github.com/NNPDF/eko/blob/b7d4e8d3217f87597984af553825ad7f15d4739e/crates/eko/src/lib.rs#L166

(it's the code used to bridge the Rust callables with the SciPy quadrature integration)
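
For reference, a minimal sketch of that bridging pattern, with hypothetical names and fields (this is not the actual lib.rs code, and edition 2021 is assumed): a Rust integrand is exposed with the C calling convention and receives its parameters through an opaque pointer, which is exactly where an invalid pointer or a layout mismatch can turn into a segfault.

```rust
use std::ffi::c_void;

/// Hypothetical parameter bag; the Python side must build a bit-compatible layout.
#[repr(C)]
pub struct KerArgs {
    pub logx: f64,
    pub order: u32,
}

/// Integrand with the C calling convention, callable through ctypes/cffi
/// and hence usable from a SciPy-driven quadrature loop.
///
/// # Safety
/// `raw_args` must point to a valid `KerArgs` with exactly this layout;
/// otherwise the dereference below is undefined behaviour (e.g. a segfault).
#[no_mangle]
pub unsafe extern "C" fn quad_ker_sketch(u: f64, raw_args: *mut c_void) -> f64 {
    let args = &*(raw_args as *const KerArgs);
    // placeholder computation standing in for the real kernel evaluation
    u * args.logx + f64::from(args.order)
}
```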

felixhekhorn commented 1 month ago

@tgiani can you please check on master?


@niclaurenti @andreab1997 @giacomomagni can someone with a Mac please check this on b7d4e8d?


@alecandido we have some hints that this is actually related to this line (which is on master) https://github.com/NNPDF/eko/blob/2348778fa7a9688a0a482f7ed0e3baefbd816722/src/eko/evolution_operator/quad_ker.py#L103 i.e. it seems related to the areas definition; the workflow is as follows (a sketch of this round trip is given after the list):

  1. we define the areas in Python as a 2D array
  2. the array is passed via FFI to Rust https://github.com/NNPDF/eko/blob/2348778fa7a9688a0a482f7ed0e3baefbd816722/src/eko/evolution_operator/__init__.py.patch#L641-L643
  3. the array is passed untouched back to Python https://github.com/NNPDF/eko/blob/2348778fa7a9688a0a482f7ed0e3baefbd816722/crates/eko/src/lib.rs#L61
  4. the array is read back in Python with the line above
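
To make steps 2-4 concrete, here is a hedged sketch (hypothetical names, not the actual eko code) of how a 2D array can be carried across the boundary as a raw pointer plus shape, and where a wrong shape or a pointer into memory that Python has already freed leads straight to a segfault.

```rust
use std::ffi::c_void;

/// Hypothetical C-compatible view of the Python-side 2D `areas` array.
#[repr(C)]
pub struct Areas {
    pub data: *const f64, // flattened row-major buffer owned by Python
    pub rows: usize,
    pub cols: usize,
}

/// Sketch of the Rust side reading the array it was handed back untouched.
///
/// # Safety
/// `raw` must point to a valid `Areas`, and `data` must point to
/// `rows * cols` readable f64 values that stay alive for the whole call;
/// if Python already dropped the temporary or the shape is wrong, this
/// reads invalid memory.
#[no_mangle]
pub unsafe extern "C" fn sum_areas(raw: *const c_void) -> f64 {
    let areas = &*(raw as *const Areas);
    let flat = std::slice::from_raw_parts(areas.data, areas.rows * areas.cols);
    flat.chunks(areas.cols).map(|row| row.iter().sum::<f64>()).sum()
}
```
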
tgiani commented 1 month ago

@felixhekhorn I have the same problem on master

alecandido commented 1 month ago

Here there are a bunch of raw pointers around, due to the SciPy interface.

However, in principle every structure should have a C-compatible representation, and every function should respect the C calling convention. So, if everything is compiled in a compatible way, and linked together, it's supposed to work.
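
Concretely (a sketch with made-up fields, not the eko structs), "C-compatible representation" and "C calling convention" translate into `#[repr(C)]` and `extern "C"` on the Rust side; a cheap way to catch layout drift is to assert the expected size and field offsets at compile time (using `std::mem::offset_of!`, which needs a recent Rust).

```rust
/// Without #[repr(C)] the Rust compiler is free to reorder fields,
/// so the ctypes side could read garbage even though everything links fine.
#[repr(C)]
pub struct RawCb {
    pub order: u32,
    pub is_singlet: u8, // 0/1 flag
    pub logx: f64,
}

// Compile-time guards against silent layout changes (values assume a
// typical 64-bit platform with 8-byte alignment for f64).
const _: () = assert!(std::mem::size_of::<RawCb>() == 16);
const _: () = assert!(std::mem::offset_of!(RawCb, logx) == 8);
```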

However, I fear it could somehow be a Mac-related problem, or even a Conda-related one, so I'm trying to test on both Linux and macOS (but I'm using Nix, not Conda, so let's see what the outcome will be).

tgiani commented 1 month ago

@alecandido I'm not using Conda here, just Poetry, so probably that's not the problem.

alecandido commented 1 month ago

On 1a43a7d4c3d72f312c2ea71c2611342b6f390dc0 (current #369), I'm actually passing on Linux, while on macOS I'm getting the following error:

benchmarks/lha_paper_bench.py  ────────────────────────────────────────
  Theories: 1 OCards: 1 PDFs: 1 ext: LHA
 ────────────────────────────────────────
Computing for theory=88d687f, ocard=8beae75 and pdf=ToyLH ...
Exception ignored in: '<numba.core.cpu.CPUContext object at 0x1656c8cd0>'
Traceback (most recent call last):
  File "/Users/alessandro/Projects/pdf/eko/.venv/lib/python3.11/site-packages/numba/cpython/numbers.py", line 1100, in complex_div
    raise ZeroDivisionError("complex division by zero")
ZeroDivisionError: complex division by zero
Exception ignored in: '<numba.core.cpu.CPUContext object at 0x1656c8cd0>'
Traceback (most recent call last):
  File "/Users/alessandro/Projects/pdf/eko/.venv/lib/python3.11/site-packages/numba/cpython/numbers.py", line 1100, in complex_div
    raise ZeroDivisionError("complex division by zero")
ZeroDivisionError: complex division by zero

and it's getting stuck.

But I've not been able to reproduce the segfault yet.

felixhekhorn commented 1 month ago

> • (the so-called 1B$ mistake)

if you have suggestions for improvements, you're welcome :upside_down_face: however, communicating between languages using a third-party library will always be a hassle (by the way: the Rust quad_vec will have to support such a crazy (= real-life) use case)

> I'm actually passing on Linux

good! which makes me confident that the code is not completely crazy and ...

> However, I fear it could somehow be a Mac-related problem

I'm afraid so as well - which makes it hard for me to debug :see_no_evil: it is very strange that you get a different error and moreover a different error type, i.e. something that hints at a math problem ... (or maybe it is the same and you read some random numbers from somewhere?)

> and it's getting stuck.

As I fell into that trap myself: remember to debug with a single core.

alecandido commented 1 month ago

> if you have suggestions for improvements, you're welcome 🙃 however, communicating between languages using a third-party library will always be a hassle (by the way: the Rust quad_vec will have to support such a crazy (= real-life) use case)

Well, consider that everything JSON is capable of representing (strings, integers, floats, together with composition as unbounded lists and maps) is perfectly represented in the Rust type system, without the need of escaping to a void* hack. Otherwise, a general JSON loader would be impossible (though you may want to know your type at compile time anyhow, if possible: that's the magic of generics).
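
As a sketch of that claim (this essentially mirrors serde_json::Value, with names made up here): every JSON value maps onto a plain Rust type, so no void* is needed to hold "anything the caller might send".

```rust
use std::collections::HashMap;

/// A fully typed representation of any JSON document: strings, numbers,
/// booleans, plus composition as unbounded lists and maps.
pub enum Json {
    Null,
    Bool(bool),
    Number(f64),
    String(String),
    Array(Vec<Json>),
    Object(HashMap<String, Json>),
}

/// With generics, the concrete payload type can even be known at compile time,
/// e.g. a loader returning `T` instead of the dynamic `Json` value.
pub trait FromJson: Sized {
    fn from_json(value: &Json) -> Option<Self>;
}
```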

alecandido commented 1 month ago

> I'm afraid so as well - which makes it hard for me to debug 🙈 it is very strange that you get a different error and moreover a different error type, i.e. something that hints at a math problem ... (or maybe it is the same and you read some random numbers from somewhere?)

Yeah, when it becomes platform-dependent it is immediately more complex, not only because of the availability of devices (that's the main motivation behind my laptop, but not sufficient to solve problems in a snap...). Unfortunately, just trying here might not be enough, and we may have to start diving into the debugger, finding an efficient way to cross the language boundary[*] (possibly, with two different debug sessions at hand). I've never done something like that: I usually try to decouple the native library and the Python run with a thin wrapper - but here the wrapper itself cannot be thin. But we're not alone (NumPy and Python itself have to be debugged under the same conditions), so there should be a proper way (we could try to read their developers' guides).

[*]: unless we come up with a clever intuition (if you know the solution, you can always just apply it)

> As I fell into that trap myself: remember to debug with a single core.

I had this suspicion as well, but unfortunately I already tried, changing the value in the lha_paper_bench.py runner:

benchmarks/lha_paper_bench.py:

```diff
                     "mugrid": [100],
                     "ev_op_iterations": 10,
                     "interpolation_xgrid": lambertgrid(60).tolist(),
+                    "n_integration_cores": 1,
                 }
             ],
             ["ToyLH"],
```
felixhekhorn commented 1 month ago

> Well, consider that everything JSON is capable of representing is perfectly represented in the Rust type system

just to say that it is slightly more complicated :upside_down_face:

  1. we are also passing function pointers around, which do not fit inside JSON
  2. we are passing through SciPy, which - I think - always goes through void * (what else should you do for a general-purpose library?); a sketch of this pattern follows below
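
A sketch of point 2 (hypothetical names; the real call goes through SciPy's own machinery): a general-purpose integrator cannot know the caller's parameter type, so the conventional C-level escape hatch is a callback plus an opaque void* context, much like the `double func(double x, void *user_data)` signature accepted by SciPy's LowLevelCallable. This is the counterpart to the JSON example above: there the payload types are all known, here they are deliberately opaque.

```rust
use std::ffi::c_void;

/// The C-level shape of a general-purpose integrand: the library forwards the
/// opaque `user_data` pointer without ever knowing what it points to.
pub type Integrand = unsafe extern "C" fn(x: f64, user_data: *mut c_void) -> f64;

/// Toy midpoint rule, only to show how the opaque pointer travels through
/// a library that is generic over "whatever parameters the caller needs".
///
/// # Safety
/// `user_data` must be whatever pointer the callback `f` expects.
pub unsafe fn integrate(f: Integrand, user_data: *mut c_void, a: f64, b: f64, n: usize) -> f64 {
    let h = (b - a) / n as f64;
    let mut total = 0.0;
    for i in 0..n {
        let x = a + (i as f64 + 0.5) * h;
        total += f(x, user_data) * h;
    }
    total
}
```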