apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[Python] Segmentation fault occurs on libarrow load when using the pyarrow 17.0.0 arm64 wheel #44342

Open vyasr opened 3 days ago

vyasr commented 3 days ago

Describe the bug, including details regarding any error messages, version, and platform.

Under some very specific set of circumstances, importing pyarrow 17.0.0 from an arm wheel triggers a segmentation fault. The error comes from the jemalloc function background_thread_entry that is statically linked into libarrow.so. I can see libarrow.so being opened via strace, and when I run under gdb I see the following backtrace:

[Detaching after vfork from child process 895]
[New Thread 0xfffe18fff1d0 (LWP 960)]
--Type <RET> for more, q to quit, c to continue without paging--c

Thread 128 "jemalloc_bg_thd" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xfffe18fff1d0 (LWP 960)]
0x0000fffe1b2d2844 in background_thread_entry () from /pyenv/versions/3.12.6/lib/python3.12/site-packages/pyarrow/libarrow.so.1700

(gdb) backtrace
#0  0x0000fffe122f1844 in background_thread_entry () from /pyenv/versions/3.12.6/lib/python3.12/site-packages/pyarrow/libarrow.so.1700
#1  0x0000ffff94a3a624 in start_thread (arg=0xfffe122f17e0 <background_thread_entry>) at pthread_create.c:477
#2  0x0000ffff94b3562c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78
(gdb) bt full
#0  0x0000fffe122f1844 in background_thread_entry () from /pyenv/versions/3.12.6/lib/python3.12/site-packages/pyarrow/libarrow.so.1700
No symbol table info available.
#1  0x0000ffff94a3a624 in start_thread (arg=0xfffe122f17e0 <background_thread_entry>) at pthread_create.c:477
        ret = <optimized out>
        pd = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {281466655208940, 281474517841168, 281474517841166, 281473175642112, 281474517841167, 281466691852256,
                281466655209680, 281466655207888, 281473175646208, 281466655207888, 281466655205808, 118832585594287181, 0, 118832583903213793, 0, 0, 0,
                0, 0, 0, 0, 0}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
#2  0x0000ffff94b3562c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78
No locals.

This error is quite difficult to reproduce. In addition to only observing this particular issue with the pyarrow 17.0.0 release (the issue vanishes if I downgrade to an earlier version) and only when testing on arm architectures, it is also highly sensitive to the exact order of prior operations. In my application I load multiple Python extension modules before importing pyarrow, and the order of those imports affects whether or not this issue manifests. The cases where the issue arises do manifest reliably, so it is not a flaky error, but simply adding an unrelated extra import or reordering unrelated imports is often sufficient to make the problem vanish. I attempted to rebuild libarrow.so using the same flags used to build the wheel (I can't be sure that I got them all right though; I based my compilation on the flags in https://github.com/apache/arrow/blob/main/ci/scripts/python_wheel_manylinux_build.sh) and then preload the library, but that too caused the segmentation fault to disappear, so it's also unlikely that I can get debug symbols into the build in any useful way. I am attempting to reduce this to an MWE in https://github.com/rapidsai/cudf/pull/17022, but I am not very hopeful that it can be reduced all that far.

Component(s)

Python

kou commented 3 days ago

Could you also share the result of thread apply all bt full?

Is there any other Python extension module that also uses jemalloc?

vyasr commented 3 days ago

The output is quite large, so I've attached it in a file. gdb.txt

None of the extensions that I built use jemalloc, but it's possible that something else being loaded into the environment does (e.g. numpy or scipy).
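One rough way to check which loaded libraries carry jemalloc is to list the shared objects mapped into the process after the earlier imports and scan each file for the string "jemalloc". The sketch below assumes Linux (`/proc/self/maps`); the function names are hypothetical, and a byte-string match is only a heuristic, since a statically linked and stripped copy may not contain the name at all.

```python
import re


def shared_objects_from_maps(maps_text):
    """Return the sorted .so paths named in /proc/<pid>/maps text."""
    paths = set()
    for line in maps_text.splitlines():
        # Fields: address perms offset dev inode [path]
        fields = line.split(None, 5)
        if len(fields) == 6:
            path = fields[5].strip()
            if re.search(r"\.so(\.\d+)*$", path):
                paths.add(path)
    return sorted(paths)


def mentions_jemalloc(path, needle=b"jemalloc"):
    """Heuristic: True if the file's bytes contain the needle."""
    try:
        with open(path, "rb") as f:
            return needle in f.read()
    except OSError:
        return False


def jemalloc_candidates():
    """Loaded shared objects (Linux only) whose bytes mention jemalloc."""
    with open("/proc/self/maps") as f:
        loaded = shared_objects_from_maps(f.read())
    return [p for p in loaded if mentions_jemalloc(p)]
```

Calling `jemalloc_candidates()` right after the imports that precede `import pyarrow` would show whether anything already in the process mentions jemalloc.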

kou commented 3 days ago

Thanks but sorry. I couldn't find any hints in the thread apply all bt full result...

pitrou commented 2 days ago

Hi @vyasr, jemalloc_bg_thd is a jemalloc thread. When searching online, there seem to be issues with jemalloc on Linux aarch64, see https://github.com/jemalloc/jemalloc/issues/467 for example.

I would recommend you switch to mimalloc instead of jemalloc, see https://arrow.apache.org/docs/cpp/memory.html#default-memory-pool

Note that mimalloc becomes the default in 18.0.0 as well (see #43254).

On our side, perhaps we should simply disable jemalloc on Linux aarch64 wheels? @raulcd
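A minimal sketch of the suggested switch, for testing only: the Arrow memory documentation linked above describes the `ARROW_DEFAULT_MEMORY_POOL` environment variable, which has to be set before libarrow is loaded, so the sketch sets it for a fresh child interpreter. The helper names are hypothetical, and the default snippet assumes pyarrow is installed in the child's environment.

```python
import os
import subprocess
import sys

# Snippet run in the child; prints the memory pool backend actually in use.
CHECK_SNIPPET = "import pyarrow as pa; print(pa.default_memory_pool().backend_name)"


def env_with_pool(backend, base_env=None):
    """Copy the environment and request the given Arrow memory pool backend."""
    env = dict(os.environ if base_env is None else base_env)
    env["ARROW_DEFAULT_MEMORY_POOL"] = backend
    return env


def run_with_pool(backend, snippet=CHECK_SNIPPET):
    """Run a snippet in a fresh interpreter with the backend selected."""
    return subprocess.run(
        [sys.executable, "-c", snippet],
        env=env_with_pool(backend),
        capture_output=True,
        text=True,
    )
```

With a wheel that ships mimalloc, `run_with_pool("mimalloc").stdout` would be expected to name that backend; a negative `returncode` would mean the child was killed by a signal.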

vyasr commented 2 days ago

> Thanks but sorry. I couldn't find any hints in the thread apply all bt full result...

No problem @kou, I know these kinds of issues can be a huge pain to track down, especially from this limited information.

If it helps, you can see the error in this GHA run on this PR.

> When searching online, there seem to be issues with jemalloc on Linux aarch64, see https://github.com/jemalloc/jemalloc/issues/467 for example.

@pitrou thanks for finding that! That makes sense since it certainly seems like the underlying issue comes from jemalloc and is not arrow-specific.

> I would recommend you switch to mimalloc instead of jemalloc

Good idea, at least for testing. I'm testing that now in this GH workflow. The arm wheel-tests-cudf job is the one to look out for, let's see if using mimalloc bypasses the issue. That being said:

> Note that mimalloc becomes the default in 18.0.0 as well (see https://github.com/apache/arrow/issues/43254). On our side, perhaps we should simply disable jemalloc on Linux aarch64 wheels?

This seems like the right long-term solution if your suggestion to try mimalloc works for me above. pyarrow is a common enough dependency that a user could end up having pyarrow loaded in their environment without even realizing it, and if the import alone is sufficient to trigger the seg fault it would be quite challenging for the average user to debug. Making mimalloc the default seems sufficient to me since IMHO it's reasonable to expect a user explicitly setting the allocator to recognize this as a potential cause, but I wouldn't be opposed to disabling jemalloc altogether on arm either.

vyasr commented 2 days ago

Hmm, @pitrou I still see segfaults in the job that I linked above. Am I configuring the allocator in the correct way in https://github.com/rapidsai/cudf/pull/17022/commits/635b5e0eceeb63460bd28c1d3655b6bd83a49cc1? If so, that suggests that there is an issue with jemalloc that occurs by simply loading the relevant parts of the binary even if no allocation subroutine is invoked, in which case building aarch64 wheels without jemalloc is definitely the way to go because this is beyond the realm of user configuration.

kou commented 2 days ago

It seems that the https://github.com/jemalloc/jemalloc/issues/467 problem was solved by https://github.com/apache/arrow/pull/10940.

kou commented 2 days ago

Could you try a nightly wheel, which uses mimalloc by default? https://arrow.apache.org/docs/developers/python.html#installing-nightly-packages

pitrou commented 1 day ago

> If so, that suggests that there is an issue with jemalloc that occurs by simply loading the relevant parts of the binary even if no allocation subroutine is invoked, in which case building aarch64 wheels without jemalloc is definitely the way to go because this is beyond the realm of user configuration.

Ah, that might be the case indeed, if the crash occurs right when importing PyArrow :(

raulcd commented 1 day ago

@vyasr is there any way to validate the issue has gone away with the nightly wheels? https://anaconda.org/scientific-python-nightly-wheels/pyarrow/files

vyasr commented 1 day ago

I am happy to test out a nightly wheel, but unfortunately I'm not confident that it will tell us anything conclusive. As I mentioned above, in my use case I had a lot of difficulty constructing a true MWE because even small changes like defining a new variable, moving around my imports, or moving imports from one file into another but preserving the order (which still has some effect due to the logic for loading the importing module itself) were sufficient to change whether the error appeared or not, which suggests that some sort of process memory corruption is occurring when the DSO is loaded. As a result, since I assume the nightly wheels will have accumulated many changes since the 17.0.0 release, even if I don't observe the same error it may just be that the error is now simply being hidden by other changes. I can try a few different iterations with different modifications to my scripts to see what happens, though.

pitrou commented 1 day ago

Er, are you telling us that it's not simply import pyarrow that triggers the crash?

vyasr commented 1 day ago

If you're asking whether python -c "import pyarrow" will trigger the crash, then no, that does not crash for me. Quoting from above:

> This error is quite difficult to reproduce. In addition to only observing this particular issue with the pyarrow 17.0.0 release (the issue vanishes if I downgrade to an earlier version) and only when testing on arm architectures, it is also highly sensitive to the exact order of prior operations. In my application I load multiple Python extension modules before importing pyarrow, and the order of those imports affects whether or not this issue manifests. The cases where the issue arises do manifest reliably, so it is not a flaky error, but simply adding an unrelated extra import or reordering unrelated imports is often sufficient to make the problem vanish.

None of the modules that I directly control do any sort of relevant stateful initialization on import, but I cannot guarantee that the same is true for the other modules, so it is entirely possible that something in the stack (e.g. scipy) is doing some sort of initialization of a memory pool that introduces conflicting jemalloc symbols, or some other similar problem (it wouldn't actually be a symbol collision since IIUC libarrow does not make any of its jemalloc symbols publicly visible, but that's illustrative of the class of problems I mean). So roughly speaking, I have

import foo
import bar
... # Other imports
import pyarrow # this seg faults

and changing the sequence of import foo and import bar can change whether the seg fault appears.
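The order sensitivity described here can be probed mechanically by running each import order in a fresh interpreter and recording how it exits. This is a sketch only; the function names are hypothetical, and the module names passed in would be the real extension modules from the failing application. On POSIX, `subprocess` reports a child killed by a signal as a negative return code.

```python
import itertools
import signal
import subprocess
import sys


def run_import_order(modules):
    """Import the modules in order in a fresh interpreter; return its exit code."""
    code = "; ".join(f"import {name}" for name in modules)
    return subprocess.run([sys.executable, "-c", code]).returncode


def describe(returncode):
    """Name the signal for a negative return code, else report the exit status."""
    if returncode < 0:
        return signal.Signals(-returncode).name
    return f"exit {returncode}"


def try_import_orders(earlier_modules, final_module):
    """Try every ordering of the earlier imports, importing final_module last."""
    results = {}
    for order in itertools.permutations(earlier_modules):
        results[order] = describe(run_import_order([*order, final_module]))
    return results
```

For example, `try_import_orders(["foo", "bar"], "pyarrow")` would return a mapping from each ordering to either `exit 0` or a signal name such as `SIGSEGV`. The number of orderings grows factorially, so this is only practical for a handful of modules.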

pitrou commented 1 day ago

So, perhaps there's nothing particular that we should do in PyArrow?

pitrou commented 1 day ago

(at least if you could git bisect and find out when precisely the issue starts happening with PyArrow, that could perhaps give a clue)
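If equivalent wheels can be built per commit, `git bisect run` can drive the search. It treats exit status 0 as good, 125 as "cannot test, skip", and any other status from 1 to 127 as bad, so a reproducer killed by a signal has to be mapped to a plain bad status. The following is a sketch under those rules; the reproducer command and the build command are placeholders (assumptions), not part of the Arrow tooling.

```python
import subprocess
import sys

GOOD, BAD, SKIP = 0, 1, 125  # exit statuses understood by `git bisect run`

# Hypothetical reproducer; replace with the import sequence that crashes.
DEFAULT_REPRODUCER = [sys.executable, "-c", "import pyarrow"]


def bisect_status(build_ok, returncode):
    """Map a build result and a reproducer exit code to a bisect status."""
    if not build_ok:
        return SKIP  # this commit cannot be tested
    if returncode == 0:
        return GOOD
    return BAD  # non-zero exit, or death by signal (negative code)


def check_commit(reproducer_argv=DEFAULT_REPRODUCER, build_argv=None):
    """Optionally build, then run the reproducer and return a bisect status."""
    if build_argv is not None and subprocess.run(build_argv).returncode != 0:
        return SKIP
    return bisect_status(True, subprocess.run(reproducer_argv).returncode)
```

Saved as a script that ends with `sys.exit(check_commit(...))`, this could be handed to `git bisect run` once the good and bad endpoints are marked.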

vyasr commented 1 day ago

Well OK, to my (pleasant) surprise upgrading to the latest nightly did not make the error vanish (well I suppose not pleasant that I have a seg fault, but at least pleasant that there's something reproducible happening):

root@g242-p33-0009:/repo# python -c "import cupy; import cudf;"
Segmentation fault (core dumped)
root@g242-p33-0009:/repo# python
Python 3.12.7 (main, Oct  4 2024, 15:35:43) [GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow
>>> pyarrow.__version__
'18.0.0.dev445'
>>> 

The backtrace is the same, still in jemalloc_bg_thd.

> So, perhaps there's nothing particular that we should do in PyArrow?

I think compiling out jemalloc or recompiling using the appropriate page size for arm could still make sense. While I haven't been able to reduce my example much further yet, the fact that pyarrow < 17.0.0 works while 17.0.0 and the 18 alphas both fail indicates that something meaningful has changed in the pyarrow binary and that anyone could hit it.

> (at least if you could git bisect and find out when precisely the issue starts happening with PyArrow, that could perhaps give a clue)

I would be happy to try that, but I would also need to be able to build pyarrow wheels that are equivalent to the ones your build process produces. As I mentioned above:

> attempted to rebuild libarrow.so using the same flags used to build the wheel (I can't be sure that I got them all right though; I based my compilation on the flags in https://github.com/apache/arrow/blob/main/ci/scripts/python_wheel_manylinux_build.sh) and then preload the library, but that too caused the segmentation fault to disappear

Since the latest pyarrow nightlies fail for me, that suggests that I was indeed not compiling exactly equivalent C++ to what you produce (or perhaps I was but there's also something in the Python build that's relevant since I simply LD_PRELOADed libarrow.so). The nightly index linked above unfortunately doesn't go back far enough for me to install nightlies in between 16.1 and 17 to see where the issue might have arisen.

kou commented 1 day ago

Could you try https://github.com/ursacomputing/crossbow/actions/runs/11285538259#artifacts (download the "wheel" artifact) that disables jemalloc?

pitrou commented 15 hours ago

Also, can you tell us which hardware exactly you're using, and what the default page size is?

And it would be nice if you could try to disassemble at the point of the crash.
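The architecture and default page size asked about above can be read from Python on the affected machine. A small sketch, assuming a POSIX system; `system_summary` is a hypothetical helper name.

```python
import os
import platform


def system_summary():
    """Collect the architecture, kernel release, and default page size (POSIX)."""
    return {
        "machine": platform.machine(),
        "kernel": platform.release(),
        "page_size": os.sysconf("SC_PAGE_SIZE"),
    }


print(system_summary())
```

On aarch64 the kernel may be configured with 4 KiB, 16 KiB, or 64 KiB pages, so the reported `page_size` is worth including alongside the hardware model.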