Closed brandon-b-miller closed 1 month ago
Thanks for the updates - I added a couple more comments on the diff.
Thank you for the review :) I've made some changes that hopefully address your comments.
https://github.com/NVIDIA/numba-cuda/pull/23/commits/7c384a3b202f8e2e93514edb48d5c372b70a5f74 adds the relevant tests from pynvjitlink. For a short time they may exist in both places while we deprecate the patching API in pynvjitlink itself.
It occurs to me that the test inputs must still be built in the CI test jobs if we'd like to run the tests there, which now involves compiling a few files with nvcc. I think this is a problem I still need to solve in this PR.
The tests now pass, modulo what appears to be an intermittent error on the conda pynvjitlink test job. I do not seem to have permissions to rerun this job.
Running the test suite on my local system seems to deadlock at:
test_lock (numba.cuda.tests.cudapy.test_dispatcher.TestDispatcher.test_lock)
Test that (lazy) compiling from several threads at once doesn't ...
when pynvjitlink is enabled:
ENABLE_PYNVJITLINK=1 python -m numba.runtests numba.cuda.tests -v -m
Threads:
(gdb) info threads
Id Target Id Frame
* 1 Thread 0x70df86a90740 (LWP 128088) "python" __futex_abstimed_wait_common64 (private=<optimised out>,
cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x70de98395330) at ./nptl/futex-internal.c:57
2 Thread 0x70df83600640 (LWP 128089) "python" __futex_abstimed_wait_common64 (private=0, cancel=true,
abstime=0x0, op=393, expected=0, futex_word=0x70df859ceb60 <thread_status+96>) at ./nptl/futex-internal.c:57
3 Thread 0x70df82c00640 (LWP 128090) "python" __futex_abstimed_wait_common64 (private=0, cancel=true,
abstime=0x0, op=393, expected=0, futex_word=0x70df859cebe0 <thread_status+224>) at ./nptl/futex-internal.c:57
4 Thread 0x70df82200640 (LWP 128091) "python" __futex_abstimed_wait_common64 (private=0, cancel=true,
abstime=0x0, op=393, expected=0, futex_word=0x70df859cec60 <thread_status+352>) at ./nptl/futex-internal.c:57
5 Thread 0x70df81800640 (LWP 128092) "python" __futex_abstimed_wait_common64 (private=0, cancel=true,
abstime=0x0, op=393, expected=0, futex_word=0x70df859cece0 <thread_status+480>) at ./nptl/futex-internal.c:57
6 Thread 0x70df80e00640 (LWP 128093) "python" __futex_abstimed_wait_common64 (private=0, cancel=true,
abstime=0x0, op=393, expected=0, futex_word=0x70df859ced60 <thread_status+608>) at ./nptl/futex-internal.c:57
7 Thread 0x70df78400640 (LWP 128094) "python" __futex_abstimed_wait_common64 (private=0, cancel=true,
abstime=0x0, op=393, expected=0, futex_word=0x70df859cede0 <thread_status+736>) at ./nptl/futex-internal.c:57
8 Thread 0x70df77a00640 (LWP 128095) "python" __futex_abstimed_wait_common64 (private=0, cancel=true,
abstime=0x0, op=393, expected=0, futex_word=0x70df859cee60 <thread_status+864>) at ./nptl/futex-internal.c:57
9 Thread 0x70df67000640 (LWP 128096) "python" __futex_abstimed_wait_common64 (private=0, cancel=true,
abstime=0x0, op=393, expected=0, futex_word=0x70df859ceee0 <thread_status+992>) at ./nptl/futex-internal.c:57
10 Thread 0x70df66600640 (LWP 128097) "python" __futex_abstimed_wait_common64 (private=0, cancel=true,
abstime=0x0, op=393, expected=0, futex_word=0x70df859cef60 <thread_status+1120>) at ./nptl/futex-internal.c:57
11 Thread 0x70df45c00640 (LWP 128098) "python" __futex_abstimed_wait_common64 (private=0, cancel=true,
abstime=0x0, op=393, expected=0, futex_word=0x70df859cefe0 <thread_status+1248>) at ./nptl/futex-internal.c:57
12 Thread 0x70df45200640 (LWP 128099) "python" __futex_abstimed_wait_common64 (private=0, cancel=true,
abstime=0x0, op=393, expected=0, futex_word=0x70df859cf060 <thread_status+1376>) at ./nptl/futex-internal.c:57
13 Thread 0x70df34800640 (LWP 128100) "python" __futex_abstimed_wait_common64 (private=0, cancel=true,
abstime=0x0, op=393, expected=0, futex_word=0x70df859cf0e0 <thread_status+1504>) at ./nptl/futex-internal.c:57
14 Thread 0x70df33e00640 (LWP 128101) "python" __futex_abstimed_wait_common64 (private=0, cancel=true,
abstime=0x0, op=393, expected=0, futex_word=0x70df859cf160 <thread_status+1632>) at ./nptl/futex-internal.c:57
15 Thread 0x70df23400640 (LWP 128102) "python" __futex_abstimed_wait_common64 (private=0, cancel=true,
abstime=0x0, op=393, expected=0, futex_word=0x70df859cf1e0 <thread_status+1760>) at ./nptl/futex-internal.c:57
16 Thread 0x70df12a00640 (LWP 128103) "python" __futex_abstimed_wait_common64 (private=0, cancel=true,
abstime=0x0, op=393, expected=0, futex_word=0x70df859cf260 <thread_status+1888>) at ./nptl/futex-internal.c:57
17 Thread 0x70df0a000640 (LWP 128104) "python" __futex_abstimed_wait_common64 (private=0, cancel=true,
abstime=0x0, op=393, expected=0, futex_word=0x70df859cf2e0 <thread_status+2016>) at ./nptl/futex-internal.c:57
18 Thread 0x70df01600640 (LWP 128105) "python" __futex_abstimed_wait_common64 (private=0, cancel=true,
abstime=0x0, op=393, expected=0, futex_word=0x70df859cf360 <thread_status+2144>) at ./nptl/futex-internal.c:57
19 Thread 0x70def8c00640 (LWP 128106) "python" __futex_abstimed_wait_common64 (private=0, cancel=true,
abstime=0x0, op=393, expected=0, futex_word=0x70df859cf3e0 <thread_status+2272>) at ./nptl/futex-internal.c:57
20 Thread 0x70def0200640 (LWP 128107) "python" __futex_abstimed_wait_common64 (private=0, cancel=true,
abstime=0x0, op=393, expected=0, futex_word=0x70df859cf460 <thread_status+2400>) at ./nptl/futex-internal.c:57
21 Thread 0x70ded2200640 (LWP 128117) "cuda00001800007" 0x000070df86918bcf in __GI___poll (fds=0x60aeacfb6e10,
nfds=3, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
22 Thread 0x70dec3800640 (LWP 129038) "python" __futex_abstimed_wait_common64 (private=0, cancel=true,
abstime=0x0, op=393, expected=0, futex_word=0x60aead59e718) at ./nptl/futex-internal.c:57
23 Thread 0x70debaa00640 (LWP 129039) "cuda-EvtHandlr" 0x000070df86918bcf in __GI___poll (fds=0x70debc001cf0,
nfds=10, timeout=100) at ../sysdeps/unix/sysv/linux/poll.c:29
24 Thread 0x70dec4200640 (LWP 129045) "python" futex_wait (private=0, expected=2,
futex_word=0x60aeaccc5d58) at ../sysdeps/nptl/futex-internal.h:146
25 Thread 0x70debb400640 (LWP 129046) "python" __futex_abstimed_wait_common64 (private=-1153439408,
cancel=true, abstime=0x70debb3fea70, op=137, expected=0, futex_word=0x60aeaaeac194 <_PyRuntime+436>)
at ./nptl/futex-internal.c:57
26 Thread 0x70deaec00640 (LWP 129047) "python" __futex_abstimed_wait_common64 (private=<optimised out>,
cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x60aeacd6aea0) at ./nptl/futex-internal.c:57
27 Thread 0x70dead600640 (LWP 129048) "python" __futex_abstimed_wait_common64 (private=<optimised out>,
cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x60aeacd6aea0) at ./nptl/futex-internal.c:57
28 Thread 0x70deacc00640 (LWP 129049) "python" __futex_abstimed_wait_common64 (private=<optimised out>,
cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x60aeacd6aea0) at ./nptl/futex-internal.c:57
29 Thread 0x70de9fe00640 (LWP 129050) "python" __futex_abstimed_wait_common64 (private=<optimised out>,
cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x60aeacd6aea0) at ./nptl/futex-internal.c:57
30 Thread 0x70de9f400640 (LWP 129051) "python" __futex_abstimed_wait_common64 (private=<optimised out>,
cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x60aeacd6aea0) at ./nptl/futex-internal.c:57
31 Thread 0x70de9ea00640 (LWP 129052) "python" __futex_abstimed_wait_common64 (private=<optimised out>,
cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x60aeacd6aea0) at ./nptl/futex-internal.c:57
32 Thread 0x70de9e000640 (LWP 129053) "python" __futex_abstimed_wait_common64 (private=<optimised out>,
cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x60aeacd6aea0) at ./nptl/futex-internal.c:57
33 Thread 0x70de9d600640 (LWP 129054) "python" __futex_abstimed_wait_common64 (private=<optimised out>,
cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x60aeacd6aea0) at ./nptl/futex-internal.c:57
34 Thread 0x70de9cc00640 (LWP 129055) "python" __futex_abstimed_wait_common64 (private=<optimised out>,
cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x60aeacd6aea0) at ./nptl/futex-internal.c:57
35 Thread 0x70de97e00640 (LWP 129056) "python" __futex_abstimed_wait_common64 (private=<optimised out>,
cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x60aeacd6aea0) at ./nptl/futex-internal.c:57
36 Thread 0x70de97400640 (LWP 129057) "python" __futex_abstimed_wait_common64 (private=<optimised out>,
cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x60aeacd6aea0) at ./nptl/futex-internal.c:57
37 Thread 0x70de96a00640 (LWP 129058) "python" __futex_abstimed_wait_common64 (private=<optimised out>,
cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x60aeacd6aea0) at ./nptl/futex-internal.c:57
38 Thread 0x70de93e00640 (LWP 129059) "python" __futex_abstimed_wait_common64 (private=<optimised out>,
cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x60aeacd6aea0) at ./nptl/futex-internal.c:57
39 Thread 0x70de93400640 (LWP 129060) "python" __futex_abstimed_wait_common64 (private=<optimised out>,
cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x60aeacd6aea0) at ./nptl/futex-internal.c:57
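One way to read a listing like the one above is to group thread IDs by the `futex_word` address they are blocked on, since a large group on a single address suggests a shared lock. The following is a minimal sketch only: the `group_by_futex` helper is hypothetical, the sample contains just three entries copied from the listing, and it assumes each entry's wrapped continuation line has been joined onto one line first.

```python
import re
from collections import defaultdict

# Three entries copied from the listing above, with wrapped lines joined.
SAMPLE = """\
  24 Thread 0x70dec4200640 (LWP 129045) "python" futex_wait (private=0, expected=2, futex_word=0x60aeaccc5d58) at ../sysdeps/nptl/futex-internal.h:146
  26 Thread 0x70deaec00640 (LWP 129047) "python" __futex_abstimed_wait_common64 (private=<optimised out>, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x60aeacd6aea0) at ./nptl/futex-internal.c:57
  27 Thread 0x70dead600640 (LWP 129048) "python" __futex_abstimed_wait_common64 (private=<optimised out>, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x60aeacd6aea0) at ./nptl/futex-internal.c:57
"""

# Optional "*" marks the current thread; capture the gdb thread id and the futex address.
THREAD_RE = re.compile(r"^\*?\s*(\d+)\s+Thread\s.*?futex_word=(0x[0-9a-f]+)")


def group_by_futex(listing):
    """Map each futex_word address to the gdb thread ids waiting on it (hypothetical helper)."""
    groups = defaultdict(list)
    for line in listing.splitlines():
        match = THREAD_RE.match(line)
        if match:
            groups[match.group(2)].append(int(match.group(1)))
    return dict(groups)


if __name__ == "__main__":
    for address, thread_ids in group_by_futex(SAMPLE).items():
        print(address, thread_ids)
```

Threads blocked in `poll` carry no `futex_word` and are skipped by the pattern.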
Connected offline with @gmarkall; we concluded the above is probably an issue in nvJitLink/pynvjitlink rather than here.
It was a combination of mismatched versions (cudadevrt from 12.6 with nvJitLink from 12.5) and nvJitLink not handling this situation gracefully.
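A guard against this kind of mismatch could compare the major.minor release of each component before linking. This is a rough sketch, not what nvJitLink or numba actually does: the helper names are hypothetical, and the version strings are simply the ones mentioned above, supplied by hand rather than queried from the libraries.

```python
def major_minor(version):
    """Return (major, minor) from a dotted version string such as '12.5' or '12.5.82'."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def same_toolkit_release(first, second):
    """True when both version strings share the same major.minor release."""
    return major_minor(first) == major_minor(second)


if __name__ == "__main__":
    # Versions from the comment above; how they would be discovered is out of scope here.
    components = {"cudadevrt": "12.6", "nvJitLink": "12.5"}
    if not same_toolkit_release(components["cudadevrt"], components["nvJitLink"]):
        print("version mismatch:", components)
```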
Closing as #56 has just been merged.
This PR attempts to move some of the logic inside the pynvjitlink patch.py to work behind config.CUDA_ENABLE_MINOR_VERSION_COMPATIBILITY, such that numba may perform the patch if necessary rather than pynvjitlink itself.
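The general shape of that idea, choosing the linker based on a config flag instead of having an external package monkey-patch it in, might look roughly like the sketch below. Everything except the flag name is a hypothetical stand-in (`Config`, `Linker`, `PatchedLinker`, `select_linker`), and this is not the code in this PR.

```python
class Config:
    """Hypothetical stand-in for numba's config module."""

    CUDA_ENABLE_MINOR_VERSION_COMPATIBILITY = False


class Linker:
    """Placeholder for the default linker."""

    name = "default"


class PatchedLinker(Linker):
    """Placeholder for a linker backed by pynvjitlink."""

    name = "pynvjitlink"


def select_linker(config):
    """Pick the linker class based on the flag rather than patching it in from outside."""
    if config.CUDA_ENABLE_MINOR_VERSION_COMPATIBILITY:
        return PatchedLinker
    return Linker


if __name__ == "__main__":
    cfg = Config()
    print(select_linker(cfg).name)
    cfg.CUDA_ENABLE_MINOR_VERSION_COMPATIBILITY = True
    print(select_linker(cfg).name)
```

Keeping the decision in one function means the choice is made by the library that owns the linker, which is the motivation described above.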