NVIDIA / numba-cuda

BSD 2-Clause "Simplified" License

Use `pynvjitlink` for MVC #23

Closed brandon-b-miller closed 1 month ago

brandon-b-miller commented 3 months ago

This PR attempts to move some of the logic from pynvjitlink's `patch.py` behind `config.CUDA_ENABLE_MINOR_VERSION_COMPATIBILITY`, so that numba performs the patch when necessary rather than pynvjitlink itself.
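
For readers unfamiliar with the pattern, here is a minimal sketch of gating a patch behind a config flag. Everything except the flag name is an assumption made for illustration: `DefaultLinker`, `MVCLinker`, `driver`, and `maybe_patch_linker` are hypothetical stand-ins and do not mirror the real numba or pynvjitlink internals.

```python
import types

# Hypothetical stand-in for numba's config module; only the flag name
# comes from the PR description.
config = types.SimpleNamespace(CUDA_ENABLE_MINOR_VERSION_COMPATIBILITY=False)


class DefaultLinker:
    """Placeholder for the linker used when MVC is disabled."""
    name = "default"


class MVCLinker:
    """Placeholder for a linker backed by pynvjitlink."""
    name = "mvc"


# Hypothetical stand-in for the module whose linker gets patched.
driver = types.SimpleNamespace(Linker=DefaultLinker)


def maybe_patch_linker(config, driver):
    """Swap in the MVC-capable linker only when the flag is set."""
    if config.CUDA_ENABLE_MINOR_VERSION_COMPATIBILITY:
        driver.Linker = MVCLinker
        return True
    return False


# With the flag off, nothing is patched.
patched_off = maybe_patch_linker(config, driver)
linker_when_off = driver.Linker.name

# With the flag on, the caller (numba in this PR) applies the patch itself.
config.CUDA_ENABLE_MINOR_VERSION_COMPATIBILITY = True
patched_on = maybe_patch_linker(config, driver)
linker_when_on = driver.Linker.name
```

The point of the design is that the decision to patch lives with the consumer of the flag, so the library being wrapped no longer needs its own patching entry point.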

brandon-b-miller commented 3 months ago

> Thanks for the updates - I added a couple more comments on the diff.

Thank you for the review :) I've made some changes that hopefully address your comments.

brandon-b-miller commented 1 month ago

https://github.com/NVIDIA/numba-cuda/pull/23/commits/7c384a3b202f8e2e93514edb48d5c372b70a5f74 adds the relevant tests from pynvjitlink. I suppose they may exist in both places for a short time while we deprecate the patching API in pynvjitlink itself.

It occurs to me that the tests must still be built in the CI test jobs if we'd like to run them there, which now involves compiling a few files with nvcc. I think this is a problem I still need to solve in this PR.

brandon-b-miller commented 1 month ago

The tests now pass, modulo what appears to be an intermittent error on the conda pynvjitlink test job. I do not seem to have permissions to rerun this job.

gmarkall commented 1 month ago

Running the testsuite on my local system seems to be deadlocking at:

test_lock (numba.cuda.tests.cudapy.test_dispatcher.TestDispatcher.test_lock)
Test that (lazy) compiling from several threads at once doesn't ...

when pynvjitlink is enabled:

ENABLE_PYNVJITLINK=1 python -m numba.runtests numba.cuda.tests -v -m
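
As a Python-level complement to a native debugger, the standard-library `faulthandler` module can dump the Python stack of every thread in a stuck process. The sketch below is only an illustration of that technique, not part of the test suite: `dump_all_thread_stacks` is a hypothetical helper, and the parked worker thread merely stands in for a thread blocked on a lock.

```python
import faulthandler
import tempfile
import threading


def dump_all_thread_stacks():
    """Return the Python-level stack of every thread as text."""
    # faulthandler writes straight to the file descriptor, so a real
    # file (not io.StringIO) is required.
    with tempfile.TemporaryFile(mode="w+") as f:
        faulthandler.dump_traceback(file=f, all_threads=True)
        f.seek(0)
        return f.read()


# A parked worker thread stands in for a thread stuck waiting on a lock.
release = threading.Event()
worker = threading.Thread(target=release.wait, daemon=True)
worker.start()

report = dump_all_thread_stacks()

# Let the worker finish so the example exits cleanly.
release.set()
worker.join()
```

For a run that hangs without returning control, `faulthandler.dump_traceback_later(timeout)` arms a watchdog that prints the same kind of report once the timeout elapses.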

gmarkall commented 1 month ago

Threads:

(gdb) info threads
  Id   Target Id                                            Frame 
* 1    Thread 0x70df86a90740 (LWP 128088) "python"          __futex_abstimed_wait_common64 (private=<optimised out>, 
    cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x70de98395330) at ./nptl/futex-internal.c:57
  2    Thread 0x70df83600640 (LWP 128089) "python"          __futex_abstimed_wait_common64 (private=0, cancel=true, 
    abstime=0x0, op=393, expected=0, futex_word=0x70df859ceb60 <thread_status+96>) at ./nptl/futex-internal.c:57
  3    Thread 0x70df82c00640 (LWP 128090) "python"          __futex_abstimed_wait_common64 (private=0, cancel=true, 
    abstime=0x0, op=393, expected=0, futex_word=0x70df859cebe0 <thread_status+224>) at ./nptl/futex-internal.c:57
  4    Thread 0x70df82200640 (LWP 128091) "python"          __futex_abstimed_wait_common64 (private=0, cancel=true, 
    abstime=0x0, op=393, expected=0, futex_word=0x70df859cec60 <thread_status+352>) at ./nptl/futex-internal.c:57
  5    Thread 0x70df81800640 (LWP 128092) "python"          __futex_abstimed_wait_common64 (private=0, cancel=true, 
    abstime=0x0, op=393, expected=0, futex_word=0x70df859cece0 <thread_status+480>) at ./nptl/futex-internal.c:57
  6    Thread 0x70df80e00640 (LWP 128093) "python"          __futex_abstimed_wait_common64 (private=0, cancel=true, 
    abstime=0x0, op=393, expected=0, futex_word=0x70df859ced60 <thread_status+608>) at ./nptl/futex-internal.c:57
  7    Thread 0x70df78400640 (LWP 128094) "python"          __futex_abstimed_wait_common64 (private=0, cancel=true, 
    abstime=0x0, op=393, expected=0, futex_word=0x70df859cede0 <thread_status+736>) at ./nptl/futex-internal.c:57
  8    Thread 0x70df77a00640 (LWP 128095) "python"          __futex_abstimed_wait_common64 (private=0, cancel=true, 
    abstime=0x0, op=393, expected=0, futex_word=0x70df859cee60 <thread_status+864>) at ./nptl/futex-internal.c:57
  9    Thread 0x70df67000640 (LWP 128096) "python"          __futex_abstimed_wait_common64 (private=0, cancel=true, 
    abstime=0x0, op=393, expected=0, futex_word=0x70df859ceee0 <thread_status+992>) at ./nptl/futex-internal.c:57
  10   Thread 0x70df66600640 (LWP 128097) "python"          __futex_abstimed_wait_common64 (private=0, cancel=true, 
    abstime=0x0, op=393, expected=0, futex_word=0x70df859cef60 <thread_status+1120>) at ./nptl/futex-internal.c:57
  11   Thread 0x70df45c00640 (LWP 128098) "python"          __futex_abstimed_wait_common64 (private=0, cancel=true, 
    abstime=0x0, op=393, expected=0, futex_word=0x70df859cefe0 <thread_status+1248>) at ./nptl/futex-internal.c:57
  12   Thread 0x70df45200640 (LWP 128099) "python"          __futex_abstimed_wait_common64 (private=0, cancel=true, 
    abstime=0x0, op=393, expected=0, futex_word=0x70df859cf060 <thread_status+1376>) at ./nptl/futex-internal.c:57
  13   Thread 0x70df34800640 (LWP 128100) "python"          __futex_abstimed_wait_common64 (private=0, cancel=true, 
    abstime=0x0, op=393, expected=0, futex_word=0x70df859cf0e0 <thread_status+1504>) at ./nptl/futex-internal.c:57
  14   Thread 0x70df33e00640 (LWP 128101) "python"          __futex_abstimed_wait_common64 (private=0, cancel=true, 
    abstime=0x0, op=393, expected=0, futex_word=0x70df859cf160 <thread_status+1632>) at ./nptl/futex-internal.c:57
  15   Thread 0x70df23400640 (LWP 128102) "python"          __futex_abstimed_wait_common64 (private=0, cancel=true, 
    abstime=0x0, op=393, expected=0, futex_word=0x70df859cf1e0 <thread_status+1760>) at ./nptl/futex-internal.c:57
  16   Thread 0x70df12a00640 (LWP 128103) "python"          __futex_abstimed_wait_common64 (private=0, cancel=true, 
    abstime=0x0, op=393, expected=0, futex_word=0x70df859cf260 <thread_status+1888>) at ./nptl/futex-internal.c:57
  17   Thread 0x70df0a000640 (LWP 128104) "python"          __futex_abstimed_wait_common64 (private=0, cancel=true, 
    abstime=0x0, op=393, expected=0, futex_word=0x70df859cf2e0 <thread_status+2016>) at ./nptl/futex-internal.c:57
  18   Thread 0x70df01600640 (LWP 128105) "python"          __futex_abstimed_wait_common64 (private=0, cancel=true, 
    abstime=0x0, op=393, expected=0, futex_word=0x70df859cf360 <thread_status+2144>) at ./nptl/futex-internal.c:57
  19   Thread 0x70def8c00640 (LWP 128106) "python"          __futex_abstimed_wait_common64 (private=0, cancel=true, 
    abstime=0x0, op=393, expected=0, futex_word=0x70df859cf3e0 <thread_status+2272>) at ./nptl/futex-internal.c:57
  20   Thread 0x70def0200640 (LWP 128107) "python"          __futex_abstimed_wait_common64 (private=0, cancel=true, 
    abstime=0x0, op=393, expected=0, futex_word=0x70df859cf460 <thread_status+2400>) at ./nptl/futex-internal.c:57
  21   Thread 0x70ded2200640 (LWP 128117) "cuda00001800007" 0x000070df86918bcf in __GI___poll (fds=0x60aeacfb6e10, 
    nfds=3, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
  22   Thread 0x70dec3800640 (LWP 129038) "python"          __futex_abstimed_wait_common64 (private=0, cancel=true, 
    abstime=0x0, op=393, expected=0, futex_word=0x60aead59e718) at ./nptl/futex-internal.c:57
  23   Thread 0x70debaa00640 (LWP 129039) "cuda-EvtHandlr"  0x000070df86918bcf in __GI___poll (fds=0x70debc001cf0, 
    nfds=10, timeout=100) at ../sysdeps/unix/sysv/linux/poll.c:29
  24   Thread 0x70dec4200640 (LWP 129045) "python"          futex_wait (private=0, expected=2, 
    futex_word=0x60aeaccc5d58) at ../sysdeps/nptl/futex-internal.h:146
  25   Thread 0x70debb400640 (LWP 129046) "python"          __futex_abstimed_wait_common64 (private=-1153439408, 
    cancel=true, abstime=0x70debb3fea70, op=137, expected=0, futex_word=0x60aeaaeac194 <_PyRuntime+436>)
    at ./nptl/futex-internal.c:57
  26   Thread 0x70deaec00640 (LWP 129047) "python"          __futex_abstimed_wait_common64 (private=<optimised out>, 
    cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x60aeacd6aea0) at ./nptl/futex-internal.c:57
  27   Thread 0x70dead600640 (LWP 129048) "python"          __futex_abstimed_wait_common64 (private=<optimised out>, 
    cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x60aeacd6aea0) at ./nptl/futex-internal.c:57
  28   Thread 0x70deacc00640 (LWP 129049) "python"          __futex_abstimed_wait_common64 (private=<optimised out>, 
    cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x60aeacd6aea0) at ./nptl/futex-internal.c:57
  29   Thread 0x70de9fe00640 (LWP 129050) "python"          __futex_abstimed_wait_common64 (private=<optimised out>, 
    cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x60aeacd6aea0) at ./nptl/futex-internal.c:57
  30   Thread 0x70de9f400640 (LWP 129051) "python"          __futex_abstimed_wait_common64 (private=<optimised out>, 
    cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x60aeacd6aea0) at ./nptl/futex-internal.c:57
  31   Thread 0x70de9ea00640 (LWP 129052) "python"          __futex_abstimed_wait_common64 (private=<optimised out>, 
    cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x60aeacd6aea0) at ./nptl/futex-internal.c:57
  32   Thread 0x70de9e000640 (LWP 129053) "python"          __futex_abstimed_wait_common64 (private=<optimised out>, 
    cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x60aeacd6aea0) at ./nptl/futex-internal.c:57
  33   Thread 0x70de9d600640 (LWP 129054) "python"          __futex_abstimed_wait_common64 (private=<optimised out>, 
    cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x60aeacd6aea0) at ./nptl/futex-internal.c:57
  34   Thread 0x70de9cc00640 (LWP 129055) "python"          __futex_abstimed_wait_common64 (private=<optimised out>, 
    cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x60aeacd6aea0) at ./nptl/futex-internal.c:57
  35   Thread 0x70de97e00640 (LWP 129056) "python"          __futex_abstimed_wait_common64 (private=<optimised out>, 
    cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x60aeacd6aea0) at ./nptl/futex-internal.c:57
  36   Thread 0x70de97400640 (LWP 129057) "python"          __futex_abstimed_wait_common64 (private=<optimised out>, 
    cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x60aeacd6aea0) at ./nptl/futex-internal.c:57
  37   Thread 0x70de96a00640 (LWP 129058) "python"          __futex_abstimed_wait_common64 (private=<optimised out>, 
    cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x60aeacd6aea0) at ./nptl/futex-internal.c:57
  38   Thread 0x70de93e00640 (LWP 129059) "python"          __futex_abstimed_wait_common64 (private=<optimised out>, 
    cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x60aeacd6aea0) at ./nptl/futex-internal.c:57
  39   Thread 0x70de93400640 (LWP 129060) "python"          __futex_abstimed_wait_common64 (private=<optimised out>, 
    cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x60aeacd6aea0) at ./nptl/futex-internal.c:57

brandon-b-miller commented 1 month ago

Connected offline with @gmarkall; we concluded the above is probably an issue in nvJitLink/pynvjitlink rather than in this PR.

gmarkall commented 1 month ago

> Connected offline with @gmarkall , we concluded the above is probably an issue in nvJitLink/pynvjitlink rather than here.

It was a combination of mismatched versions (cudadevrt from 12.6 with nvJitLink from 12.5) and nvJitLink not handling this situation gracefully.
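
A mismatch like this can be surfaced early with a simple guard. The sketch below is an assumption-laden illustration: `parse_major_minor` and `find_version_mismatches` are hypothetical helpers, and the version strings are passed in by hand because how each component's version would actually be discovered is not covered here.

```python
def parse_major_minor(version):
    """Turn a string such as '12.6' or '12.6.77' into (12, 6)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def find_version_mismatches(components):
    """Return names whose (major, minor) differs from the first component's.

    `components` maps a component name to its version string; the first
    entry is treated as the reference.
    """
    items = list(components.items())
    _, reference_version = items[0]
    reference = parse_major_minor(reference_version)
    return [
        name
        for name, version in items[1:]
        if parse_major_minor(version) != reference
    ]


# The combination described above: cudadevrt 12.6 with nvJitLink 12.5.
mismatched = find_version_mismatches({"nvJitLink": "12.5", "cudadevrt": "12.6"})

# A consistent installation reports no mismatches.
matched = find_version_mismatches({"nvJitLink": "12.5", "cudadevrt": "12.5"})
```

Failing with a clear message when the list is non-empty would be one way to avoid a silent hang, though whether such a check belongs in numba, pynvjitlink, or nvJitLink itself is left open.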

gmarkall commented 1 month ago

Closing as #56 has just been merged.