ptheywood opened this issue 1 year ago
This looks like an nvrtc perf regression within CUDA 12.2.
Using `python_rtc/boids_spatial3D_bounded/boids_spatial3D.py` with `-t -v -s 1`, purging the jitify cache between runs:
Wheel CUDA | loaded CUDA (.so's) | RTC Time (s) |
---|---|---|
12.0 | 12.2 | 33.501999 |
12.0 | 12.1 | 3.763000 |
12.0 | 12.0 | 3.800000 |
11.2 | 12.2 | 34.901001 |
11.2 | 12.1 | 3.987000 |
11.2 | 12.0 | 4.092000 |
11.2 | 11.8 | 4.060000 |
11.2 | 11.2 | 2.218000 |
It's not impacted by the CUDA 12.2 change to lazy loading (I didn't think it would be relevant, but tested via `CUDA_MODULE_LOADING=EAGER` just in case).
For now, we can probably just use CUDA 12.1, but we might want to try to narrow this down further (test a jitify example / native nvrtc example) and report this to NVIDIA.
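For what it's worth, a standalone nvrtc timing harness along these lines might help separate nvrtc itself from jitify / FLAME GPU overheads. This is a rough, untested sketch: the kernel source, the compile options, and the `compute_70` arch are placeholders, and a realistic repro would need something closer to an actual agent function plus its transitive includes (which is where the cost presumably is).

```cpp
// Minimal, hypothetical sketch: time nvrtcCompileProgram in isolation.
// The kernel source and options below are placeholders, not real FLAME GPU code.
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>
#include <nvrtc.h>

#define NVRTC_CHECK(call)                                                 \
    do {                                                                  \
        nvrtcResult status_ = (call);                                     \
        if (status_ != NVRTC_SUCCESS) {                                   \
            std::fprintf(stderr, "nvrtc error: %s\n",                     \
                         nvrtcGetErrorString(status_));                   \
            return 1;                                                     \
        }                                                                 \
    } while (0)

int main() {
    // Placeholder kernel; substitute an agent function + headers for a real repro.
    const char *source =
        "extern \"C\" __global__ void dummy(float *x) {\n"
        "    x[threadIdx.x] *= 2.0f;\n"
        "}\n";

    nvrtcProgram prog;
    NVRTC_CHECK(nvrtcCreateProgram(&prog, source, "dummy.cu", 0, nullptr, nullptr));

    // Options loosely mirroring an RTC build; adjust the arch as appropriate.
    std::vector<const char *> opts = {"--gpu-architecture=compute_70", "--std=c++17"};

    const auto start = std::chrono::steady_clock::now();
    const nvrtcResult compile_status =
        nvrtcCompileProgram(prog, static_cast<int>(opts.size()), opts.data());
    const auto end = std::chrono::steady_clock::now();

    const double seconds = std::chrono::duration<double>(end - start).count();
    std::printf("nvrtcCompileProgram: %.3f s (%s)\n", seconds,
                nvrtcGetErrorString(compile_status));

    // Dump the compile log on failure to aid debugging.
    if (compile_status != NVRTC_SUCCESS) {
        std::size_t log_size = 0;
        nvrtcGetProgramLogSize(prog, &log_size);
        std::string log(log_size, '\0');
        nvrtcGetProgramLog(prog, &log[0]);
        std::fprintf(stderr, "%s\n", log.c_str());
    }
    NVRTC_CHECK(nvrtcDestroyProgram(&prog));
    return 0;
}
```

Building this against each toolkit's libnvrtc (something like `g++ nvrtc_timing.cpp -I$CUDA_HOME/include -L$CUDA_HOME/lib64 -lnvrtc`) and running it under each `module load CUDA/...` environment should indicate whether the slowdown is attributable to `nvrtcCompileProgram` itself or to something else in the RTC path.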
A CUDA 12.3 build with 12.3 at runtime had an RTC processing time of 20.773s, with driver 545.23.06, so it's still painful but not quite as bad.
With driver 545.23.06 and Python 3.10:
Wheel CUDA | loaded CUDA (.so's) | RTC Time (s) |
---|---|---|
12.3 | 12.3 | 20.773001 |
12.0 | 12.3 | 23.533001 |
12.0 | 12.2 | 23.684000 |
12.0 | 12.1 | 3.815000 |
So the driver update / different Python version seems to have helped, but performance is still bad.
Confirmed this is not hardware specific, running on a Titan V, compiled with CUDA 12.0 and driver 545.23.06:
```bash
module load CUDA/12.0
cmake .. -DCMAKE_CUDA_ARCHITECTURES="70" -DFLAMEGPU_RTC_DISK_CACHE=OFF
cmake --build . --target rtc_boids_spatial3D -j 8
```
Executed using CUDA 12.0+; only a single run each, so not perfect, but the difference is clear.
```bash
module load CUDA/12.0
./bin/Release/rtc_boids_spatial3D -s 1 -t
```
CUDA | RTC time (s) |
---|---|
12.3 | 33.048 |
12.2 | 37.532 |
12.1 | 5.634 |
12.0 | 5.746 |
Google Colab has now updated to CUDA 12.2, which makes this issue more prominent for potential FLAME GPU 2 users, with the `run_simulation` cell now taking ~3-5 minutes for the first run, and ~5 seconds for the second run...

RTC compilation previously would have been ~80s for 16 agent functions.
Recent runs of the Python test suite (CUDA 12.0, 535.104.05, Python 3.12) took a significant length of time to run under Linux.
A second run, which uses the jitify cache / Python caches, was significantly faster (965x).
This was a manylinux based wheel, so `SEATBELTS=ON`, `GLM=OFF`.
We should probably investigate this if we are going to push the Python side more thoroughly; 50 mins of jitting for 3s of total runtime is bad (the test suite is more or less worst-case compilation vs model runtime, but it's pretty bad).
Best guess is that nvrtc has got slower with CUDA 12.x, which compounds into a very long time, but we would need to investigate to know for certain (profile the test suite / compare different CUDA versions). Just running a Python example with `-t -v` might be enough for a quick confirmation of whether it's RTC time or not (with different CUDAs).