FLAMEGPU / FLAMEGPU2

FLAME GPU 2 is a GPU accelerated agent based modelling framework for CUDA C++ and Python
https://flamegpu.com
MIT License

Python / NVRTC performance (CUDA 12.2+) #1118

Open ptheywood opened 9 months ago

ptheywood commented 9 months ago

Recent runs of the Python test suite (CUDA 12.0, driver 535.104.05, Python 3.12) took a significant length of time to run under Linux:

```
650 passed, 11 skipped, 69 warnings in 3080.89s (0:51:20)
```

A second run, which uses the jitify cache / Python caches, was significantly faster (~965x):

```
650 passed, 11 skipped, 69 warnings in 3.19s
```

This was a manylinux-based wheel, so `SEATBELTS=ON`, `GLM=OFF`.

We should probably investigate this if we are going to push the Python side more thoroughly; 50 minutes of jitting for 3 s of total runtime is bad (the test suite is more or less the worst case of compilation vs model runtime, but it's still pretty bad).


Best guess is that nvrtc has got slower with CUDA 12.x, which compounds into a very long total time, but we would need to investigate to know for certain (profile the test suite / compare different CUDA versions). Just running a Python example with `-t -v` might be enough for a quick confirmation of whether it's RTC time or not (with different CUDA versions).
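
As a rough first check outside the shipped examples, the compile cost can also be timed directly from Python. The following is a minimal sketch (assuming the usual pyflamegpu RTC workflow; the model and agent function here are illustrative, not taken from the repository) that measures the cold-cache wall-clock cost of constructing and stepping a simulation with one RTC agent function:

```python
import time
import pyflamegpu

# Trivial RTC agent function source; this string is what NVRTC (via jitify)
# compiles at runtime.
AGENT_FUNC = r"""
FLAMEGPU_AGENT_FUNCTION(move, flamegpu::MessageNone, flamegpu::MessageNone) {
    FLAMEGPU->setVariable<float>("x", FLAMEGPU->getVariable<float>("x") + 1.0f);
    return flamegpu::ALIVE;
}
"""

model = pyflamegpu.ModelDescription("nvrtc_timing_check")
agent = model.newAgent("point")
agent.newVariableFloat("x")
fn = agent.newRTCFunction("move", AGENT_FUNC)
model.newLayer().addAgentFunction(fn)

pop = pyflamegpu.AgentVector(agent, 1024)

# Time construction plus a single step; RTC compilation happens before the first
# step executes, so on a cold jitify cache this is dominated by the NVRTC compile.
start = time.perf_counter()
sim = pyflamegpu.CUDASimulation(model)
sim.SimulationConfig().steps = 1
sim.setPopulationData(pop)
sim.simulate()
print(f"construct + 1 step (incl. RTC compile): {time.perf_counter() - start:.3f} s")
```

Running it twice (with and without purging the jitify cache) separates compile time from everything else.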

ptheywood commented 9 months ago

This looks like an nvrtc performance regression introduced in CUDA 12.2.

Using `python_rtc/boids_spatial3D_bounded/boids_spatial3D.py` with `-t -v -s 1`, purging the jitify cache between runs:

| Wheel CUDA | Loaded CUDA (`.so`s) | RTC time (s) |
|------------|----------------------|--------------|
| 12.0       | 12.2                 | 33.501999    |
| 12.0       | 12.1                 | 3.763000     |
| 12.0       | 12.0                 | 3.800000     |
| 11.2       | 12.2                 | 34.901001    |
| 11.2       | 12.1                 | 3.987000     |
| 11.2       | 12.0                 | 4.092000     |
| 11.2       | 11.8                 | 4.060000     |
| 11.2       | 11.2                 | 2.218000     |

It's not impacted by the CUDA 12.2 change to lazy module loading (didn't think it would be relevant, but tested via `CUDA_MODULE_LOADING=EAGER` just in case).
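
That check is just a process-level environment override, applied before any CUDA context is created. A minimal sketch, assuming it is set before pyflamegpu initialises CUDA:

```python
import os

# CUDA 12.2 switched the default to lazy module loading; force the previous eager
# behaviour for this process to rule it out. Must be set before a CUDA context exists.
os.environ["CUDA_MODULE_LOADING"] = "EAGER"

import pyflamegpu  # import / CUDA initialisation only after the override
```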

For now, we can probably just use CUDA 12.1, but we might want to try to narrow this down further (test a jitify example / native nvrtc example) and report this to NVIDIA.
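
For the "native nvrtc example" route, a minimal standalone timing check, independent of jitify and FLAME GPU, might look like the sketch below (assuming NVIDIA's cuda-python bindings are installed, e.g. `pip install cuda-python`; the kernel and compile option are illustrative). A toy kernel will not reproduce the full FLAME GPU compile cost, which is dominated by the headers each agent function pulls in, but timing it across CUDA versions is a quick first signal to attach to a report:

```python
import time
from cuda import nvrtc

# Small illustrative kernel; real FLAME GPU agent functions include far more headers.
SRC = b"""
extern "C" __global__ void saxpy(float a, const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}
"""

err, major, minor = nvrtc.nvrtcVersion()
print(f"NVRTC {major}.{minor}")

# Create and compile the program, timing only nvrtcCompileProgram.
err, prog = nvrtc.nvrtcCreateProgram(SRC, b"saxpy.cu", 0, [], [])
opts = [b"--gpu-architecture=compute_70"]

start = time.perf_counter()
err, = nvrtc.nvrtcCompileProgram(prog, len(opts), opts)
print(f"nvrtcCompileProgram: {time.perf_counter() - start:.3f} s (result: {err})")
```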

ptheywood commented 8 months ago

A CUDA 12.3 build with 12.3 at runtime had an RTC processing time of 20.773 s with driver 545.23.06, so it's still painful but not quite as bad.

With driver 545.23.06 and Python 3.10:

| Wheel CUDA | Loaded CUDA (`.so`s) | RTC time (s) |
|------------|----------------------|--------------|
| 12.3       | 12.3                 | 20.773001    |
| 12.0       | 12.3                 | 23.533001    |
| 12.0       | 12.2                 | 23.684000    |
| 12.0       | 12.1                 | 3.815000     |

So the driver update / different Python version seems to have helped, but performance is still bad.

ptheywood commented 7 months ago

Confirmed this is not hardware-specific by running on a Titan V, compiled with CUDA 12.0 under driver 545.23.06:

```bash
module load CUDA/12.0
cmake .. -DCMAKE_CUDA_ARCHITECTURES="70" -DFLAMEGPU_RTC_DISK_CACHE=OFF
cmake --build . --target rtc_boids_spatial3D -j 8
```

Executed with each of CUDA 12.0 through 12.3 at runtime; only a single run each, so not perfect, but the difference is clear.

```bash
module load CUDA/12.0
./bin/Release/rtc_boids_spatial3D -s 1 -t
```

| CUDA | RTC time (s) |
|------|--------------|
| 12.3 | 33.048       |
| 12.2 | 37.532       |
| 12.1 | 5.634        |
| 12.0 | 5.746        |

ptheywood commented 3 months ago

Google Colab has now updated to CUDA 12.2, which makes this issue more prominent to potential FLAME GPU 2 users, with the `run_simulation` cell now taking ~3-5 minutes for the first run, and ~5 seconds for the second run...

RTC compilation previously would have been ~80s for 16 agent functions.