NVIDIA / Fuser

A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser")

Dynamic shape host latency is slow #3419

Closed jacobhinkle closed 1 day ago

jacobhinkle commented 6 days ago

It appears that our dynamic shape latency is about 10x higher than expected. These are the times for our three existing benchmarks:

[Image]

I verified on my local machine that this is not a measurement error. For example, on my machine, which is less powerful than the one used to generate that graph, I see the following nsys trace for the adaptive layernorm test:

[Image]

We need to add some instrumentation and determine the cause of the slowdown. Since all three of these tests are slow and they use different schedulers, I suspect it might be a general scheduling utility that got slower.
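As a rough illustration of the kind of instrumentation meant here, the sketch below accumulates wall-clock time per label with an RAII timer, so that a suspect utility can be wrapped and compared across calls. It is a minimal standalone sketch and does not use nvFuser's own profiling facilities; `ScopedTimer`, `timingRegistry`, `averageNs`, and the label string in the usage note are all hypothetical names introduced for this example.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Hypothetical bookkeeping: total elapsed time and call count per label.
struct TimingStats {
  std::int64_t total_ns = 0;
  std::int64_t calls = 0;
};

// Hypothetical process-wide registry (not thread-safe; sketch only).
inline std::map<std::string, TimingStats>& timingRegistry() {
  static std::map<std::string, TimingStats> registry;
  return registry;
}

// RAII timer: measures from construction to destruction and adds the
// elapsed time to the registry entry for its label.
class ScopedTimer {
 public:
  explicit ScopedTimer(std::string label)
      : label_(std::move(label)), start_(std::chrono::steady_clock::now()) {}

  ~ScopedTimer() {
    const auto elapsed = std::chrono::steady_clock::now() - start_;
    TimingStats& stats = timingRegistry()[label_];
    stats.total_ns +=
        std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed).count();
    stats.calls += 1;
  }

  ScopedTimer(const ScopedTimer&) = delete;
  ScopedTimer& operator=(const ScopedTimer&) = delete;

 private:
  std::string label_;
  std::chrono::steady_clock::time_point start_;
};

// Mean time per call for a label; throws std::out_of_range if the label
// was never recorded.
inline double averageNs(const std::string& label) {
  const TimingStats& stats = timingRegistry().at(label);
  assert(stats.calls > 0);
  return static_cast<double>(stats.total_ns) / static_cast<double>(stats.calls);
}
```

Placing `ScopedTimer timer("compile_time_checks");` at the top of a suspect function and printing the registry at exit would show which utility dominates the host time, assuming the overhead of the timer itself is small relative to the code being measured.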

jacobhinkle commented 6 days ago

We should determine the root cause of the slowdown, but it might also be a good idea to bypass the compile-time checks in the kernel reuse case, since we know they must all have passed previously if we already accepted these segments.
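The bypass idea can be sketched as follows: remember on the segment that the compile-time checks already passed, and on later calls run only the checks that depend on the concrete input shapes. This is a simplified standalone sketch, not the actual nvFuser code; `Segment`, `Heuristics`, `maybeHeuristicsForSketch`, and the two check callbacks are hypothetical stand-ins, and the real `getMaybeHeuristicsFor` signature is not reproduced here.

```cpp
#include <cassert>
#include <functional>
#include <memory>

// Hypothetical stand-in for an already-segmented piece of a fusion.
struct Segment {
  // Set once the compile-time checks have passed for this segment.
  bool compile_time_checks_passed = false;
};

// Hypothetical stand-in for the scheduler parameters produced per input set.
struct Heuristics {};

// Returns heuristics for the segment, or nullptr if a check rejects it.
// The compile-time check runs at most once per segment; the run-time check
// runs on every call because it depends on the concrete input shapes.
inline std::unique_ptr<Heuristics> maybeHeuristicsForSketch(
    Segment& segment,
    const std::function<bool()>& compile_time_check,
    const std::function<bool()>& run_time_check) {
  assert(compile_time_check && run_time_check);

  if (!segment.compile_time_checks_passed) {
    if (!compile_time_check()) {
      return nullptr;
    }
    // Shape-independent result: safe to remember for later reuse.
    segment.compile_time_checks_passed = true;
  }

  if (!run_time_check()) {
    return nullptr;
  }
  return std::make_unique<Heuristics>();
}
```

The sketch relies on the assumption stated above: the compile-time checks depend only on the fusion definition, so a segment that was accepted once cannot fail them for new input shapes.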

jacobhinkle commented 6 days ago

I disabled the compile-time checks in getMaybeHeuristicsFor and now I see this:

[Image]