NVIDIA / warp

A Python framework for high performance GPU simulation and graphics
https://nvidia.github.io/warp/

Timers for the DEM example are confusing #277

Open cadop opened 4 months ago

cadop commented 4 months ago

Bug Description

DEM example becomes extremely slow with higher particle count.

https://github.com/NVIDIA/warp/blob/acfeabcf8f3414e1b131df3b47027dc5789c9113/warp/examples/core/example_dem.py#L136

When using `self.points = self.particle_grid(1024, 1024, 256, (0.0, 0.5, 0.0), self.point_radius, 0.1)`, there is no issue and each step takes under 0.5 ms.

When using `self.points = self.particle_grid(1024, 1024, 512, (0.0, 0.5, 0.0), self.point_radius, 0.1)`, the first ~20 or so frames are fine, and then each step starts taking minutes to compute.

[screenshot: per-step timing output]

(The render timings look like that only because I commented out the point renderer.)

This is the difference between roughly 270 million and 540 million particles, so I don't think it's an order-of-magnitude jump that should cause this problem.

If this isn't an issue, please feel free to transfer it to Discussions as a Q&A!

System Information

Win11

Warp 1.2.2 initialized:
   CUDA Toolkit 11.8, Driver 12.5
   Devices:
     "cpu"    : "AMD64 Family 25 Model 24 Stepping 1, AuthenticAMD"
     "cuda:1" : "NVIDIA RTX A6000" (48 GiB, sm_86, mempool enabled)

christophercrouzet commented 4 months ago

Hi @cadop, I personally can't run `self.points = self.particle_grid(1024, 1024, 256, (0.0, 0.5, 0.0), self.point_radius, 0.1)` because my GPU runs out of memory, so I'm assuming the same might be happening in your case when you crank the resolution up a bit further? Can you check the memory usage of your GPU?

cadop commented 3 months ago

Hey,

It doesn't seem to be the case; for 1024, 1024, 512, this is a screenshot of the memory usage.

[screenshot: GPU memory usage]

If it were a memory issue, why would it run for the first few frames and never raise a memory error later?

If I go to 1024,1024,1024 I do get an out of memory error.

shi-eric commented 3 months ago

@cadop: Can you modify https://github.com/NVIDIA/warp/blob/acfeabcf8f3414e1b131df3b47027dc5789c9113/warp/examples/core/example_dem.py#L178 to use `with wp.ScopedTimer("step", synchronize=True):`?

I noticed similar behavior with the timing information printed to stdout at lower resolutions. Since you turned off rendering, there are no CPU synchronizations, so the `wp.ScopedTimer` timings become quite confusing unless you use `synchronize=True`.

You can just run with `python example_dem.py --stage_path None` to skip the render step.
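
For reference, here is a standalone toy sketch of what `synchronize=True` changes (the kernel, array size, loop count, and device below are made up for illustration, not taken from the example):

```python
import warp as wp

wp.init()

@wp.kernel
def scale(a: wp.array(dtype=float), s: float):
    tid = wp.tid()
    a[tid] = a[tid] * s

a = wp.zeros(10_000_000, dtype=float, device="cuda:0")

# Kernel launches are asynchronous, so this timer mostly measures how long it
# takes to *submit* the work to the GPU, not how long the GPU takes to run it.
with wp.ScopedTimer("submit only"):
    for _ in range(100):
        wp.launch(scale, dim=a.shape[0], inputs=[a, 1.001], device="cuda:0")

# With synchronize=True the timer waits for the GPU to finish before stopping,
# so it reflects the actual execution time of the launched kernels.
with wp.ScopedTimer("submit + execute", synchronize=True):
    for _ in range(100):
        wp.launch(scale, dim=a.shape[0], inputs=[a, 1.001], device="cuda:0")
```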

cadop commented 3 months ago

Be careful what you wish for... yes, `synchronize=True` made everything equally slow :).

@shi-eric I can do some more exploration on my own, but given that you have experience with this: is the stdout output actually causing an issue? What causes the change in performance with `synchronize=True`?

For example, if I need to run the simulation with a large number of particles but only send some average particle position back to the CPU, I wouldn't need to copy the entire array of positions again (see the sketch below). So if I am not rendering the points, would the original test reflect the time spent on the GPU?
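
(Something like this hypothetical sketch is what I mean by only reading back a reduced value; the kernel, sizes, and device are placeholders:)

```python
import warp as wp

wp.init()

@wp.kernel
def accumulate(points: wp.array(dtype=wp.vec3), total: wp.array(dtype=wp.vec3)):
    tid = wp.tid()
    wp.atomic_add(total, 0, points[tid])

n = 1_000_000
points = wp.zeros(n, dtype=wp.vec3, device="cuda:0")
total = wp.zeros(1, dtype=wp.vec3, device="cuda:0")

# Reduce on the GPU; only a single vec3 crosses back to the host below.
wp.launch(accumulate, dim=n, inputs=[points, total], device="cuda:0")

# .numpy() synchronizes and copies just the one-element array, not all positions.
average = total.numpy()[0] / n
```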

For posterity: my original post was from a situation where I was rendering, just not the points.

[screenshot: timing output with rendering enabled, points excluded]

shi-eric commented 3 months ago

> What causes the change in performance for synchronize?

The issue here is that, due to the asynchronous nature of the CUDA API, the default output of `wp.ScopedTimer` is usually not the time it takes for a block of code to be executed on the GPU. Instead, it typically measures the time it takes to submit the work to the GPU. @nvlukasz recently expanded our docs about this: https://nvidia.github.io/warp/profiling.html#scopedtimer

My guess as to why the `wp.ScopedTimer` in your original question started reporting times that match the execution time rather than the submission time is that, after a certain number of calls, the GPU stops accepting new items in the queue until it finishes processing the current one. @nvlukasz, what do you think?

cadop commented 3 months ago

That doc definitely helps; I'll dig into this more.

The profiler output is pretty confusing within the DEM example. If I turn rendering back on, I get `render took 3932 ms ... step took 0.25 ms`. I am guessing this is because `self.x.numpy()` is inside the render scope, so the sync doesn't happen in the step. Maybe `synchronize` should be enabled by default (in the demo)? Or I could make a PR to move the `numpy()` call out of the renderer and into the step scope.

nvlukasz commented 3 months ago

Hi @cadop, I agree that the output of the timers is a little confusing/misleading. The "step" timer only measures the time to schedule the CUDA work, not the time taken on the GPU to do the work. When rendering is enabled, there's generally some readback, as you noticed with the `.numpy()` calls. This readback is synchronous, so the "render" timer includes the time spent waiting for the previous GPU step to complete, which is misleading.
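
To make the pattern concrete, here is a standalone toy (not the example's actual code; the kernel, array size, and loop count are made up): the synchronous `.numpy()` readback under the second timer ends up absorbing the GPU time of the work launched under the first.

```python
import warp as wp

wp.init()

@wp.kernel
def integrate(x: wp.array(dtype=wp.vec3), dt: float):
    tid = wp.tid()
    x[tid] = x[tid] + wp.vec3(0.0, -9.8 * dt, 0.0)

x = wp.zeros(10_000_000, dtype=wp.vec3, device="cuda:0")

# "step" finishes almost immediately: it only queues the asynchronous launches.
with wp.ScopedTimer("step"):
    for _ in range(50):
        wp.launch(integrate, dim=x.shape[0], inputs=[x, 1.0 / 60.0], device="cuda:0")

# "render" calls .numpy(), which copies to the host and therefore synchronizes,
# so this timer also includes the GPU time of the kernels launched under "step".
with wp.ScopedTimer("render"):
    x_host = x.numpy()
```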

We can improve the existing timers to make them more accurate. Setting `ScopedTimer(..., synchronize=True)` would go a long way toward clearing up the confusion: it shows accurate measurements of how long each step and render take, including both the CPU and GPU work.

We could separate the readback under its own timer, but that would complicate the examples a bit (and we should do it in all the examples for consistency). The examples are meant to be short and simple; they're not really meant to be precise benchmarks. I think synchronized timings should clarify things sufficiently.

Doing proper CPU/GPU timings can be a bit of a dark art sometimes :)