LunarG / gfxreconstruct

Graphics API Capture and Replay Tools for Reconstructing Graphics Application Behavior
https://vulkan.lunarg.com/doc/sdk/latest/linux/capture_tools.html
MIT License

Async compute work submitted at different time between replayer & native app #1066

Open rurra-amd opened 1 year ago

rurra-amd commented 1 year ago

Sometimes the timing of async compute submissions differs between the native app and the replayer: the app submits at time X, but the replayer submits at time Y. For accurate perf analysis, the replayer's submission timing needs to track the app's more closely.

per-mathisen-arm commented 1 year ago

Do you have some more information on when or how this happens?

andrew-lunarg commented 1 year ago

Caveat: I am more familiar with Vulkan capture, and this issue may refer to D3D12. That said, we synchronise threads implicitly inside the fwrite, so whenever multiple threads are recording, even on a single core, a thread can be preempted between forwarding a call down the chain to the driver and recording it into the trace. During that preemption, a second thread could issue an API call down the chain and have it recorded in the trace. When the original thread is next scheduled, it issues the fwrite to record its call, with the result that the trace contains the two calls in a different order from the one in which they reached the driver.
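
To make the interleaving concrete, here is a minimal sketch that replays one fixed schedule by hand. It is not gfxreconstruct code: the two-step model ("forward" to the driver, then "record" to the trace), the thread names `A` and `B`, and the `run_schedule` helper are all assumptions for illustration.

```python
def run_schedule(schedule):
    """Replay a fixed interleaving of (thread, step) events.

    Each API call is modelled as two separate steps:
    'forward' hands the call to the driver, 'record' writes it to the trace.
    """
    driver_order = []  # order in which calls reach the (pretend) driver
    trace_order = []   # order in which calls land in the (pretend) trace file
    for thread, step in schedule:
        if step == "forward":
            driver_order.append(thread)
        elif step == "record":
            trace_order.append(thread)
        else:
            raise ValueError(f"unknown step: {step}")
    return driver_order, trace_order


# Thread A forwards its call, is preempted, and only records it after
# thread B has both forwarded and recorded its own call.
preempted = [
    ("A", "forward"),
    ("B", "forward"),
    ("B", "record"),
    ("A", "record"),
]
driver_order, trace_order = run_schedule(preempted)
print("driver saw:", driver_order)
print("trace has: ", trace_order)
```

With this schedule the driver sees A before B while the trace holds B before A, which is the reordering described above.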

If the new option implemented here (https://github.com/LunarG/gfxreconstruct/pull/1049) makes the ordering difference go away, the source of it may have been the mechanism I describe above. That is very much a sledgehammer-to-crack-a-nut workaround, though, and it may radically change profiling results in its own way.
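
For illustration only, one generic way to close the window described above is to hold a single lock across both the forward and the record step. Whether the linked PR works this way is an assumption, not something checked against its code; `serialize`, `api_call`, and the two list-based stand-ins for the driver and the trace are hypothetical names.

```python
import threading

driver_order = []  # stand-in for the order calls reach the driver
trace_order = []   # stand-in for the order calls are written to the trace
serialize = threading.Lock()  # hypothetical global lock, one per process


def api_call(name):
    # Forward and record happen under the same lock, so no other thread
    # can slip a call in between the two steps.
    with serialize:
        driver_order.append(name)  # stand-in for forwarding down the chain
        trace_order.append(name)   # stand-in for the fwrite into the trace


def worker(thread_id, count):
    for i in range(count):
        api_call((thread_id, i))


threads = [
    threading.Thread(target=worker, args=(thread_id, 1000))
    for thread_id in ("upload", "render")
]
for t in threads:
    t.start()
for t in threads:
    t.join()

orders_match = driver_order == trace_order
print("trace matches driver order:", orders_match)
```

The price is that every API call from every thread now queues on one lock, which is why this kind of fix can distort timing on its own.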

Multi-threaded replay should be a more robust solution, since the reordering we are currently liable to is inter-thread, while ordering within each thread is preserved. We should probably prioritise this approach, since I can imagine situations where, for example, a resource upload thread and a render thread have their synchronisation reordered intermittently, leading to flat-out bugs. Edit: Of course, multi-threaded replay has its own challenges, with the threads racing and the order in which calls reach the driver ending up radically changed from capture time.
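
The inter-thread versus intra-thread distinction can be checked mechanically. The sketch below assumes a trace is just a list of `(thread, call)` pairs; the call labels and the `per_thread` helper are made up for illustration and are not part of the tool.

```python
from collections import defaultdict


def per_thread(order):
    """Split a global call order into one ordered call list per thread."""
    by_thread = defaultdict(list)
    for thread, call in order:
        by_thread[thread].append(call)
    return dict(by_thread)


# Order in which the (hypothetical) calls reached the driver at capture time.
driver_order = [
    ("render", "submit_0"),
    ("upload", "submit_1"),
    ("render", "present_0"),
]
# Order in which the same calls ended up in the trace.
trace_order = [
    ("upload", "submit_1"),
    ("render", "submit_0"),
    ("render", "present_0"),
]

globally_equal = driver_order == trace_order
per_thread_equal = per_thread(driver_order) == per_thread(trace_order)
print("same global order:    ", globally_equal)
print("same per-thread order:", per_thread_equal)
```

Here the global order differs but each thread's own sequence is intact, which is the property a multi-threaded replayer could rely on.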

Caveat 2: someone should probably check my thinking here. What I wrote is based on eyeballing the code, not on experimenting and proving that the situation described occurs.