LunarG / gfxreconstruct

Graphics API Capture and Replay Tools for Reconstructing Graphics Application Behavior
https://vulkan.lunarg.com/doc/sdk/latest/linux/capture_tools.html
MIT License

output crashing function and parameters #699

Open lunarpapillo opened 2 years ago

lunarpapillo commented 2 years ago

Developers would like to be able to capture an application (sometimes involving layers) that provokes a driver crash. The capture would be very useful in determining exactly what went wrong in a debugging situation (particularly on Android, which is difficult to debug otherwise).

Right now, gfxreconstruct writes a call to the trace file only after the call returns. As a result, the offending command is lost: the driver crashes before the information can be saved.

Here are a few brainstormed alternatives for supporting this use case:

A. Two-stage capture writing

The capture layer writes the call data to the capture file before the call is processed, with some indicator that the call has not yet been passed on. After the call returns, the capture layer goes back to the file using seek() or equivalent and writes the post-call information into the proper location in the capture file. (If the file format were arranged so that all the pre-call information came first, this could be done without using seek().)

gfxrecon-replay would have to behave correctly when used on a call that has only pre-call information written.
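A minimal sketch of the two-stage idea in (A), for illustration only. The `CallHeader` layout, the `completed` flag, and the function names are hypothetical and are not the real GFXR block format; the sketch also assumes an uncompressed file and a fixed-size record.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// Hypothetical fixed-size record header; NOT the real GFXR block layout.
struct CallHeader {
    uint32_t call_id;    // which API call this record describes
    uint32_t completed;  // 0 = pre-call data only, 1 = post-call data written
    int32_t  result;     // return code, meaningful only when completed == 1
};

// Stage 1: write the record before the call is passed on; returns its offset.
long WritePreCall(std::FILE* file, uint32_t call_id) {
    long offset = std::ftell(file);
    CallHeader header{call_id, 0, 0};
    std::fwrite(&header, sizeof(header), 1, file);
    std::fflush(file);  // hand the bytes to the OS so they outlive a process crash
    return offset;
}

// Stage 2: after the call returns, seek back, patch the record, return to the end.
void WritePostCall(std::FILE* file, long offset, uint32_t call_id, int32_t result) {
    CallHeader header{call_id, 1, result};
    std::fseek(file, offset, SEEK_SET);
    std::fwrite(&header, sizeof(header), 1, file);
    std::fseek(file, 0, SEEK_END);
    std::fflush(file);
}

// Reader side: count records that never received their post-call data.
int CountIncomplete(std::FILE* file) {
    std::rewind(file);
    CallHeader header;
    int incomplete = 0;
    while (std::fread(&header, sizeof(header), 1, file) == 1) {
        if (header.completed == 0) ++incomplete;
    }
    return incomplete;
}
```

A replay tool following this scheme would treat a record with `completed == 0` as the call that was in flight when the process died.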

B. Trap exceptions during capture

The capture layer could trap exceptions and other crashes, and output information, in some format, about the last call to be captured.

:heavy_minus_sign: difficult to detect and react to all crashes on all supported OSes

:heavy_minus_sign: crashing command does not go into the capture, so no way to collect this crash as a test case
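As a rough illustration of (B) on POSIX systems only, the sketch below records the name of the call in flight and prints it from a signal handler. `NoteCall`, `ReportLastCall`, and `InstallCrashHandler` are hypothetical names, not part of gfxreconstruct, and Windows would need a different mechanism entirely.

```cpp
#include <atomic>
#include <cassert>
#include <signal.h>
#include <string.h>
#include <unistd.h>

// Hypothetical bookkeeping: name of the API call currently in flight.
static std::atomic<const char*> g_last_call{"(none)"};

// Assumed hook, called at the top of every intercepted entry point.
void NoteCall(const char* name) { g_last_call.store(name); }

// Length helper so the handler does not rely on library string routines.
static size_t SafeLength(const char* text) {
    size_t n = 0;
    while (text[n] != '\0') ++n;
    return n;
}

static void Emit(int fd, const char* text) {
    ssize_t ignored = write(fd, text, SafeLength(text));
    (void)ignored;  // nothing useful can be done about a failed write here
}

// Emits "crashed in: <name>\n" using only write(), which is async-signal-safe.
void ReportLastCall(int fd) {
    Emit(fd, "crashed in: ");
    Emit(fd, g_last_call.load());
    Emit(fd, "\n");
}

extern "C" void CrashHandler(int sig) {
    ReportLastCall(STDERR_FILENO);
    // Restore the default action and re-raise so the process still dies
    // with the original signal (and produces a core dump if enabled).
    signal(sig, SIG_DFL);
    raise(sig);
}

void InstallCrashHandler() {
    struct sigaction action;
    memset(&action, 0, sizeof(action));
    action.sa_handler = CrashHandler;
    sigemptyset(&action.sa_mask);
    sigaction(SIGSEGV, &action, nullptr);
    sigaction(SIGBUS, &action, nullptr);
}
```

This only reports the call name; as the second drawback above notes, nothing here puts the crashing command into the capture file.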

C. As (B), but write the crashing command to the capture file after the crash

bradgrantham-lunarg commented 2 years ago

My initial thought about A is that if the output capture uses compression, then "seek" and "write" become "seek" and "rewrite the block". That doesn't make this impossible, just not as simple as saving the ftell position and then, after success, doing fseek, fwrite, and an fseek back to the end.

I'd like something like C, where we cause the crash to continue in the encoder function somehow, encoding a failure result code, and then terminate operation as soon as the ApiCall block is encoded and written. I'm not sure how to do that without setjmp/longjmp, though; throwing from a signal handler appears to be undefined behavior.
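For what it's worth, here is a hedged sketch of how the setjmp/longjmp route might look on POSIX, using `sigsetjmp`/`siglongjmp` so the signal mask is restored on the way out. `GuardedCall`, `CallOutcome`, and `InstallSegvHandler` are hypothetical names. Jumping out of a real driver fault leaves the driver in an unknown state and skips any C++ destructors in the frames being abandoned, so the only safe follow-up is to encode the failure and terminate.

```cpp
#include <cassert>
#include <setjmp.h>
#include <signal.h>
#include <string.h>

static sigjmp_buf g_recovery_point;
static volatile sig_atomic_t g_guard_active = 0;

extern "C" void SegvHandler(int sig) {
    if (g_guard_active) {
        // Unwind back into GuardedCall; sigsetjmp there returns 1.
        siglongjmp(g_recovery_point, 1);
    }
    // Fault outside a guarded call: fall back to the default action.
    signal(sig, SIG_DFL);
    raise(sig);
}

void InstallSegvHandler() {
    struct sigaction action;
    memset(&action, 0, sizeof(action));
    action.sa_handler = SegvHandler;
    sigemptyset(&action.sa_mask);
    sigaction(SIGSEGV, &action, nullptr);
}

enum class CallOutcome { kReturned, kCrashed };

// Runs `call` and reports whether it returned normally or faulted.
// On kCrashed the caller would encode a failure result and then exit.
CallOutcome GuardedCall(void (*call)()) {
    if (sigsetjmp(g_recovery_point, 1) != 0) {
        g_guard_active = 0;
        return CallOutcome::kCrashed;
    }
    g_guard_active = 1;
    call();
    g_guard_active = 0;
    return CallOutcome::kReturned;
}
```

In an encoder function, the `kCrashed` branch is where the ApiCall block with a failure code would be written before terminating.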

I think the fundamental issue here is that the intent of GFXR, as I understand it, is capturing a successful API stream and then replaying that stream for regression testing, bringup, maybe performance testing, and maybe debugging. This issue is somewhat outside the original scope, although I agree it would be nice to have something useful happen here if its impact is low enough.

TonyBarbour commented 2 years ago

I would just reiterate the value of GFXR as a debugging tool. We often ask for GFXR reproductions of VVL bugs or crashes, and I've sent traces to driver developers to reproduce driver bugs. Anything that allows the trace to be used as a test case is a plus from my POV.

bradgrantham-lunarg commented 2 years ago

I agree it would be a nice thing to have. I'm thinking it may be possible to save off some kind of "last-chance" call block at the beginning of each interception function, so that a segfault handler could write out that block with the compression type for the capture. But that would be a performance hit, so I would think we'd make it conditional on a capture environment variable option. Would that work?
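To make the "last-chance" idea concrete, here is a small POSIX-only sketch under several assumptions: the environment variable name `EXAMPLE_CAPTURE_LAST_CHANCE` is made up for illustration and is not a real gfxreconstruct option, the stashed block is written uncompressed, and `StashCall`/`FlushLastChance` are hypothetical helpers.

```cpp
#include <cassert>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

// Preallocated so that a crash handler never has to allocate memory.
static unsigned char g_last_chance[4096];
static size_t g_last_chance_size = 0;

// Opt-in switch; the variable name is hypothetical.
bool LastChanceEnabled() {
    const char* value = getenv("EXAMPLE_CAPTURE_LAST_CHANCE");
    return value != nullptr && strcmp(value, "1") == 0;
}

// Called at the top of each interception function with the encoded call.
// The per-call memcpy is the performance cost mentioned above.
void StashCall(const void* encoded, size_t size) {
    if (size > sizeof(g_last_chance)) size = sizeof(g_last_chance);  // truncate
    memcpy(g_last_chance, encoded, size);
    g_last_chance_size = size;
}

// Called from a crash handler; loops on write() to handle short writes.
bool FlushLastChance(int fd) {
    size_t done = 0;
    while (done < g_last_chance_size) {
        ssize_t n = write(fd, g_last_chance + done, g_last_chance_size - done);
        if (n <= 0) return false;
        done += static_cast<size_t>(n);
    }
    return true;
}
```

An interception function would call `StashCall` only when `LastChanceEnabled()` returned true at startup, so captures that do not opt in pay nothing beyond one branch.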

To my knowledge, which is admittedly thin, GFXR hasn't ever saved off an API call block for a crashed command, so presumably the captures we get and the captures you send weren't crashes in drivers? Or were the circumstances diagnosable from the calls before the crash?

lunarpapillo commented 2 years ago

@bradgrantham-lunarg said:

> I think the fundamental issue here is that the intent of GFXR, as I understand it, is capturing a successful API stream and then replaying that stream for regression testing, bringup, maybe performance testing, and maybe debugging. This issue is somewhat outside the original scope, although I agree it would be nice to have something useful happen here if its impact is low enough.

I think I agree that such use cases are outside of the original scope (and maybe architecture) of gfxreconstruct... but I also agree with @Tony-LunarG :

> I would just reiterate the value of GFXR as a debugging tool. We often ask for GFXR reproductions of VVL bugs or crashes, and I've sent traces to driver developers to reproduce driver bugs. Anything that allows the trace to be used as a test case is a plus from my POV.

If gfxreconstruct can capture and trim a crashing frame, it would become the easiest way to isolate driver and layer crashes into actionable data. It would be an invaluable tool for all low-level Vulkan developers, the one must-have accessory in the developer toolbox...

But I also agree that it's hard. As excited as I am by the possibility of raising the utility of this project for one subset of Vulkan users, I understand the ROI may not justify developing it, especially if major architectural changes were involved...

lunarpapillo commented 2 years ago

> My initial thought about A is that if the output capture uses compression, then "seek" and "write" becomes "seek" and "rewrite the block".

Hmmm... I'm no expert, but I thought compression blocks typically collected many commands, and compressed the whole block after the fact and wrote to disk... if that's true, altering the still-in-memory block wouldn't be all that difficult... is my naivete showing?

(Of course, reacting to a crash in a useful and cross-platform way is still difficult.)
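If that guess about buffering holds, patching a record while it is still in memory could look roughly like the sketch below. `BlockBuffer` and `Record` are hypothetical and do not reflect how gfxreconstruct actually batches or compresses its output; `Seal()` simply marks the point where a compressor would run.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical record layout, for illustration only.
struct Record {
    uint32_t call_id;
    uint32_t completed;  // 0 until the call has returned
    int32_t  result;
};

class BlockBuffer {
public:
    // Appends a pre-call record and returns its offset within the block.
    std::size_t BeginCall(uint32_t call_id) {
        Record record{call_id, 0, 0};
        std::size_t offset = bytes_.size();
        bytes_.resize(offset + sizeof(record));
        std::memcpy(bytes_.data() + offset, &record, sizeof(record));
        return offset;
    }

    // Patches the record in place; no file seek is involved.
    void EndCall(std::size_t offset, int32_t result) {
        Record record;
        std::memcpy(&record, bytes_.data() + offset, sizeof(record));
        record.completed = 1;
        record.result = result;
        std::memcpy(bytes_.data() + offset, &record, sizeof(record));
    }

    // Hands over the finished bytes (where compression would happen)
    // and leaves the buffer empty for the next block.
    std::vector<uint8_t> Seal() {
        std::vector<uint8_t> out;
        out.swap(bytes_);
        return out;
    }

private:
    std::vector<uint8_t> bytes_;
};
```

One caveat with this arrangement: anything still sitting in the in-memory block is lost if the process dies, so a crash handler would also have to flush the unsealed block for the pending record to reach disk.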