LunarG / gfxreconstruct

Graphics API Capture and Replay Tools for Reconstructing Graphics Application Behavior
https://vulkan.lunarg.com/doc/sdk/latest/linux/capture_tools.html
MIT License
396 stars 108 forks source link

Unable to capture Unity app on Meta Quest when using page_guard #1143

Open williamih opened 1 year ago

williamih commented 1 year ago

Describe the capture bug:

When trying to capture a trace of a Vulkan-based Unity app on Meta Quest Pro with the memory tracking mode set to page_guard, the app gets stuck opening. After selecting the app, the OS app loading screen showing three dots is displayed, and the device stays stuck on this screen and does not progress further.

When the gfxreconstruct layer is not installed, or when the memory tracking mode is assisted, the app works fine.

Submitting this after a discussion on the Vulkan discord server. I think this was your suggestion @bradgrantham-lunarg - thanks!

Build Environment:

Prebuilt gfxreconstruct binaries from https://github.com/LunarG/gfxreconstruct/releases/tag/sdk-1.3.250.0

Capture Settings:

"debug.gfxrecon.memory_tracking_mode" = "page_guard" "debug.gfxrecon.page_guard_persistent_memory" = "true" "debug.gfxrecon.page_guard_align_buffer_sizes" = "true" "debug.gfxrecon.page_guard_unblock_sigsegv" = "true" "debug.gfxrecon.page_guard_signal_handler_watcher" = "true"

"debug.gfxrecon.capture_file" is set to a file in "/sdcard/Download".

To Reproduce:

This was first observed on an internal build of an app I work on -- however, the issue happens straight after opening the app, so I expect this issue would happen with any Vulkan-based Unity app using the same driver, Unity version, etc.

Unity settings:

The layer was installed using the following method:

  1. adb push libVkLayer_gfxreconstruct.so /data/local/tmp
  2. adb shell run-as com.company.app cp /data/local/tmp/libVkLayer_gfxreconstruct.so .
  3. adb shell settings put global enable_gpu_debug_layers 1
  4. adb shell settings put global gpu_debug_app com.company.app
  5. adb shell settings put global gpu_debug_layers "VK_LAYER_LUNARG_gfxreconstruct" (where com.company.app is replaced with the APK identifier of the app)

System environment:

Additional details:

From the logs, it looks like SIGSEGV is being generated when internal Unity code i.e. vk::ImageManager::UploadThreaded() accesses Vulkan memory and is being treated by the OS as a crash.

gfxreconstruct_log_excerpt.txt

Thanks!

bradgrantham-lunarg commented 1 year ago

Hi, @williamih ! Can you upload the logcat with the gfxrecon logs in them? That's what I was hoping to see but there don't seem to be any in there.

bradgrantham-lunarg commented 1 year ago

It may be that the lack of gfxrecon lines in the logcat means that gfxrecon isn't being loaded. Does a trace file get created anywhere?

williamih commented 1 year ago

Thanks @bradgrantham-lunarg - sorry, was just trying to save time parsing through a lot of logs. There are gfxrecon lines, but (at least to me) I'm not seeing anything obviously wrong.

gfxreconstruct_logs_quest.txt

williamih commented 1 year ago

I can also confirm that a trace file is being created, although I'd think it doesn't contain much, since the app doesn't finish starting up

williamih commented 1 year ago

A few further findings:

  1. The unassisted memory tracking option works correctly on Quest; however, the performance impact is large and unfortunately is too slow to use the app normally.
  2. I tried to work around (1) by using debug.gfxrecon.capture_android_trigger and only setting this to true once I've moved to the location in the app that I want to capture. However, it seems this option is not responding at runtime -- when I enable this while the app is already running, the gfxrecon trace does not capture anything or write anything to disk.
  3. I also tried setting debug.gfxrecon.capture_frames to a frame range so that the trace only starts once I'm loaded into the app, but when I set this option, nothing is traced or saved to disk.

If I use unassisted and don't set debug.gfxrecon.capture_frames, and either leave debug.gfxrecon.capture_android_trigger as unset or set it to true, the trace works and saves to disk, but the app performance is so slow that I can't get to the place in the app that I want to capture.

panos-lunarg commented 1 year ago

Hi @williamih

I think this is clearly a case of our signal handler being replaced. The backtrace suggests that the segmentation faults takes place during a texture uploading (writing to mapped memory) and also unassisted method is not crashing.

The absence of the message PageGuardManager: Signal handler has been removed. Re-installing. from the logcat suggests that the watcher thread did not get a chance to detect the substitution before the crash happened

Nevertheless can you please try raising debug.gfxrecon.page_guard_signal_handler_watcher_max_restores from the default 1 to something like 2 or 3 or maybe even to something more aggressive like -1 which will effectively keep the watcher thread running? Maybe you can get the chance to get it to work

williamih commented 1 year ago

Hi @panos-lunarg

Thanks for looking into this - I'd missed that setting when looking at the docs originally, so it's good to know.

Oddly, when I tried to reproduce the issue today (with the same settings as last time), our app has been working correctly even using gfxrecon with page_guard. The issue of getting stuck on the app opening screen isn't happening. I'm not sure what's changed here.

However, there are still some further problems.

Regarding debug.gfxrecon.page_guard_signal_handler_watcher_max_restores, here's what I've found:

  1. Even after setting this to -1, I still don't see PageGuardManager: Signal handler has been removed. Re-installing. anywhere in the logs.
  2. Similarly, the error messages about a crash due to SIGSEGV are still being printed even after setting that option.

Although the app is running correctly with page_guard enabled now, it seems the trace is not being written out correctly. There are two .gfxr files being generated, one of which is 0 bytes and the other is only 4 KB, even after staying in the app for several minutes. Do you know why this might be?

Thanks!

panos-lunarg commented 1 year ago

That's odd. Can you please attach an unabridged logcat of a "successful" run?

williamih commented 1 year ago

Thanks @panos-lunarg

Logs here: logs-gfxrecon-infinite-restores.txt

panos-lunarg commented 1 year ago

Right, can't see anything new. I can't really explain how the application is not complete crashing but I still think this is because of our signal handler being replaced and the watcher thread does not get a chance to notice it.

I gave it a quick try to raise watcher's thread priority (or lowering parent's thread priority) but it looks like it's not that simple to do

Another user faced a similar issue in the past (#990). IIRC he got it working by doing a hack using https://github.com/iqiyi/xHook If you feel up to it you can give it a try.

williamih commented 1 year ago

That xHook library looks interesting - thanks for the suggestion @panos-lunarg.

I have a couple of other thoughts/questions:

  1. It would be fairly easy to modify our Vulkan code to use vkFlushMappedMemoryRanges() so that we could then use the assisted memory tracking mode. However, part of the app's rendering (mostly UI elements, etc) is handled by Unity, and we can't modify Unity's Vulkan code, so some modifications to memory mapped by Unity would be missed by gfxreconstruct. Could this cause corruption of the generated trace file or other errors, if the Unity-related writes were missed?

  2. As mentioned a bit earlier - I couldn't get the debug.gfxrecon.capture_android_trigger and debug.gfxrecon.capture_frames settings to work. Do you have any thoughts on how to investigate this?

panos-lunarg commented 1 year ago
  1. It would be fairly easy to modify our Vulkan code to use vkFlushMappedMemoryRanges() so that we could then use the assisted memory tracking mode. However, part of the app's rendering (mostly UI elements, etc) is handled by Unity, and we can't modify Unity's Vulkan code, so some modifications to memory mapped by Unity would be missed by gfxreconstruct. Could this cause corruption of the generated trace file or other errors, if the Unity-related writes were missed?

Yes this will cause corruption. Imagine you are uploading a texture but gfxr fails to notice that, so the texture's content are not dumped into the capture file. When you replay the file, gfxr will not update the device memory so the texture will be filled with uninitialized data.

  1. As mentioned a bit earlier - I couldn't get the debug.gfxrecon.capture_android_trigger and debug.gfxrecon.capture_frames settings to work. Do you have any thoughts on how to investigate this?

IIRC in order to use debug.gfxrecon.capture_android_trigger you have to set it to false when you start capturing and then set it to true to trigger trimming (then false again to stop). If it isn't set at all at the beginning then you'll be capturing as normal (non trimmed). I prepared a patch that should now print to logcat when this changes values (#1147) It's strange that debug.gfxrecon.capture_frames isn't doing anything either. Can you see anything in logcat? Gfxr should print something there when capturing trimmed even before you enter the trim range.

williamih commented 1 year ago

Thanks for the info - makes sense. With debug.gfxrecon.capture_android_trigger, I've been setting it to false before starting the app, then setting it to true once the app is running. However, nothing is captured. I can see from the logs that gfxr is detecting the false value when the app starts up, but it's not logging detection of the later modification to true. (Although maybe it's not meant to log that, since you created a patch? But it's definitely logging the initial false value, and not logging the change to true, and also not writing out capture data).

I can try both options again tomorrow to double check.

panos-lunarg commented 1 year ago

BTW doing a trimmed capture is not going to make the crash go away. page_guard mechanism needs to keep track of mapped device memory while doing trimmed captures, even outside the frame-range / trigger-regions.

panos-lunarg commented 1 year ago

I started experimenting with integrating xHook sources in GFXR in patch #1150 I couldn't get it to work on desktop but it did work on Android with Unreal (Fortnite) where we observe the same behavior. Please give it a try and let me know it it works for you too. Basically apply #1150 and do not use the signal watcher option (set debug.gfxrecon.page_guard_signal_handler_watcher to 0)

williamih commented 1 year ago

Thanks @panos-lunarg - I'll try the xHook patch out!

I was thinking of using a trimmed capture with unassisted memory tracking as I thought that might reduce some of the performance hit until I get to the part of the app that I want to trace. But based on your explanation of how page_guard works, I'd assume unassisted also still does all the tracking even with trimmed captures, and so there probably wouldn't be any performance benefit there?

panos-lunarg commented 1 year ago

Ah right. I would expect unassisted to be much faster outside the trim regions as it will not be dumping the mapped memory to the capture file. So it should be worth a try if for example you only want to capture a few frames from a specific scene and you don't care too much about precision.

williamih commented 1 year ago

Makes sense. I still can't get the debug.gfxrecon.capture_frames setting working unfortunately. When I set this option, nothing seems to be captured (no files written). Any thoughts on this?

williamih commented 1 year ago

Also - to test with the xHook patch (https://github.com/LunarG/gfxreconstruct/pull/1150), is there a way to get the Android binary from the CI build, or should I just build from source locally?

bradgrantham-lunarg commented 1 year ago

Makes sense. I still can't get the debug.gfxrecon.capture_frames setting working unfortunately. When I set this option, nothing seems to be captured (no files written). Any thoughts on this?

Does it work if you set to "1-100"? What are you setting it to now?

panos-lunarg commented 1 year ago

I was about to suggest the same as @bradgrantham-lunarg: What are you setting debug.gfxrecon.capture_frames to? If you request a very high frame number, depending on the application's frame rate, it can take a considerable time to reach it, and until it is reached GFXR will not create a capture file. Try something low to make sure that it works

Choosing the correct frame range can be a tough guessing game so I would suggest trying debug.gfxrecon.capture_android_trigger as you have a better control of when to start capturing. I just merged #1147 which should make visible in logcat whencapture_android_trigger changes values.

Regarding #1150, yes it would require you to build GFXR locally. I don't think the CI builds are visible externally so If you are not comfortable with building it locally I can upload it for you here.

williamih commented 1 year ago

Oddly, when I run adb shell "setprop debug.gfxrecon.capture_frames '1-100'" (following your idea @bradgrantham-lunarg) it works fine, i.e. saves a trace file.

But when I run adb shell "setprop debug.gfxrecon.capture_frames '2-100'" nothing gets traced or saved to a file. Logs here for the 2-100 frame range which isn't working:

logs-2-100.txt

williamih commented 1 year ago

I'll try building myself as that's probably useful to be able to do, but can let you know if I run into any issues - thanks.

panos-lunarg commented 1 year ago

Interesting. Seems like we are missing the starting frame. Drafted a PR (#1153) to address it

bradgrantham-lunarg commented 1 year ago

Oddly, when I run adb shell "setprop debug.gfxrecon.capture_frames '1-100'" (following your idea @bradgrantham-lunarg) it works fine, i.e. saves a trace file.

Your logs indicate that the "2-100" range was detected, but I don't understand how the logic in the code could miss the beginning of the range. I'd have to suspect a strange Android behavior for strings or some such or maybe even a memory stomp.

Sorry to keep bombarding you with requests for logs. Could you send the logs for the successful run for 1-100?

williamih commented 1 year ago

@bradgrantham-lunarg Thanks, logs here for 1-100: logs-1-100.txt

bradgrantham-lunarg commented 1 year ago

I think that because you're running an XR app on Quest, you're never getting a QueuePresentKHR, so we're never incrementing our frame counter. Thus we never enter frame 2. @davidd-lunarg , is this app a good candidate for testing out https://github.com/LunarG/gfxreconstruct/pull/1138 ?

davidd-lunarg commented 1 year ago

is this app a good candidate for testing out https://github.com/LunarG/gfxreconstruct/pull/1138 ?

Yes, this may be a good app to test with #1138. If the VR app is not using vkQueuePresentKHR then GFXR will not detect frame boundaries and the entire capture will treated as frame 1 by GFXR. Note that to test #1138 the android property debug.vr.profiler must be set to 1 for capture.

As to why trim frames 1-100 produces a capture file but 2-100 does not, this may be due to the fact that trim ranges starting at frame 1 are treated more like full captures--there is no need to track and write a trim state block. However if vkQueuePresentKHR is not being called I would guess the capture does not stop at frame 100. Instead it will capture until the app is closed.

williamih commented 1 year ago

Sorry for the delayed reply - not sure if/when I'll get a chance to test out those PRs as there's a few other work priorities I need to focus on right now. Thanks for the help investigating - this does seem like a really useful tool so hopefully I can come back to this sometime and get it going!