Open williamih opened 1 year ago
Hi, @williamih ! Can you upload the logcat with the gfxrecon logs in them? That's what I was hoping to see but there don't seem to be any in there.
It may be that the lack of gfxrecon
lines in the logcat means that gfxrecon isn't being loaded. Does a trace file get created anywhere?
Thanks @bradgrantham-lunarg - sorry, was just trying to save time parsing through a lot of logs. There are gfxrecon
lines, but (at least to me) I'm not seeing anything obviously wrong.
I can also confirm that a trace file is being created, although I'd think it doesn't contain much, since the app doesn't finish starting up
A few further findings:
unassisted
memory tracking option works correctly on Quest; however, the performance impact is large and unfortunately is too slow to use the app normally.debug.gfxrecon.capture_android_trigger
and only setting this to true once I've moved to the location in the app that I want to capture. However, it seems this option is not responding at runtime -- when I enable this while the app is already running, the gfxrecon trace does not capture anything or write anything to disk.debug.gfxrecon.capture_frames
to a frame range so that the trace only starts once I'm loaded into the app, but when I set this option, nothing is traced or saved to disk.If I use unassisted
and don't set debug.gfxrecon.capture_frames
, and either leave debug.gfxrecon.capture_android_trigger
as unset or set it to true, the trace works and saves to disk, but the app performance is so slow that I can't get to the place in the app that I want to capture.
Hi @williamih
I think this is clearly a case of our signal handler being replaced. The backtrace suggests that the segmentation faults takes place during a texture uploading (writing to mapped memory) and also unassisted
method is not crashing.
The absence of the message PageGuardManager: Signal handler has been removed. Re-installing.
from the logcat suggests that the watcher thread did not get a chance to detect the substitution before the crash happened
Nevertheless can you please try raising debug.gfxrecon.page_guard_signal_handler_watcher_max_restores
from the default 1 to something like 2 or 3 or maybe even to something more aggressive like -1 which will effectively keep the watcher thread running? Maybe you can get the chance to get it to work
Hi @panos-lunarg
Thanks for looking into this - I'd missed that setting when looking at the docs originally, so it's good to know.
Oddly, when I tried to reproduce the issue today (with the same settings as last time), our app has been working correctly even using gfxrecon
with page_guard
. The issue of getting stuck on the app opening screen isn't happening. I'm not sure what's changed here.
However, there are still some further problems.
Regarding debug.gfxrecon.page_guard_signal_handler_watcher_max_restores
, here's what I've found:
PageGuardManager: Signal handler has been removed. Re-installing.
anywhere in the logs.Although the app is running correctly with page_guard
enabled now, it seems the trace is not being written out correctly. There are two .gfxr
files being generated, one of which is 0 bytes and the other is only 4 KB, even after staying in the app for several minutes. Do you know why this might be?
Thanks!
That's odd. Can you please attach an unabridged logcat of a "successful" run?
Thanks @panos-lunarg
Logs here: logs-gfxrecon-infinite-restores.txt
Right, can't see anything new. I can't really explain how the application is not complete crashing but I still think this is because of our signal handler being replaced and the watcher thread does not get a chance to notice it.
I gave it a quick try to raise watcher's thread priority (or lowering parent's thread priority) but it looks like it's not that simple to do
Another user faced a similar issue in the past (#990). IIRC he got it working by doing a hack using https://github.com/iqiyi/xHook If you feel up to it you can give it a try.
That xHook library looks interesting - thanks for the suggestion @panos-lunarg.
I have a couple of other thoughts/questions:
It would be fairly easy to modify our Vulkan code to use vkFlushMappedMemoryRanges()
so that we could then use the assisted
memory tracking mode. However, part of the app's rendering (mostly UI elements, etc) is handled by Unity, and we can't modify Unity's Vulkan code, so some modifications to memory mapped by Unity would be missed by gfxreconstruct
. Could this cause corruption of the generated trace file or other errors, if the Unity-related writes were missed?
As mentioned a bit earlier - I couldn't get the debug.gfxrecon.capture_android_trigger
and debug.gfxrecon.capture_frames
settings to work. Do you have any thoughts on how to investigate this?
- It would be fairly easy to modify our Vulkan code to use vkFlushMappedMemoryRanges() so that we could then use the assisted memory tracking mode. However, part of the app's rendering (mostly UI elements, etc) is handled by Unity, and we can't modify Unity's Vulkan code, so some modifications to memory mapped by Unity would be missed by gfxreconstruct. Could this cause corruption of the generated trace file or other errors, if the Unity-related writes were missed?
Yes this will cause corruption. Imagine you are uploading a texture but gfxr fails to notice that, so the texture's content are not dumped into the capture file. When you replay the file, gfxr will not update the device memory so the texture will be filled with uninitialized data.
- As mentioned a bit earlier - I couldn't get the debug.gfxrecon.capture_android_trigger and debug.gfxrecon.capture_frames settings to work. Do you have any thoughts on how to investigate this?
IIRC in order to use debug.gfxrecon.capture_android_trigger
you have to set it to false when you start capturing and then set it to true
to trigger trimming (then false
again to stop). If it isn't set at all at the beginning then you'll be capturing as normal (non trimmed). I prepared a patch that should now print to logcat when this changes values (#1147)
It's strange that debug.gfxrecon.capture_frames
isn't doing anything either. Can you see anything in logcat? Gfxr should print something there when capturing trimmed even before you enter the trim range.
Thanks for the info - makes sense. With debug.gfxrecon.capture_android_trigger
, I've been setting it to false
before starting the app, then setting it to true
once the app is running. However, nothing is captured. I can see from the logs that gfxr is detecting the false
value when the app starts up, but it's not logging detection of the later modification to true
. (Although maybe it's not meant to log that, since you created a patch? But it's definitely logging the initial false value, and not logging the change to true, and also not writing out capture data).
I can try both options again tomorrow to double check.
BTW doing a trimmed capture is not going to make the crash go away. page_guard
mechanism needs to keep track of mapped device memory while doing trimmed captures, even outside the frame-range / trigger-regions.
I started experimenting with integrating xHook sources in GFXR in patch #1150
I couldn't get it to work on desktop but it did work on Android with Unreal (Fortnite) where we observe the same behavior.
Please give it a try and let me know it it works for you too. Basically apply #1150 and do not use the signal watcher option (set debug.gfxrecon.page_guard_signal_handler_watcher
to 0
)
Thanks @panos-lunarg - I'll try the xHook patch out!
I was thinking of using a trimmed capture with unassisted
memory tracking as I thought that might reduce some of the performance hit until I get to the part of the app that I want to trace. But based on your explanation of how page_guard
works, I'd assume unassisted
also still does all the tracking even with trimmed captures, and so there probably wouldn't be any performance benefit there?
Ah right. I would expect unassisted
to be much faster outside the trim regions as it will not be dumping the mapped memory to the capture file. So it should be worth a try if for example you only want to capture a few frames from a specific scene and you don't care too much about precision.
Makes sense. I still can't get the debug.gfxrecon.capture_frames
setting working unfortunately. When I set this option, nothing seems to be captured (no files written). Any thoughts on this?
Also - to test with the xHook patch (https://github.com/LunarG/gfxreconstruct/pull/1150), is there a way to get the Android binary from the CI build, or should I just build from source locally?
Makes sense. I still can't get the
debug.gfxrecon.capture_frames
setting working unfortunately. When I set this option, nothing seems to be captured (no files written). Any thoughts on this?
Does it work if you set to "1-100"? What are you setting it to now?
I was about to suggest the same as @bradgrantham-lunarg: What are you setting debug.gfxrecon.capture_frames
to? If you request a very high frame number, depending on the application's frame rate, it can take a considerable time to reach it, and until it is reached GFXR will not create a capture file. Try something low to make sure that it works
Choosing the correct frame range can be a tough guessing game so I would suggest trying debug.gfxrecon.capture_android_trigger
as you have a better control of when to start capturing. I just merged #1147 which should make visible in logcat whencapture_android_trigger
changes values.
Regarding #1150, yes it would require you to build GFXR locally. I don't think the CI builds are visible externally so If you are not comfortable with building it locally I can upload it for you here.
Oddly, when I run adb shell "setprop debug.gfxrecon.capture_frames '1-100'"
(following your idea @bradgrantham-lunarg) it works fine, i.e. saves a trace file.
But when I run adb shell "setprop debug.gfxrecon.capture_frames '2-100'"
nothing gets traced or saved to a file. Logs here for the 2-100 frame range which isn't working:
I'll try building myself as that's probably useful to be able to do, but can let you know if I run into any issues - thanks.
Interesting. Seems like we are missing the starting frame. Drafted a PR (#1153) to address it
Oddly, when I run
adb shell "setprop debug.gfxrecon.capture_frames '1-100'"
(following your idea @bradgrantham-lunarg) it works fine, i.e. saves a trace file.
Your logs indicate that the "2-100" range was detected, but I don't understand how the logic in the code could miss the beginning of the range. I'd have to suspect a strange Android behavior for strings or some such or maybe even a memory stomp.
Sorry to keep bombarding you with requests for logs. Could you send the logs for the successful run for 1-100?
@bradgrantham-lunarg Thanks, logs here for 1-100: logs-1-100.txt
I think that because you're running an XR app on Quest, you're never getting a QueuePresentKHR, so we're never incrementing our frame counter. Thus we never enter frame 2. @davidd-lunarg , is this app a good candidate for testing out https://github.com/LunarG/gfxreconstruct/pull/1138 ?
is this app a good candidate for testing out https://github.com/LunarG/gfxreconstruct/pull/1138 ?
Yes, this may be a good app to test with #1138. If the VR app is not using vkQueuePresentKHR then GFXR will not detect frame boundaries and the entire capture will treated as frame 1 by GFXR. Note that to test #1138 the android property debug.vr.profiler
must be set to 1
for capture.
As to why trim frames 1-100 produces a capture file but 2-100 does not, this may be due to the fact that trim ranges starting at frame 1 are treated more like full captures--there is no need to track and write a trim state block. However if vkQueuePresentKHR is not being called I would guess the capture does not stop at frame 100. Instead it will capture until the app is closed.
Sorry for the delayed reply - not sure if/when I'll get a chance to test out those PRs as there's a few other work priorities I need to focus on right now. Thanks for the help investigating - this does seem like a really useful tool so hopefully I can come back to this sometime and get it going!
Describe the capture bug:
When trying to capture a trace of a Vulkan-based Unity app on Meta Quest Pro with the memory tracking mode set to
page_guard
, the app gets stuck opening. After selecting the app, the OS app loading screen showing three dots is displayed, and the device stays stuck on this screen and does not progress further.When the
gfxreconstruct
layer is not installed, or when the memory tracking mode isassisted
, the app works fine.Submitting this after a discussion on the Vulkan discord server. I think this was your suggestion @bradgrantham-lunarg - thanks!
Build Environment:
Prebuilt
gfxreconstruct
binaries from https://github.com/LunarG/gfxreconstruct/releases/tag/sdk-1.3.250.0Capture Settings:
"debug.gfxrecon.memory_tracking_mode" = "page_guard" "debug.gfxrecon.page_guard_persistent_memory" = "true" "debug.gfxrecon.page_guard_align_buffer_sizes" = "true" "debug.gfxrecon.page_guard_unblock_sigsegv" = "true" "debug.gfxrecon.page_guard_signal_handler_watcher" = "true"
"debug.gfxrecon.capture_file" is set to a file in "/sdcard/Download".
To Reproduce:
This was first observed on an internal build of an app I work on -- however, the issue happens straight after opening the app, so I expect this issue would happen with any Vulkan-based Unity app using the same driver, Unity version, etc.
Unity settings:
The layer was installed using the following method:
adb push libVkLayer_gfxreconstruct.so /data/local/tmp
adb shell run-as com.company.app cp /data/local/tmp/libVkLayer_gfxreconstruct.so .
adb shell settings put global enable_gpu_debug_layers 1
adb shell settings put global gpu_debug_app com.company.app
adb shell settings put global gpu_debug_layers "VK_LAYER_LUNARG_gfxreconstruct"
(wherecom.company.app
is replaced with the APK identifier of the app)System environment:
Additional details:
From the logs, it looks like SIGSEGV is being generated when internal Unity code i.e.
vk::ImageManager::UploadThreaded()
accesses Vulkan memory and is being treated by the OS as a crash.gfxreconstruct_log_excerpt.txt
Thanks!