LunarG / gfxreconstruct

Graphics API Capture and Replay Tools for Reconstructing Graphics Application Behavior
https://vulkan.lunarg.com/doc/sdk/latest/linux/capture_tools.html
MIT License

Khronos Vulkan-Samples - buffer_device_address Fails to Capture #741

Closed andrew-lunarg closed 1 year ago

andrew-lunarg commented 2 years ago

This sample produces an incorrect image during capture. This is an example of issue #590 that has no known fix with parameter tweaking.

Without capture the sample looks like this: image

Command lines used (paths would need adjusting for repro):

VK_INSTANCE_LAYERS="VK_LAYER_LUNARG_gfxreconstruct:$VK_INSTANCE_LAYERS" GFXRECON_LOG_LEVEL=debug GFXRECON_LOG_DETAILED=true LD_LIBRARY_PATH="/home/andrew/mesa_debug/lib/x86_64-linux-gnu/:/home/andrew/checkouts/khronos-vulkan-loader/build/debug/install/lib/:$LD_LIBRARY_PATH" VK_LAYER_PATH=/home/andrew/lunarg/checkouts/andrew-lunarg-gfxreconstruct/build/linux/vscode/layer/:"$VK_LAYER_PATH" GFXRECON_CAPTURE_FILE=/home/andrew/temp/samples_temp.gfxr GFXRECON_CAPTURE_FILE_TIMESTAMP=false  GFXRECON_MEMORY_TRACKING_MODE=page_guard GFXRECON_PAGE_GUARD_PERSISTENT_MEMORY=false GFXRECON_PAGE_GUARD_ALIGN_BUFFER_SIZES=false GFXRECON_PAGE_GUARD_COPY_ON_MAP=false  GFXRECON_PAGE_GUARD_SEPARATE_READ=false GFXRECON_PAGE_GUARD_UNBLOCK_SIGSEGV=false build/linux/vscode/app/bin/Debug/x86_64/vulkan_samples sample buffer_device_address  --stop-after-frame 10 --screenshot 10 --screenshot-output samples_screenshot_frame_1

Above commandline gives: image

The results were the same for all 16 permutations of the 4 page_guard flags of interest (ignoring the SIGSEGV one).
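For anyone wanting to repeat the sweep, the 16 permutations can be enumerated mechanically. A minimal sketch, assuming the four flags are the ones named in the command line above; the helper names `EnvPrefixForMask` and `AllEnvPrefixes` are made up for illustration and are not part of gfxreconstruct:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// The four page_guard options toggled in the runs above (names copied from the command line).
const std::array<const char*, 4> kPageGuardFlags = {
    "GFXRECON_PAGE_GUARD_PERSISTENT_MEMORY",
    "GFXRECON_PAGE_GUARD_ALIGN_BUFFER_SIZES",
    "GFXRECON_PAGE_GUARD_COPY_ON_MAP",
    "GFXRECON_PAGE_GUARD_SEPARATE_READ",
};

// Hypothetical helper: builds the environment prefix for one permutation.
// Bit i of `mask` selects true/false for kPageGuardFlags[i].
std::string EnvPrefixForMask(unsigned mask)
{
    assert(mask < (1u << kPageGuardFlags.size()));
    std::string prefix;
    for (std::size_t i = 0; i < kPageGuardFlags.size(); ++i)
    {
        const bool enabled = ((mask >> i) & 1u) != 0;
        prefix += kPageGuardFlags[i];
        prefix += enabled ? "=true " : "=false ";
    }
    return prefix;
}

// Hypothetical helper: returns all 2^4 = 16 prefixes, one per permutation.
std::vector<std::string> AllEnvPrefixes()
{
    std::vector<std::string> prefixes;
    for (unsigned mask = 0; mask < (1u << kPageGuardFlags.size()); ++mask)
    {
        prefixes.push_back(EnvPrefixForMask(mask));
    }
    return prefixes;
}
```

Each returned string can be prepended to the rest of the command line (layer setup, capture file, sample invocation) to run one permutation.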

Unassisted presented differently: samples_screenshot_frame_9

Assisted was similar to unassisted: samples_screenshot_frame_9

From the sample's readme:

Buffer device address is a very powerful and unique feature to Vulkan which is not present in any other modern graphics API. The main gist of it is that it exposes GPU virtual addresses directly to the application, and the application can then use said address to access buffer data freely through pointers rather than descriptors. What makes this feature unique is that we can place these addresses in buffers and load and store to them inside shaders, with full capability to perform pointer arithmetic and other fun tricks.

Fun stuff indeed.

When debugging or capturing an application that uses buffer device addresses, there are some special driver requirements that are not universally supported. Essentially, to be able to capture application buffers which contain raw pointers, we must ensure that the device address for a given buffer remains stable when the capture is replayed in a new process. Applications do not have to do anything here, since tools like RenderDoc will enable the bufferDeviceAddressCaptureReplay feature for you, and deal with all the magic associated with address capture behind the scenes. If the bufferDeviceAddressCaptureReplay is not present however, tools like RenderDoc will mask out the bufferDeviceAddress feature, so beware.

Sounds like we could have work to do. Interesting that we fail at capture, when replay should have been the issue.
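To make the readme's point about address stability concrete, here is a toy model (plain integers standing in for GPU virtual addresses, no Vulkan calls). The function names `StorePointers` and `PointersValidAtReplay` are invented for this sketch. It only illustrates why pointers stored inside a captured buffer stop being meaningful if the allocation gets a different base address at replay:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy model: a buffer whose contents are raw addresses into another allocation.
// Plain integers stand in for GPU virtual addresses; no Vulkan calls are made.
std::vector<std::uint64_t> StorePointers(std::uint64_t base_address,
                                         const std::vector<std::uint64_t>& offsets)
{
    std::vector<std::uint64_t> stored;
    for (std::uint64_t offset : offsets)
    {
        stored.push_back(base_address + offset);
    }
    return stored;
}

// Returns true if every stored address lands inside the allocation
// [replay_base, replay_base + size) that exists at replay time.
bool PointersValidAtReplay(const std::vector<std::uint64_t>& stored,
                           std::uint64_t replay_base,
                           std::uint64_t size)
{
    assert(size > 0);
    for (std::uint64_t address : stored)
    {
        if (address < replay_base || address - replay_base >= size)
        {
            return false;
        }
    }
    return true;
}
```

With pointers recorded against base 0x10000, the check passes when replay reuses 0x10000 and fails when replay lands at 0x20000, which is the situation the bufferDeviceAddressCaptureReplay feature exists to avoid.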

Environment

andrew-lunarg commented 2 years ago

Error is intermittent. Is this a race? First frame from two identical assisted capture runs (the ImGui overlay does not appear in the first frame): samples_buffer_device_address_frame_1 assisted samples_buffer_device_address_frame_1_02 assisted

page_guard tracking mode with the aligned option is also intermittent, but very limited testing suggests it works less often than assisted.

andrew-lunarg commented 2 years ago

Replaying the good first-frame capture made with page_guard also fails intermittently, showing the same visual defects (the replay tool reports it as two frames). The trace: samples_buffer_device_address_frame_1.page_guard_aigned.valid_image.zip

Low probability speculation: I wonder if the issue is a sync / race issue in the original sample that only shows up intermittently when either our capture or replay is running, due to timing differences.

panos-lunarg commented 1 year ago

So far this is what I have observed:

llvmpipe (13.0.1): Captures OK with GFXRECON_PAGE_GUARD_ALIGN_BUFFER_SIZES=true. Otherwise the rectangles disappear. Does not replay due to:

[gfxrecon] ERROR - The captured application used vkGetBufferDeviceAddress, which requires the bufferDeviceAddressCaptureReplay feature for accurate capture and replay. The replay device does not support this feature, so replay may fail.

Integrated Intel GPU: Captures OK with GFXRECON_PAGE_GUARD_ALIGN_BUFFER_SIZES=true. Otherwise the rectangles disappear. Replays OK.

NVIDIA GeForce 1650: Captures OK with or without GFXRECON_PAGE_GUARD_ALIGN_BUFFER_SIZES=true. Does not replay due to:

[gfxrecon] ERROR - The captured application used vkGetBufferDeviceAddress, which requires the bufferDeviceAddressCaptureReplay feature for accurate capture and replay. The replay device does not support this feature, so replay may fail.

The above reproduces consistently for me (no intermittent artifacts).

davidlunarg commented 1 year ago

I duplicated this issue on my Ubuntu 20.04 laptop with a GTX 1050 Ti Mobile GPU. I didn't get exactly the images above, but I got some very corrupt images. I'll investigate this further.

davidlunarg commented 1 year ago

This issue is due to the sample program mapping a memory buffer in order to get a copy of an image buffer from the GPU, but the page guard feature in the capture layer gets in the way of this. If page guard is effectively disabled by setting GFXRECON_MEMORY_TRACKING_MODE to "assisted", the sample program successfully creates correct screenshot image files.

I think the problem is that the sample program generates an image in a memory buffer, to be read back by the CPU to create an image file. However, the capture layer with page guard instead treats accesses to the memory buffer from the app as writes to the buffer, and so it instead copies data from the buffer in system RAM to the buffer in the GPU.

This same problem occurs with a simple Vulkan program like vkcube when both the capture layer and the screenshot layer are enabled. Screenshots generated in this case are not correct.

Interestingly, this problem does not occur on Windows. The page guard implementation is slightly different on Windows, and I'm not sure how those differences result in a correct image file being created. Maybe @panos-lunarg can provide some insight?

panos-lunarg commented 1 year ago

I think the problem is that the sample program generates an image in a memory buffer, to be read back by the CPU to create an image file. However, the capture layer with page guard instead treats accesses to the memory buffer from the app as writes to the buffer, and so it instead copies data from the buffer in system RAM to the buffer in the GPU.

Correct. Currently when page_guard detects a read from a page it automatically marks that page as written as well. The order of things is:

  1. A read is detected with the PROT_NONE + SIGSEGV trick.
  2. At the time of the detection a memory copy from the actual mapped memory into the shadow memory is done in order for the host to see the updated data. The page is marked both as read and written.
  3. At vkQueueSubmit(2)/vkUnmapMemory/vkFlushMappedMemoryRanges it will see that the page is marked as written and it will do another memory copy, this time from the shadow memory into the mapped memory.

Apart from an occasional additional memcpy, I can't think of a way right now that this can cause corruption/artifacts.
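The three steps above can be sketched as a toy model. This is not the gfxreconstruct implementation: `TrackedPage`, `OnReadDetected` and `OnFlush` are hypothetical names, plain vectors stand in for mapped and shadow memory, and the real PROT_NONE + SIGSEGV detection is replaced by an explicit function call. Clearing the flags at flush time is also an assumption of the model.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model of one tracked page; not the gfxreconstruct implementation.
struct TrackedPage
{
    std::vector<std::uint8_t> mapped;   // stands in for the driver-mapped memory
    std::vector<std::uint8_t> shadow;   // copy the application actually touches
    bool                      read    = false;
    bool                      written = false;
    std::size_t               copies  = 0;  // number of whole-page copies performed
};

// Steps 1 and 2: a read was detected, so refresh the shadow copy from the
// mapped memory. When read_write_same_page is true, the page is also marked
// as written.
void OnReadDetected(TrackedPage& page, bool read_write_same_page)
{
    assert(page.mapped.size() == page.shadow.size());
    page.shadow = page.mapped;
    ++page.copies;
    page.read = true;
    if (read_write_same_page)
    {
        page.written = true;
    }
}

// Step 3: at submit/unmap/flush time, pages marked as written are copied
// from shadow memory back into mapped memory. Flags are then reset
// (an assumption of this model).
void OnFlush(TrackedPage& page)
{
    if (page.written)
    {
        page.mapped = page.shadow;
        ++page.copies;
    }
    page.read    = false;
    page.written = false;
}
```

In this model a read followed by a flush performs two copies with read_write_same_page set to true and one copy with it set to false, and the mapped contents end up unchanged either way as long as nothing else modifies the mapped memory between the two steps.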

You can disable the "mark pages that are read as written as well" behavior by changing:

static const bool kDefaultEnableReadWriteSamePage         = true;

from true to false. IIRC this is hard coded and there is no environment variable option to control this.

Interestingly, this problem does not occur on Windows. The page guard implementation is slightly different on Windows, and I'm not sure how those differences result in correct image file being created.

The tracking mechanism is different on Windows but only as far as the underlying mechanism that does the read/write tracking provided by the OS. Once accesses are caught the handling should be the same (the order of actions is the same as both call HandleGuardPageViolation).

I guess you have already tried it but just to make sure: When capturing do you see this message:

[Buffer|Image] bound to device memory at an offset which is not page aligned. Corruption might occur. In that case set Page Guard Align Buffer Sizes env variable to true.

and if so have you tried setting GFXRECON_PAGE_GUARD_ALIGN_BUFFER_SIZES to true?
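For clarity on what "page aligned" means in that warning, a minimal sketch follows. The helper names are hypothetical and the page size is passed in explicitly (4096 bytes is a common value, but it should be queried from the OS). How the layer itself applies alignment when GFXRECON_PAGE_GUARD_ALIGN_BUFFER_SIZES is set is not shown here; rounding a size up to a page multiple is only an assumption suggested by the option name.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: true if page_size is a non-zero power of two.
bool IsPowerOfTwo(std::uint64_t page_size)
{
    return page_size != 0 && (page_size & (page_size - 1)) == 0;
}

// Hypothetical helper: true if `offset` is a multiple of `page_size`.
bool IsPageAligned(std::uint64_t offset, std::uint64_t page_size)
{
    assert(IsPowerOfTwo(page_size));
    return (offset & (page_size - 1)) == 0;
}

// Hypothetical helper: rounds `size` up to the next multiple of `page_size`.
std::uint64_t AlignUpToPage(std::uint64_t size, std::uint64_t page_size)
{
    assert(IsPowerOfTwo(page_size));
    return (size + page_size - 1) & ~(page_size - 1);
}
```

A resource bound at offset 4160 with 4096-byte pages shares a page with whatever precedes it, which is the situation the warning describes.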

davidlunarg commented 1 year ago

@panos-lunarg, thanks for your response. It prompted me to look at all environment variables set when this sample program is run. I have determined that if I use:

VK_LOADER_LAYERS_ENABLE=VK_LAYER_LUNARG_gfxreconstruct GFXRECON_PAGE_GUARD_ALIGN_BUFFER_SIZES=true GFXRECON_PAGE_GUARD_COPY_ON_MAP=false dbuild/linux/app/bin/Debug/x86_64/vulkan_samples sample buffer_device_address --stop-after-frame 10 --screenshot 10 --screenshot-output samples_screenshot_frame_1

the generated PNG file of frame 10 is correct. I think the capture layer was behaving as expected when this bug was originally reported, and that using the environment variables above addresses the bug.

Will close this issue.