Closed andrew-lunarg closed 1 year ago
Error is intermittent. Is this a race?
First frame from two identical assisted capture runs (IMGui overlay does not appear in first frame):
page_guard tracking mode with the aligned option is also intermittent, but very limited testing suggests it works less often than assisted.
The replay of the good first-frame capture made with page_guard also fails intermittently, showing the same visual defects (the replay tool reports it as two frames).
The trace: samples_buffer_device_address_frame_1.page_guard_aigned.valid_image.zip
Low probability speculation: I wonder if the issue is a sync / race issue in the original sample that only shows up intermittently when either our capture or replay is running due to timing difference.
So far this is what I have observed:
llvmpipe (13.0.1):
Captures ok with GFXRECON_PAGE_GUARD_ALIGN_BUFFER_SIZES=true. Otherwise the rectangles disappear.
Does not replay due to
[gfxrecon] ERROR - The captured application used vkGetBufferDeviceAddress, which requires the bufferDeviceAddressCaptureReplay feature for accurate capture and replay. The replay device does not support this feature, so replay may fail.
Integrated Intel GPU:
Captures ok with GFXRECON_PAGE_GUARD_ALIGN_BUFFER_SIZES=true. Otherwise the rectangles disappear.
Replays ok
Nvidia GeForce 1650:
Captures ok with or without GFXRECON_PAGE_GUARD_ALIGN_BUFFER_SIZES=true.
Does not replay due to
[gfxrecon] ERROR - The captured application used vkGetBufferDeviceAddress, which requires the bufferDeviceAddressCaptureReplay feature for accurate capture and replay. The replay device does not support this feature, so replay may fail.
The above reproduces consistently for me (no intermittent artifacts).
I duplicated this issue on my Ubuntu 20.04 laptop with a GTX 1050 Ti Mobile GPU. I didn't get exactly the images above, but I got some very corrupt images. I'll investigate this further.
This issue is due to the sample program mapping a memory buffer in order to get a copy of an image buffer from the GPU, with the page guard feature in the capture layer getting in the way of this. If page guard is effectively disabled by setting GFXRECON_MEMORY_TRACKING_MODE to "assisted", the sample program successfully creates correct screenshot image files.
I think the problem is that the sample program generates an image in a memory buffer, to be read back by the CPU to create an image file. However, the capture layer with page guard instead treats accesses to the memory buffer from the app as writes to the buffer, and so it instead copies data from the buffer in system RAM to the buffer in the GPU.
This same problem occurs with a simple Vulkan program like vkcube when both the capture layer and the screenshot layer are enabled. Screenshots generated in this case are not correct.
Interestingly, this problem does not occur on Windows. The page guard implementation is slightly different on Windows, and I'm not sure how those differences result in a correct image file being created. Maybe @panos-lunarg can provide some insight?
I think the problem is that the sample program generates an image in a memory buffer, to be read back by the CPU to create an image file. However, the capture layer with page guard instead treats accesses to the memory buffer from the app as writes to the buffer, and so it instead copies data from the buffer in system RAM to the buffer in the GPU.
Correct. Currently when page_guard detects a read from a page it automatically marks that page as written as well. The order of things is:
PROT_NONE + SIGSEGV trick.
vkQueueSubmit(2)/vkUnmapMemory/vkFlushMappedMemoryRanges it will see that the page is marked as written and it will do another memory copy, this time from the shadow memory into the mapped memory.
Apart from an occasional additional memcpy, I can't think of a way right now that this can cause corruption/artifacts.
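To make the described behavior easier to reason about, here is a minimal software model of it. This is only a sketch under assumptions: `PageTrackerModel`, `OnReadFault` and `Flush` are hypothetical names that do not come from the GFXReconstruct sources, the real layer detects accesses through page protection and a signal handler rather than explicit calls, and the mapped-to-shadow copy on a read fault is inferred from the "another memory copy" wording above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical model of shadow-memory page tracking. Names are illustrative
// and are NOT taken from the GFXReconstruct sources.
constexpr std::size_t kPageSize = 4096;

struct PageTrackerModel
{
    // Plays the role of kDefaultEnableReadWriteSamePage from the thread.
    bool read_write_same_page = true;

    std::vector<unsigned char> mapped;  // stands in for driver-mapped memory
    std::vector<unsigned char> shadow;  // the copy the application touches
    std::vector<bool>          written; // per-page "written" flags
    std::size_t                copies_to_mapped = 0;

    explicit PageTrackerModel(std::size_t pages) :
        mapped(pages * kPageSize, 0), shadow(pages * kPageSize, 0), written(pages, false)
    {}

    // Called where the real layer would handle a fault caused by a read.
    void OnReadFault(std::size_t page)
    {
        assert(page < written.size());
        // Assumption: refresh the shadow page so the app sees GPU-written data.
        std::memcpy(&shadow[page * kPageSize], &mapped[page * kPageSize], kPageSize);
        if (read_write_same_page)
        {
            written[page] = true; // a read also marks the page as written
        }
    }

    // Called where the real layer would process vkQueueSubmit(2),
    // vkUnmapMemory or vkFlushMappedMemoryRanges.
    void Flush()
    {
        for (std::size_t page = 0; page < written.size(); ++page)
        {
            if (written[page])
            {
                // The extra copy: shadow memory back into mapped memory.
                std::memcpy(&mapped[page * kPageSize], &shadow[page * kPageSize], kPageSize);
                ++copies_to_mapped;
                written[page] = false;
            }
        }
    }
};
```

In this model a pure read-back followed by `Flush()` costs one redundant copy but leaves the mapped contents unchanged, which matches the expectation that the behavior alone should not corrupt data. With `read_write_same_page` set to false the copy does not happen at all.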
You can disable the "mark pages that are read as written as well" behavior by changing:
static const bool kDefaultEnableReadWriteSamePage = true;
from true to false. IIRC this is hard coded and there is no environment variable option to control this.
Interestingly, this problem does not occur on Windows. The page guard implementation is slightly different on Windows, and I'm not sure how those differences result in correct image file being created.
The tracking mechanism is different on Windows but only as far as the underlying mechanism that does the read/write tracking provided by the OS. Once accesses are caught the handling should be the same (the order of actions is the same as both call HandleGuardPageViolation).
I guess you have already tried it but just to make sure: When capturing do you see this message:
[Buffer|Image] bound to device memory at an offset which is not page aligned. Corruption might occur. In that case set Page Guard Align Buffer Sizes env variable to true.
and if so have you tried setting GFXRECON_PAGE_GUARD_ALIGN_BUFFER_SIZES to true?
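For context on why that warning matters: tracking at page granularity cannot tell apart two resources that occupy the same page. The helpers below are a rough sketch of the arithmetic involved. They are hypothetical and not part of the GFXReconstruct API, and the page size is passed in explicitly (4096 is typical on x86_64 Linux, but that is an assumption here).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers, NOT part of the GFXReconstruct API.
// page_size is assumed to be a power of two.

inline bool IsPowerOfTwo(std::uint64_t v)
{
    return v != 0 && (v & (v - 1)) == 0;
}

// True when a bind offset starts exactly on a page boundary.
inline bool IsPageAligned(std::uint64_t offset, std::uint64_t page_size)
{
    assert(IsPowerOfTwo(page_size));
    return (offset & (page_size - 1)) == 0;
}

// Rounds a size up to the next multiple of the page size.
inline std::uint64_t AlignUp(std::uint64_t size, std::uint64_t page_size)
{
    assert(IsPowerOfTwo(page_size));
    return (size + page_size - 1) & ~(page_size - 1);
}

// True when a resource ending at end_of_first (exclusive, non-zero) and one
// starting at start_of_second fall on a common page, so page-level tracking
// would see an access to one as an access to the other.
inline bool SharePage(std::uint64_t end_of_first, std::uint64_t start_of_second, std::uint64_t page_size)
{
    assert(IsPowerOfTwo(page_size) && end_of_first > 0);
    return (end_of_first - 1) / page_size == start_of_second / page_size;
}
```

For example, with 4096-byte pages a 5000-byte buffer at offset 0 followed by a second buffer at offset 5120 share the second page, whereas rounding the first buffer up to `AlignUp(5000, 4096)` and binding the second one there keeps them on separate pages.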
@panos-lunarg, thanks for your response. It prompted me to look at all environment variables set when this sample program is run. I have determined that if I use:
VK_LOADER_LAYERS_ENABLE=VK_LAYER_LUNARG_gfxreconstruct GFXRECON_PAGE_GUARD_ALIGN_BUFFER_SIZES=true GFXRECON_PAGE_GUARD_COPY_ON_MAP=false dbuild/linux/app/bin/Debug/x86_64/vulkan_samples sample buffer_device_address --stop-after-frame 10 --screenshot 10 --screenshot-output samples_screenshot_frame_1
the generated png file of frame 10 is correct. I think the capture layer is behaving as expected as this bug was originally reported, and that using the environment variables above addresses the bug.
Will close this issue.
This sample produces an incorrect image during capture. This is an example of issue #590 that has no known fix with parameter tweaking.
Without capture the sample looks like this:
Command lines used (paths would need adjusting for repro):
The above command line gives:
The results were the same for all 16 permutations of the 4 page_guard flags of interest (ignoring the SIGSEGV one).
Unassisted presented differently:
Assisted was similar to unassisted:
From the sample's readme:
Fun stuff indeed.
Sounds like we could have work to do. Interesting that we fail at capture, when replay should have been the issue.
Environment