LunarG / gfxreconstruct

Graphics API Capture and Replay Tools for Reconstructing Graphics Application Behavior
https://vulkan.lunarg.com/doc/sdk/latest/linux/capture_tools.html
MIT License

webgpu: vkAllocateMemory returned error value VK_ERROR_INVALID_EXTERNAL_HANDLE #1743

Open jeremyg-lunarg opened 2 months ago

jeremyg-lunarg commented 2 months ago

Describe the replay bug: This is a replay of WebGPU content running on Linux with the RADV driver. WebGPU renders into a swapchain image created by Chromium, which is then passed to a compositor in Chromium. Both components use Vulkan, each with its own VkInstance and VkDevice.

[gfxrecon] FATAL - API call at index: 4261 thread: 1 vkAllocateMemory returned error value VK_ERROR_INVALID_EXTERNAL_HANDLE that does not match the result from the capture file: VK_SUCCESS. Replay cannot continue.
Replay has encountered a fatal error and cannot continue: an external handle is not a valid handle of the specified type

It looks like at some point vkGetMemoryFdKHR() returns a file descriptor of -1, but I could be misreading the output.

Note: Chromium might be doing graphics work in multiple processes. Looking at the gfxrecon-convert output, I think everything is in one process and was captured, but I'm not 100% sure.

Verify before submission:

Build Environment: Please include the SHA and PR or branch name used in capture and also used to build the replayer.

1.3.290 SDK

To Reproduce Steps to reproduce the behavior:

  1. Get the .gfxr file attached to the issue.
  2. Run gfxrecon-replay with gfxrecon_capture_frames_1_through_5000_20240916T105521.gfxr and no other arguments.

Screenshots: Does not run long enough for screenshots.

System environment: Capture and replay on the same system running Ubuntu 24.04 with the RADV driver

Title configuration: life branch of https://github.com/jeremyg-lunarg/webgpu-electron

With npm and Node.js installed, run npm install and then npm run start.

Additional information (optional):

jeremyg-lunarg commented 2 months ago

gfxrecon_capture_frames_1_through_5000_20240916T105521.gfxr.gz

jeremyg-lunarg commented 2 months ago

I think I see what is happening. Here's the memory export to fd 75:

{
  "index": 4255,
  "function": {
    "name": "vkGetMemoryFdKHR",
    "thread": 1,
    "return": "VK_SUCCESS",
    "args": {
      "device": 7,
      "pGetFdInfo": {
        "sType": "VK_STRUCTURE_TYPE_MEMORY_GET_FD_INFO_KHR",
        "memory": 585,
        "handleType": "VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_FD_BIT",
        "pNext": null
      },
      "pFd": 75
    }
  }
},

Then the import happens from fd 76:

{
  "index": 4261,
  "function": {
    "name": "vkAllocateMemory",
    "thread": 1,
    "return": "VK_SUCCESS",
    "args": {
      "device": 38,
      "pAllocateInfo": {
        "sType": "VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO",
        "allocationSize": 1048576,
        "memoryTypeIndex": 0,
        "pNext": {
          "sType": "VK_STRUCTURE_TYPE_IMPORT_MEMORY_FD_INFO_KHR",
          "handleType": "VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_FD_BIT",
          "fd": 76,
          "pNext": {
            "sType": "VK_STRUCTURE_TYPE_MEMORY_DEDICATED_ALLOCATE_INFO",
            "image": 586,
            "buffer": 0,
            "pNext": null
          }
        }
      },
      "pAllocator": null,
      "pMemory": 587
    }
  }
},

I think the problem is that during replay fd 75 is valid but fd 76 is not. The Chromium code includes this:

descriptor.memoryFD = dup(memory_fd_.get());

That is most likely what makes fd 76 point to the same external memory as fd 75.
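
To make the dup() behavior concrete, here is a minimal sketch in plain C with no Vulkan involved. A pipe end stands in for the fd that vkGetMemoryFdKHR() would return; the names same_file and dup_demo are made up for this illustration and do not come from Chromium or gfxreconstruct.

```c
#include <assert.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns 1 if both descriptors refer to the same underlying object
 * (same device and inode), 0 otherwise. */
static int same_file(int a, int b)
{
    struct stat sa, sb;
    if (fstat(a, &sa) != 0 || fstat(b, &sb) != 0)
        return 0;
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}

/* Returns 0 if the duplicate shares the object with the original and
 * stays usable after the original is closed, -1 otherwise. */
static int dup_demo(void)
{
    int p[2];
    if (pipe(p) != 0)
        return -1;

    int exported = p[0];          /* stand-in for the fd from vkGetMemoryFdKHR */
    int imported = dup(exported); /* new fd number, same underlying object */
    if (imported < 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }
    assert(imported != exported); /* dup() never returns the fd it was given */

    int shared = same_file(exported, imported);

    close(exported);              /* the original number is now invalid */
    struct stat st;
    int still_valid = (fstat(imported, &st) == 0);

    close(imported);
    close(p[1]);
    return (shared && still_valid) ? 0 : -1;
}
```

The point is that the two fd numbers are independent handles to one object: if a replayer only ever creates the first one, the second number never exists in the replay process.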

I'm guessing that gfxreconstruct doesn't call dup(), and it doesn't really need to unless it wants to keep some control of the fd (which Chromium apparently does). So getting this application to work would probably require recording dup(), and possibly some other system calls, to know when this is happening.
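
As a rough sketch of what "recording dup()" could mean on the replay side, the code below keeps a table from capture-time fd numbers to live replay-time fds, and replays a recorded dup() by duplicating the live fd. Everything here (FdMap, fd_map_bind, fd_map_lookup, fd_map_replay_dup, the table size) is hypothetical and is not an existing gfxreconstruct API; it only illustrates the bookkeeping under the assumption that dup() calls were present in the capture.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

#define FD_MAP_SIZE 1024 /* hypothetical upper bound on captured fd values */

/* Index: fd number seen at capture time. Value: live fd at replay, or -1. */
typedef struct {
    int replay_fd[FD_MAP_SIZE];
} FdMap;

static void fd_map_init(FdMap *m)
{
    assert(m != NULL);
    for (int i = 0; i < FD_MAP_SIZE; ++i)
        m->replay_fd[i] = -1;
}

/* Record that captured fd `cap` corresponds to live fd `live`,
 * e.g. after replaying vkGetMemoryFdKHR. Returns 0 on success. */
static int fd_map_bind(FdMap *m, int cap, int live)
{
    if (cap < 0 || cap >= FD_MAP_SIZE || live < 0)
        return -1;
    m->replay_fd[cap] = live;
    return 0;
}

/* Translate a captured fd to the live fd, or -1 if it is unknown. */
static int fd_map_lookup(const FdMap *m, int cap)
{
    if (cap < 0 || cap >= FD_MAP_SIZE)
        return -1;
    return m->replay_fd[cap];
}

/* Replay a recorded `dup(old_cap) == new_cap`: duplicate the live fd and
 * bind the copy to the captured result. Returns the new live fd or -1. */
static int fd_map_replay_dup(FdMap *m, int old_cap, int new_cap)
{
    int live = fd_map_lookup(m, old_cap);
    if (live < 0)
        return -1;
    int copy = dup(live);
    if (copy < 0)
        return -1;
    if (fd_map_bind(m, new_cap, copy) != 0) {
        close(copy);
        return -1;
    }
    return copy;
}
```

With a table like this, the import at index 4261 would look up captured fd 76 and pass the live duplicate to vkAllocateMemory instead of the raw number from the capture file.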