JuliaGPU / Vulkan.jl

Using Vulkan from Julia

Usage with RenderDoc #53

Closed fatteneder closed 7 hours ago

fatteneder commented 2 weeks ago

I was looking into how I could debug a Vulkan.jl app using RenderDoc.

So far I did not manage to get it to work. But here is what I have tried:

Following the quickstart section, I set the executable path to juliaup/bin/julia and passed something like --project=<path-to-project-dir> -e "using VulkanTutorial; main()" as command-line arguments. This is enough to launch the app from within RenderDoc. However, the advertised in-app overlay is not visible, and trying to capture (with F12) does nothing.

Following their FAQ I did some more configuring and debugging:

Atm I suspect the problem to be related to Julia not being linked to Vulkan directly, but instead we load it through Vulkan.jl. Unfortunately, the docs on the internals of RenderDoc skip the part about how RenderDoc hooks itself into the driver and application.

@serenity4 Do you have any ideas on how we could get this working?

The author of RenderDoc mentions in various places that he is happy to help, so I might also open an issue with RenderDoc and ask for details on the injection. But for that I need to craft a MWE Vulkan.jl app first.

serenity4 commented 2 weeks ago

Hi @fatteneder, thanks for the detailed description of the issue!

I haven't tried using RenderDoc myself, so I won't be able to bring any experience here, but hopefully I can help clarify how relevant internals of Vulkan.jl work to help link with RenderDoc.

Atm I suspect the problem to be related to Julia not being linked to Vulkan directly, but instead we load it through Vulkan.jl.

FYI, we dlopen the library during initialization, see here: https://github.com/JuliaGPU/VulkanCore.jl/blob/a7eb8426f6a16b258c5aa1589cb12c23cd97cb5f/src/LibVulkan.jl#L8-L31

Not sure if that sheds any light, but essentially all we do is grab the libvulkan.so library during runtime, which is in most cases going to be a Vulkan loader, and then use it as any application would to communicate with specific drivers e.g. when listing/using devices. Any configuration with regards to additional layers or extensions should not differ from any C library.
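To illustrate the point above, here is a minimal sketch of opening a Vulkan loader at runtime with Julia's Libdl standard library. It is not the actual VulkanCore.jl code (see the link above for that); the candidate library names and the helper `try_open_vulkan` are assumptions made for this example.

```julia
using Libdl

# Candidate loader names are an assumption for this sketch;
# the real lookup lives in VulkanCore.jl's LibVulkan.jl.
const CANDIDATES = ["libvulkan.so.1", "libvulkan.so", "vulkan-1"]

# Hypothetical helper: return the first handle that can be opened, or `nothing`.
function try_open_vulkan(names = CANDIDATES)
    for name in names
        handle = Libdl.dlopen(name; throw_error = false)
        handle === nothing || return handle
    end
    return nothing
end

handle = try_open_vulkan()
if handle === nothing
    println("no Vulkan loader found on this machine")
else
    # The loader's entry point is resolved by name at runtime, not at link time.
    sym = Libdl.dlsym(handle, :vkGetInstanceProcAddr; throw_error = false)
    println("loader: ", Libdl.dlpath(handle))
    println("vkGetInstanceProcAddr resolved: ", sym !== nothing)
end
```

Because the library is only opened once this code runs, nothing about Vulkan shows up in the link-time dependencies of the julia executable itself.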

Something that (very rarely) messed up with the interfacing with Vulkan for me was this: https://juliagpu.github.io/Vulkan.jl/dev/troubleshooting/#Internal-API-errors, perhaps you can use LD_PRELOAD with a recent C++ implementation as shown here: https://juliagpu.github.io/Vulkan.jl/dev/troubleshooting/#libstdc. I don't think it would make much difference but it might be worth a try if other solutions don't work.

Or perhaps use LD_PRELOAD to preload librenderdoc.so? Or open it at runtime using Libdl.dlopen? I don't know if that would make any difference, but from your description and the RenderDoc docs it looks like, if RenderDoc gets detected, it should take care of the issue.

Furthermore were you able to detect RenderDoc as a Vulkan layer at runtime? If not could it be a layer configuration issue?

fatteneder commented 2 weeks ago

Thanks for taking a look and your suggestions.

... perhaps you can use LD_PRELOAD with a recent C++ implementation as shown here: ...

I tried this with libstdc++.so.6 and librenderdoc.so, however, LD_PRELOAD ends up empty inside Julia. So it seems something is overriding it, but I am not sure if it's RenderDoc.

dlopening librenderdoc.so also did not help.

Atm I suspect the problem to be related to Julia not being linked to Vulkan directly, but instead we load it through Vulkan.jl.

Not sure if that sheds any light, but essentially all we do is grab the libvulkan.so library during runtime, which is in most cases going to be a Vulkan loader, and then use it as any application would to communicate with specific drivers e.g. when listing/using devices.

My idea was that RenderDoc might just look at the binary dependencies of the provided executable (using ldd) and then patch any libvulkan.so before running. However, both julia and any pkg image ~/.julia/compiled/v1.10/Vulkan/*.so only have basic deps, e.g.

~|⇒ ldd ~/.julia/compiled/v1.10/Vulkan/*.so
~/.julia/compiled/v1.10/Vulkan/fwQqd_bEdqV.so:
    linux-vdso.so.1 (0x00007fff6c1fb000)
    /usr/lib64/libstdc++.so.6 (0x00007fcdd0600000)
    libjulia.so.1.10 => not found
    libjulia-internal.so.1.10 => not found
    libm.so.6 => /lib64/libm.so.6 (0x00007fcdd37c9000)
    libc.so.6 => /lib64/libc.so.6 (0x00007fcdd0200000)
    /lib64/ld-linux-x86-64.so.2 (0x00007fcdd38c7000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007fcdd37a7000)
...

But this makes sense if libvulkan.so is loaded in Vulkan.jl's __init__.
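A complementary check to ldd is to ask the running process which shared libraries it has actually mapped. The sketch below uses `Libdl.dllist()` from the standard library; the helper name `loaded` is hypothetical.

```julia
using Libdl

# Hypothetical helper: paths of currently loaded shared libraries whose
# file name contains `pattern`.
loaded(pattern) = filter(path -> occursin(pattern, basename(path)), Libdl.dllist())

# In a fresh session this is typically empty; after `using Vulkan` has run its
# `__init__`, the loader should appear here even though ldd never lists it.
println(loaded("libvulkan"))
```

Running it once before and once after `using Vulkan` shows whether the loader was picked up at runtime.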

fatteneder commented 2 weeks ago

Furthermore were you able to detect RenderDoc as a Vulkan layer at runtime? If not could it be a layer configuration issue?

Yes, there is one entry.

julia> for l in enumerate_instance_layer_properties() |> unwrap
           display(l)
       end
LayerProperties("VK_LAYER_VALVE_steam_overlay_32", v"1.3.207", v"0.0.1", "Steam Overlay Layer")
LayerProperties("VK_LAYER_VALVE_steam_overlay_64", v"1.3.207", v"0.0.1", "Steam Overlay Layer")
LayerProperties("VK_LAYER_VALVE_steam_fossilize_32", v"1.3.207", v"0.0.1", "Steam Pipeline Caching Layer")
LayerProperties("VK_LAYER_VALVE_steam_fossilize_64", v"1.3.207", v"0.0.1", "Steam Pipeline Caching Layer")
LayerProperties("VK_LAYER_MESA_device_select", v"1.3.211", v"0.0.1", "Linux device selection layer")
LayerProperties("VK_LAYER_RENDERDOC_Capture", v"1.2.131", v"0.0.17", "Debugging capture layer for RenderDoc")
LayerProperties("VK_LAYER_KHRONOS_validation", v"1.3.204", v"0.0.1", "Khronos Validation Layer")

EDIT: But trying to enable it makes Julia hang at create_instance. Their docs say nothing about this layer, so it might not be necessary to enable it.

EDIT 2: After removing the renderdoc layer request I am now receiving the following warning:

[ Info: General (Loader Message): loader_add_implicit_layer: Disabling implicit layer VK_LAYER_RENDERDOC_Capture for using an old API version 1.2 versus application requested 1.3

Does that mean Vulkan.jl uses v1.2 implicitly somewhere?

serenity4 commented 2 weeks ago

After reading up a bit more through the RenderDoc documentation, I wouldn't be surprised if the only thing to do is enable that layer - and that launching the executable through RenderDoc's application would merely set VK_LOADER_LAYERS_ENABLE as documented here: https://github.com/KhronosGroup/Vulkan-Loader/blob/main/docs/LoaderInterfaceArchitecture.md#active-environment-variables. I may be wrong though.

The message being layer-specific makes me think that RenderDoc requires a 1.2 Vulkan API. Do you have the latest version installed? Perhaps you can try running your application on 1.2 as you originally did?

Vulkan.jl does not make any choices with respect to the version used. The only issue you might run into is that the bindings are not always updated to the latest API version, but they have been on 1.3 for a while already. (If anything Vulkan.jl does restricts usage in any way, I'd consider that a bug.)

fatteneder commented 2 weeks ago

After reading up a bit more through the RenderDoc documentation, I wouldn't be surprised if the only thing to do is enable that layer - and that launching the executable through RenderDoc's application would merely set VK_LOADER_LAYERS_ENABLE as documented here: https://github.com/KhronosGroup/Vulkan-Loader/blob/main/docs/LoaderInterfaceArchitecture.md#active-environment-variables. I may be wrong though.

The env variable is not set when running under RenderDoc.
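For reference, a small sketch of how such a check can be done from inside the Julia process that RenderDoc launched. The list of variables only contains names mentioned in this thread; the helper `env_report` is hypothetical.

```julia
# Variables discussed in this thread; extend as needed.
const WATCHED = ["LD_PRELOAD", "VK_LOADER_LAYERS_ENABLE", "VK_LOADER_DEBUG"]

# Hypothetical helper: pair each name with its value, or `nothing` if unset.
env_report(env = ENV, names = WATCHED) = [name => get(env, name, nothing) for name in names]

for (name, value) in env_report()
    println(rpad(name, 26), value === nothing ? "<unset>" : repr(value))
end
```

Running this at the top of the app's `main()` shows what the process really sees, which can differ from what was exported in the launching shell.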

The message being layer-specific makes me think that RenderDoc requires a 1.2 Vulkan API. Do you have the latest version installed? Perhaps you can try running your application on 1.2 as you originally did?

I have RenderDoc 1.17, which is from 2021, but that is what Fedora's dnf provided. Maybe I should build the latest version from source and try again.

My OS's Vulkan version is 1.3.204.

vulkaninfo --summary

```
~|⇒ vulkaninfo --summary
MESA-INTEL: warning: Performance support disabled, consider sysctl dev.i915.perf_stream_paranoid=0
WARNING: lavapipe is not a conformant vulkan implementation, testing use only.
==========
VULKANINFO
==========

Vulkan Instance Version: 1.3.204

Instance Extensions: count = 19
-------------------------------
VK_EXT_acquire_drm_display             : extension revision 1
VK_EXT_acquire_xlib_display            : extension revision 1
VK_EXT_debug_report                    : extension revision 10
VK_EXT_debug_utils                     : extension revision 2
VK_EXT_direct_mode_display             : extension revision 1
VK_EXT_display_surface_counter         : extension revision 1
VK_KHR_device_group_creation           : extension revision 1
VK_KHR_display                         : extension revision 23
VK_KHR_external_fence_capabilities     : extension revision 1
VK_KHR_external_memory_capabilities    : extension revision 1
VK_KHR_external_semaphore_capabilities : extension revision 1
VK_KHR_get_display_properties2         : extension revision 1
VK_KHR_get_physical_device_properties2 : extension revision 2
VK_KHR_get_surface_capabilities2       : extension revision 1
VK_KHR_surface                         : extension revision 25
VK_KHR_surface_protected_capabilities  : extension revision 1
VK_KHR_wayland_surface                 : extension revision 6
VK_KHR_xcb_surface                     : extension revision 6
VK_KHR_xlib_surface                    : extension revision 6

Instance Layers: count = 7
--------------------------
VK_LAYER_KHRONOS_validation       Khronos Validation Layer              1.3.204  version 1
VK_LAYER_MESA_device_select       Linux device selection layer          1.3.211  version 1
VK_LAYER_RENDERDOC_Capture        Debugging capture layer for RenderDoc 1.2.131  version 17
VK_LAYER_VALVE_steam_fossilize_32 Steam Pipeline Caching Layer          1.3.207  version 1
VK_LAYER_VALVE_steam_fossilize_64 Steam Pipeline Caching Layer          1.3.207  version 1
VK_LAYER_VALVE_steam_overlay_32   Steam Overlay Layer                   1.3.207  version 1
VK_LAYER_VALVE_steam_overlay_64   Steam Overlay Layer                   1.3.207  version 1

Devices:
========
GPU0:
    apiVersion         = 4206803 (1.3.211)
    driverVersion      = 92278791 (0x5801007)
    vendorID           = 0x8086
    deviceID           = 0x9a49
    deviceType         = PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU
    deviceName         = Intel(R) Xe Graphics (TGL GT2)
    driverID           = DRIVER_ID_INTEL_OPEN_SOURCE_MESA
    driverName         = Intel open-source Mesa driver
    driverInfo         = Mesa 22.1.7
    conformanceVersion = 1.3.0.0
    deviceUUID         = ff258cf4-4865-a82b-e58f-77bffa8e3040
    driverUUID         = 4bb42d09-c7a1-618d-f1d1-05403a44aa75
GPU1:
    apiVersion         = 4206803 (1.3.211)
    driverVersion      = 1 (0x0001)
    vendorID           = 0x10005
    deviceID           = 0x0000
    deviceType         = PHYSICAL_DEVICE_TYPE_CPU
    deviceName         = llvmpipe (LLVM 14.0.0, 256 bits)
    driverID           = DRIVER_ID_MESA_LLVMPIPE
    driverName         = llvmpipe
    driverInfo         = Mesa 22.1.7 (LLVM 14.0.0)
    conformanceVersion = 1.3.1.1
    deviceUUID         = 6d657361-3232-2e31-2e37-000000000000
    driverUUID         = 6c6c766d-7069-7065-5555-494400000000
~|⇒
```

There is now also an odd issue with switching API versions. I have removed the RenderDoc layer for the moment, but the app still hangs at create_instance, regardless of which API version I request. Restarting the REPL does not help either. It only works again after I wipe the pkg image cache of my app, and then only until I change the API version again; after that it hangs once more, even across REPL restarts. But this might be a separate issue.

serenity4 commented 2 weeks ago

VK_LAYER_RENDERDOC_Capture Debugging capture layer for RenderDoc 1.2.131 version 17

Yep, that one won't support Vulkan 1.3.
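The loader message quoted earlier compares the API version declared by the layer (1.2) with the one requested by the application (1.3). A rough sketch of that comparison in Julia is shown below. The helper `layer_too_old` is hypothetical and only mirrors the wording of the log message; it is not the loader's actual code.

```julia
# Hypothetical helper: a layer counts as "too old" when its declared
# major.minor API version is below the one the application requested.
layer_too_old(layer::VersionNumber, requested::VersionNumber) =
    (layer.major, layer.minor) < (requested.major, requested.minor)

println(layer_too_old(v"1.2.131", v"1.3"))  # true:  the RenderDoc 1.17 layer from dnf
println(layer_too_old(v"1.3.204", v"1.3"))  # false: e.g. the Khronos validation layer
```

Under this reading, requesting API version 1.2 in the application info would keep the old layer enabled, which matches the suggestion to try running on 1.2.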

For that other issue I suggest enabling full logging on the Vulkan side with validation layers. It usually provides good info on what goes wrong (if not, it might be an implementation issue); see VK_LOADER_DEBUG at https://github.com/KhronosGroup/Vulkan-Loader/blob/main/docs/LoaderInterfaceArchitecture.md.

fatteneder commented 2 weeks ago

Yep, that one won't support Vulkan 1.3.

Good catch. I have now built RenderDoc from source and configured the RenderDoc layer so that vulkaninfo --summary now shows

VK_LAYER_RENDERDOC_Capture        Debugging capture layer for RenderDoc 1.3.131  version 34

For that other issue I suggest enabling full logging on the Vulkan side with validation layers. It usually provides good info on what goes wrong (if not, it might be an implementation issue); see VK_LOADER_DEBUG at https://github.com/KhronosGroup/Vulkan-Loader/blob/main/docs/LoaderInterfaceArchitecture.md.

That's a good tip. Before I only had the logging provided by create_debug_utils_messenger_ext as given in your tutorial. I am using VK_LOADER_DEBUG=all now.
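One way to make sure the variable is in place before the Vulkan loader is opened is to start the app in a child process with the variable added to its environment. The sketch below only checks that the child sees the value; the inline script is a placeholder for the real app invocation.

```julia
# Launch a child Julia process with VK_LOADER_DEBUG added to its environment.
julia = Base.julia_cmd()

# Placeholder script: a real run would load the app here instead.
script = "print(get(ENV, \"VK_LOADER_DEBUG\", \"<unset>\"))"

cmd = addenv(`$julia --startup-file=no -e $script`, "VK_LOADER_DEBUG" => "all")
seen = read(cmd, String)
println("child sees VK_LOADER_DEBUG = ", seen)
```

`addenv` keeps the rest of the parent environment intact, so other settings such as the project path still apply to the child.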

Unfortunately, using the local build does not work either. The only RenderDoc related message I see is about some deprecated Vulkan method.

INFO: Layer "VK_LAYER_RENDERDOC_Capture" using deprecated 'vkGetDeviceProcAddr' tag which was deprecated starting with JSON file version 1.1.0. The new vkNegotiateLoaderLayerInterfaceVersion function is preferred, though for compatibility reasons it may be desirable to continue using the deprecated tag.

fatteneder commented 2 weeks ago

Re hanging in create_instance: This seems to have been caused by me dlopening librenderdoc.so within my app's __init__.

fatteneder commented 2 weeks ago

Enabling the RenderDoc layer again now makes create_instance fail, and I see the following logs

LAYER | DEBUG: Loading layer library /home/florian/wd/graphics/renderdoc/build/lib/librenderdoc.so
LAYER | INFO: Insert instance layer VK_LAYER_RENDERDOC_Capture (/home/florian/wd/graphics/renderdoc/build/lib/librenderdoc.so)
LAYER | DEBUG: Loading layer library libVkLayer_khronos_validation.so
LAYER | INFO: Insert instance layer VK_LAYER_KHRONOS_validation (libVkLayer_khronos_validation.so)
LAYER | DEBUG: Loading layer library libVkLayer_MESA_device_select.so
LAYER | INFO: Insert instance layer VK_LAYER_MESA_device_select (libVkLayer_MESA_device_select.so)
LAYER: vkCreateInstance layer callstack setup to:
LAYER:    <Application>
LAYER:      ||
LAYER:    <Loader>
LAYER:      ||
LAYER:    VK_LAYER_MESA_device_select
LAYER:            Type: Implicit
LAYER:                Disable Env Var:  NODEVICE_SELECT
LAYER:            Manifset: /usr/share/vulkan/implicit_layer.d/VkLayer_MESA_device_select.json
LAYER:            Library:  libVkLayer_MESA_device_select.so
LAYER:      ||
LAYER:    VK_LAYER_KHRONOS_validation
LAYER:            Type: Explicit
LAYER:            Manifset: /usr/share/vulkan/explicit_layer.d/VkLayer_khronos_validation.json
LAYER:            Library:  libVkLayer_khronos_validation.so
LAYER:      ||
LAYER:    VK_LAYER_RENDERDOC_Capture
LAYER:            Type: Implicit
LAYER:                Disable Env Var:  DISABLE_VULKAN_RENDERDOC_CAPTURE_1_34
LAYER:            Manifset: /etc/vulkan/implicit_layer.d/renderdoc_capture.json
LAYER:            Library:  /home/florian/wd/graphics/renderdoc/build/lib/librenderdoc.so
LAYER:      ||
LAYER:    <Drivers>

LAYER | DEBUG: Unloading layer library libVkLayer_MESA_device_select.so
LAYER | DEBUG: Unloading layer library libVkLayer_khronos_validation.so
LAYER | DEBUG: Unloading layer library /home/florian/wd/graphics/renderdoc/build/lib/librenderdoc.so
ERROR: ERROR_INITIALIZATION_FAILED: failed to execute #= /home/florian/.julia/packages/Vulkan/1Bx5C/generated/linux.jl:58548 =# @dispatch nothing vkCreateInstance(create_info, allocator, pInstance)
Stacktrace:
 [1] unwrap
   @ ~/.julia/packages/ResultTypes/AUZ9z/src/ResultTypes.jl:67 [inlined]

There are also some errors in the logs, but they might not be related to RenderDoc

INFO: /usr/lib/libvulkan_intel.so: wrong ELF class: ELFCLASS32
ERROR | DRIVER: loader_icd_scan: Failed to add ICD JSON /usr/lib/libvulkan_intel.so.  Skipping ICD JSON.
INFO: Found ICD manifest file /usr/share/vulkan/icd.d/lvp_icd.i686.json, version "1.0.0"
DEBUG: Searching for ICD drivers named /usr/lib/libvulkan_lvp.so
INFO: /usr/lib/libvulkan_lvp.so: wrong ELF class: ELFCLASS32
ERROR | DRIVER: loader_icd_scan: Failed to add ICD JSON /usr/lib/libvulkan_lvp.so.  Skipping ICD JSON.
INFO: Found ICD manifest file /usr/share/vulkan/icd.d/radeon_icd.i686.json, version "1.0.0"
DEBUG: Searching for ICD drivers named /usr/lib/libvulkan_radeon.so
INFO: /usr/lib/libvulkan_radeon.so: wrong ELF class: ELFCLASS32
ERROR | DRIVER: loader_icd_scan: Failed to add ICD JSON /usr/lib/libvulkan_radeon.so.  Skipping ICD JSON.
DEBUG: Build ICD instance extension list

serenity4 commented 2 weeks ago

Hmm, I don't have any clues as to what's happening here. These logs do seem fine; the "errors" are very minor and unrelated, since they concern device driver configurations and not the loader (which is responsible for managing Instances).

I sometimes faced issues with initialization failure that were fixed with a reboot, typically I'd also have these failures running vulkaninfo (presumably due to some driver state being incorrect). And I've had other errors come with validation layers enabled, you can try disabling them there to see whether it may be related to them or not.

serenity4 commented 2 weeks ago

Other than that, if everything works fine without RenderDoc, but doesn't with the layer on, that might hint at some form of incompatibility issue, but I'm mainly speculating.

fatteneder commented 2 weeks ago

Neither enabling the layer manually nor forcing it through LD_PRELOAD is supported or documented: https://github.com/baldurk/renderdoc/issues/3301#issuecomment-2090985133

Instead, we always need to go through the GUI.

serenity4 commented 2 weeks ago

That's a bummer, I would expect such a tool to have some form of programmatic interface 🙁

Were you able to use the GUI successfully with the command you originally attempted, with the local build of RenderDoc?

fatteneder commented 2 weeks ago

That's a bummer, I would expect such a tool to have some form of programmatic interface 🙁

I think there is a python API to do replaying or something like that.

Were you able to use the GUI successfully with the command you originally attempted, with the local build of RenderDoc?

No. It gives me the same boring logs without hints.

serenity4 commented 2 weeks ago

I think there is a python API to do replaying or something like that.

Indeed. I haven't looked into it but I would have rather expected something easier to interface with, given the popularity of RenderDoc.

No. It gives me the same boring logs without hints.

I'm not sure then that RenderDoc would even function with Julia. If it uses binary introspection on the provided executable, it might not work given the fact that the executable must be Julia itself, with no linking to libvulkan.so or anything prior to the application running.

If you have the courage to keep investigating, I'd be happy to bounce ideas around, though at this stage I wouldn't know where to begin (besides perhaps diving into RenderDoc internals to see what they do when launching the application via the UI).

serenity4 commented 1 day ago

I tried using RenderDoc with my app, and I have the exact same situation as you:

I tried with the GUI, and with renderdoccmd capture julia --project=<path-to-project> test/runtests.jl, same thing.

I'll leave it there for now, since I don't need RenderDoc at the moment, but I thought I'd give it a try and see if I can replicate the issue for future reference.

fatteneder commented 10 hours ago

If you have the courage to keep investigating, I'd be happy to bounce ideas around, though at this stage I wouldn't know where to begin (besides perhaps diving into RenderDoc internals to see what they do when launching the application via the UI).

I had done some more investigations two weeks ago, but because that was without success, I haven't reported back.

If it uses binary introspection on the provided executable, ...

At least on Linux it seems to use ptrace to get control over the executable and set up the magic tricks. My guess as to why this is done in this 'complicated' fashion is that the tool also wants to gather statistics about allocations or things like that, and also wants to support multiple graphics APIs.

The problem I see with that setup in combination with Julia is that Julia's main is not the user's main where the real Vulkan stuff happens. In an ideal scenario, we would like to use RenderDoc with the user's main, which I think might also make for a nicer experience in terms of REPL usage.

Anyway, atm I don't know how to proceed further with this. Perhaps it's possible to write a shim around RenderDoc which provides a separate 'interception' method, but I would not know right now how to start with that.

serenity4 commented 7 hours ago

Thanks, this is insightful. I'm not sure either what could be done, it seems to me like we would need to submit changes to RenderDoc (which I'm not sure the author would accept, given that we are probably a very niche case) or perform extensive hacks around the way RenderDoc works externally, which may be brittle at best.

We seem to have come to the conclusion that RenderDoc isn't supported with Julia at the moment, therefore I think we can close this issue and I'll update the docs to reflect that. If anyone wants to dive in further, they are free to do so, but that sounds like a lot of work in any case to attempt to make it work.