wgpu may lock up in ioctl on Linux/Vulkan/Intel

kvark commented 3 years ago

Description: Under certain scenarios, we may see a hang in ioctl.

Repro steps: Unknown.

Expected vs observed behavior: No hangs.

Extra materials: Looks like this is described in https://www.reddit.com/r/vulkan/comments/b37762/command_queue_grows_indefinitely_on_intel_gpus/. Edit: actually, no, we aren't expecting vkAcquireNextImageKHR to block. We are explicitly blocking on the fence, which was passed to it, instead.

Platform: wgpu master

Bobo1239 commented 3 years ago

Not sure whether this is the same issue, but I've recently tracked down something similar in the same environment (Linux (Wayland)/Vulkan/Intel). My code is based on the Vulkan tutorial, so you can just take e.g. this repo (compile in release mode due to asset loading).

In my case (and also with the linked repo) the issue is always quickly reproducible by having a YouTube video running in the background (in another Sway workspace) and switching window focus/workspaces a couple of times. The crux, though, is that the hang only happens if I set FRAMES_IN_FLIGHT to 1. The issue never happens with 2. Since https://github.com/gfx-rs/wgpu/issues/932 is open, I'll assume that wgpu doesn't have multiple frames in flight, so that may be the same root cause. Unfortunately I have no idea where to go from here, since that's probably a driver bug...
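For readers unfamiliar with the setting, here is a minimal sketch of what FRAMES_IN_FLIGHT changes in Vulkan-tutorial-style code. It is a pure Rust model with no Vulkan calls, and the function names `slot_for_frame` and `frame_to_wait_for` are hypothetical, not taken from the linked repo. It only illustrates that with one frame in flight the CPU waits on the immediately preceding frame every time, whereas with two it waits on the frame before that.

```rust
/// Index of the per-frame sync slot (fence and semaphores) used by `frame`.
/// Hypothetical helper for illustration only.
fn slot_for_frame(frame: u64, frames_in_flight: u64) -> u64 {
    frame % frames_in_flight
}

/// The earlier frame whose fence the CPU must wait on before it can reuse
/// the slot for `frame`. `None` means the slot has never been used.
fn frame_to_wait_for(frame: u64, frames_in_flight: u64) -> Option<u64> {
    frame.checked_sub(frames_in_flight)
}

fn main() {
    for n in [1u64, 2] {
        println!("FRAMES_IN_FLIGHT = {}", n);
        for frame in 0..4u64 {
            let slot = slot_for_frame(frame, n);
            match frame_to_wait_for(frame, n) {
                Some(prev) => println!("  frame {}: slot {}, waits for frame {}", frame, slot, prev),
                None => println!("  frame {}: slot {}, no wait", frame, slot),
            }
        }
    }
}
```

This says nothing about why the driver hangs; it only shows how the two settings differ in which fence the CPU blocks on.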

Backtrace on my machine (hang is inside vkQueuePresentKHR()):

[#0] 0x7f6943be559b → ioctl()
[#1] 0x7f6942ead924 → cmp eax, 0xffffffff
[#2] 0x56359223412d → ash::extensions::khr::swapchain::Swapchain::queue_present()
[#3] 0x563591feeb92 → vulkan_tutorial_ash::VulkanApp::draw_frame()
[#4] 0x563591ffd7c0 → _ZN19vulkan_tutorial_ash4main28_$u7b$$u7b$closure$u7d$$u7d$17ha17eb207c13ce696E.llvm.10835609363145980811()
[#5] 0x563592050dc8 → winit::platform_impl::platform::wayland::event_loop::EventLoop<T>::run()
[#6] 0x563591ffa1cc → winit::platform_impl::platform::EventLoop<T>::run()
[#7] 0x563591fe54ea → winit::event_loop::EventLoop<T>::run()
[#8] 0x563591fe5c7f → vulkan_tutorial_ash::main()

bllanos commented 3 years ago

On the current master branch (11d31d537706f69b982465840a25244e469f471a), if I duplicate the section of code in the hello-compute example that creates a command encoder and submits work to the device:

    let mut encoder =
        device.create_command_encoder(&wgpu::CommandEncoderDescriptor { label: None });
    {
        let mut cpass = encoder.begin_compute_pass(&wgpu::ComputePassDescriptor { label: None });
        cpass.set_pipeline(&compute_pipeline);
        cpass.set_bind_group(0, &bind_group, &[]);
        cpass.insert_debug_marker("compute collatz iterations");
        cpass.dispatch(numbers.len() as u32, 1, 1); // Number of cells to run, the (x,y,z) size of item being processed
    }
    // Adds a copy operation to the command encoder.
    // Will copy data from storage buffer on GPU to staging buffer on CPU.
    encoder.copy_buffer_to_buffer(&storage_buffer, 0, &staging_buffer, 0, size);

    // Submits command encoder for processing
    queue.submit(Some(encoder.finish()));

such that it appears twice in sequence in the file, then the example hangs.

In my own project, I found that all tests where I tried to run two (as opposed to one) compute passes on the device would hang at the end of the test, when wgpu resources are being dropped. It seemed like there was a deadlock inside vkDestroyDevice.

I don't know whether this is the same issue or a different issue.

My environment is Linux (Ubuntu 21.04, with Wayland and kernel 5.11.0-25-generic), with the following device adapter info:

AdapterInfo { name: "Intel(R) UHD Graphics (CML GT2)", vendor: 32902, device: 39882, device_type: IntegratedGpu, backend: Vulkan }

bllanos commented 3 years ago

I did some further investigation into my comment above.

First of all, the problem does not occur on the v0.9 branch (commit 0084d68c6079d008502fbb5887c14713fd7a6c8d), so I think it may be a bug in wgpu-hal.

Attached are trace-level logs generated on the master branch (commit 7798534d428100145f10de1fac4bd35c38737806), either for the unmodified version of the hello-compute example (does not hang), or for the version with the modifications I mentioned above (hangs):

The interesting parts of the logs seem to be near the end. In the example that hangs:

There are Vulkan validation errors saying that the application is attempting to reset a command pool with two command buffers that are currently in use. The errors look like:

Validation Error: [ VUID-vkResetCommandPool-commandPool-00040 ] Object 0: handle = 0x563071254d10, type = VK_OBJECT_TYPE_COMMAND_BUFFER; | MessageID = 0xb53e2331 | Attempt to reset command pool with VkCommandBuffer 0x563071254d10[] which is in use. The Vulkan spec states: All VkCommandBuffer objects allocated from commandPool must not be in the pending state (https://www.khronos.org/registry/vulkan/specs/1.2-extensions/html/vkspec.html#VUID-vkResetCommandPool-commandPool-00040)

All subsequent Vulkan validation errors seem to indicate that resources cannot be freed because there are still command buffers referring to them. The failure to free resources seems to put the program into an interminable loop of attempted resource release operations.
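To make the validation rule concrete, the following is a toy model of the command buffer lifecycle as I understand it from the Vulkan spec: a submitted buffer stays pending until the GPU finishes with it, and resetting the pool while any buffer is pending is invalid. This is not wgpu-hal code. The type and method names (`CommandPool`, `PendingBuffer`, `gpu_completed`, and so on) are made up for illustration, and the real lifecycle also has Recording and Invalid states that are omitted here.

```rust
/// Simplified command buffer lifecycle (Recording and Invalid are omitted).
#[derive(Clone, Copy, Debug, PartialEq)]
enum CmdBufState {
    Initial,
    Executable,
    Pending,
}

/// Error returned when a reset is attempted while a buffer is still in use.
#[derive(Debug, PartialEq)]
struct PendingBuffer {
    index: usize,
}

/// Toy command pool: tracks state only and makes no Vulkan calls.
struct CommandPool {
    buffers: Vec<CmdBufState>,
}

impl CommandPool {
    fn new(count: usize) -> Self {
        CommandPool { buffers: vec![CmdBufState::Initial; count] }
    }

    /// Recording finished: the buffer may now be submitted.
    fn finish(&mut self, index: usize) {
        self.buffers[index] = CmdBufState::Executable;
    }

    /// Queue submission: the buffer is pending until the GPU is done with it.
    fn submit(&mut self, index: usize) {
        assert_eq!(self.buffers[index], CmdBufState::Executable);
        self.buffers[index] = CmdBufState::Pending;
    }

    /// The GPU finished the buffer (real code learns this by waiting on a fence).
    fn gpu_completed(&mut self, index: usize) {
        assert_eq!(self.buffers[index], CmdBufState::Pending);
        self.buffers[index] = CmdBufState::Executable;
    }

    /// Models the rule behind VUID-vkResetCommandPool-commandPool-00040:
    /// no buffer allocated from the pool may be pending at reset time.
    fn reset(&mut self) -> Result<(), PendingBuffer> {
        if let Some(index) = self.buffers.iter().position(|s| *s == CmdBufState::Pending) {
            return Err(PendingBuffer { index });
        }
        for state in &mut self.buffers {
            *state = CmdBufState::Initial;
        }
        Ok(())
    }
}

fn main() {
    let mut pool = CommandPool::new(2);
    for i in 0..2 {
        pool.finish(i);
        pool.submit(i);
    }
    println!("reset while pending: {:?}", pool.reset());
    for i in 0..2 {
        pool.gpu_completed(i);
    }
    println!("reset after completion: {:?}", pool.reset());
}
```

In this model the first reset fails because both buffers are still pending, which mirrors the two validation errors in the log. Whether wgpu-hal reaches that state because of a missed fence wait or for some other reason is not something this sketch can show.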

The command buffers that cannot be freed are those created during the second compute pass, as far as I can tell from using a debugger. There are a few details I do not understand:

Both compute passes are the same, as written in main.rs, but the first one seems to submit 3 command buffers, whereas the second submits 2 command buffers.
In the call to the function that seems to reset the command pool after each compute pass (reset_all), the input list of command buffers seems to contain 2 command buffers (for each pass).

Unfortunately I am not sufficiently familiar with the codebase to know the source of the problem.

bllanos commented 3 years ago

I realise my problem is probably this issue: https://github.com/gfx-rs/wgpu/issues/1689

kvark commented 3 years ago

Thank you for the investigation. Your issue is certainly easier to reproduce.

kvark commented 3 years ago

Ok, I don't know how to approach this. I tried on 3 different machines (Linux/NV, Linux/AMD, and Windows/Intel, all on Vulkan) with no luck reproducing any bad behavior of the example as given. @bllanos could you try updating the Vulkan driver and the validation layers?

Bobo1239 commented 3 years ago

Just as a datapoint: I can reproduce @bllanos's issue on Linux/Intel/Vulkan (AdapterInfo { name: "Intel(R) HD Graphics 520 (SKL GT2)", vendor: 32902, device: 6422, device_type: IntegratedGpu, backend: Vulkan }).

bllanos commented 3 years ago

I am not sure how to update the Vulkan validation layers beyond the version provided by the default Ubuntu repositories (vulkan-validationlayers:amd64 1.2.162.0-1). If it is important, I could look into it in more detail.

I updated package mesa-vulkan-drivers:amd64 from version 21.0.3-0ubuntu0.2 to version 21.3~git2108120600.513fb5~oibaf~h using the following PPA: https://launchpad.net/~oibaf/+archive/ubuntu/graphics-drivers. I believe the latest Mesa release is 21.2.0 (https://archive.mesa3d.org//).

The issue still occurs on the latest master branch (f0520f8c5416362f291a3e5a3cbc547918d2b98d) with the updated driver.

cwfitzgerald commented 3 years ago

@bllanos you can update by getting the SDK here: https://vulkan.lunarg.com/.

tfgast commented 3 years ago

I am seeing a similar hang in ioctl, although I cannot reproduce bllanos's version with hello-compute, so I'm not sure if I'm seeing the same bug. I'm also on Linux with the Mesa driver. Here's a minimized version:

use winit::{
    event::{Event, WindowEvent},
    event_loop::ControlFlow,
};

struct Framework {
    device: wgpu::Device,
    queue: wgpu::Queue,
    sc_desc: wgpu::SurfaceConfiguration,
    surface: wgpu::Surface,
}

impl Framework {
    async fn new(window: &winit::window::Window) -> Framework {
        let backend = wgpu::util::backend_bits_from_env().unwrap_or(wgpu::Backends::PRIMARY);
        let instance = wgpu::Instance::new(backend);
        let size = window.inner_size();
        let surface = unsafe { instance.create_surface(window) };
        let adapter = wgpu::util::initialize_adapter_from_env_or_default(&instance, backend)
            .await
            .expect("No suitable GPU adapters found on the system!");

        let features = wgpu::Features::default();
        let trace_dir = std::env::var("WGPU_TRACE");
        let limits = adapter.limits();
        let (device, queue) = adapter
            .request_device(
                &wgpu::DeviceDescriptor {
                    features,
                    limits,
                    label: None,
                },
                trace_dir.ok().as_ref().map(std::path::Path::new),
            )
            .await
            .expect("Unable to find a suitable GPU adapter!");
        let format = surface.get_preferred_format(&adapter).unwrap();

        let sc_desc = wgpu::SurfaceConfiguration {
            usage: wgpu::TextureUsages::RENDER_ATTACHMENT,
            format,
            width: size.width,
            height: size.height,
            present_mode: wgpu::PresentMode::Mailbox,
        };

        // Removing this command encoder fixes the hang
        let init_encoder =
            device.create_command_encoder(&wgpu::CommandEncoderDescriptor { label: None });
        queue.submit(Some(init_encoder.finish()));

        Framework {
            device,
            queue,
            sc_desc,
            surface,
        }
    }

    fn handle_event<T>(
        &mut self,
        window: &winit::window::Window,
        event: Event<'_, T>,
        control_flow: &mut ControlFlow,
    ) {
        match event {
            Event::MainEventsCleared => {
                window.request_redraw();
            }
            Event::WindowEvent {
                event: WindowEvent::Resized(size),
                ..
            } => {
                self.sc_desc.width = if size.width == 0 { 1 } else { size.width };
                self.sc_desc.height = if size.height == 0 { 1 } else { size.height };
                self.surface.configure(&self.device, &self.sc_desc)
            }
            Event::WindowEvent {
                event: WindowEvent::CloseRequested,
                ..
            } => *control_flow = ControlFlow::Exit,
            Event::RedrawRequested(_) => {
                let surface = &self.surface;

                let frame = match surface.get_current_frame() {
                    Ok(frame) => frame,
                    Err(_) => {
                        self.surface.configure(&self.device, &self.sc_desc);
                        surface
                            .get_current_frame()
                            .expect("Failed to acquire next surface texture!")
                    }
                };
                let encoder = self
                    .device
                    .create_command_encoder(&wgpu::CommandEncoderDescriptor { label: None });

                self.queue.submit(std::iter::once(encoder.finish()));
                dbg!("Hangs after here");
            }
            _ => {}
        }
        dbg!("Hangs before here");
    }
}

fn main() {
    let event_loop = winit::event_loop::EventLoop::new();
    let mut builder = winit::window::WindowBuilder::new();
    builder = builder.with_title("test");
    #[cfg(windows_OFF)]
    {
        use winit::platform::windows::WindowBuilderExtWindows;
        builder = builder.with_no_redirection_bitmap(true);
    }
    let window = builder.build(&event_loop).unwrap();

    let mut framework = pollster::block_on(Framework::new(&window));
    framework
        .surface
        .configure(&framework.device, &framework.sc_desc);

    event_loop.run(move |event, _, control_flow| {
        *control_flow = if cfg!(feature = "metal-auto-capture") {
            ControlFlow::Exit
        } else {
            ControlFlow::Poll
        };
        framework.handle_event(&window, event, control_flow);
    });
}

bllanos commented 3 years ago

I can reproduce @tfgast's issue (and its workaround, when I comment out the encoder as mentioned in the code sample). @tfgast "my" issue seems more like #1689, but I started the conversation on this thread before I came to that conclusion.

@cwfitzgerald I set up the LunarG SDK as described at https://vulkan.lunarg.com/doc/sdk/1.2.182.0/linux/getting_started.html (skipping the step on copying files to system directories, which I think is not necessary based on the instructions given by ash's author here). I have the following environment variables in my shell startup file:

export VULKAN_SDK=$HOME...vulkan/1.2.182.0/x86_64
export PATH="$VULKAN_SDK/bin:$PATH"
export LD_LIBRARY_PATH="$VULKAN_SDK/lib:$LD_LIBRARY_PATH"
export VK_LAYER_PATH="$VULKAN_SDK/etc/vulkan/explicit_layer.d:$VK_LAYER_PATH"

I re-tested my issue on commit 8f02b73655aff641361822a8ac0347fc47622b49, running

RUST_LOG=trace cargo run --example hello-compute --features trace &> hello_compute.log

to generate the log file in the attachment. I ran cargo clean beforehand, just in case. hello_compute.zip

The attachment includes the example's logging output, trace files, the Rust code I was running (mentioned above), and the output of vkvia and vulkaninfo.

kvark commented 3 years ago

Looks similar to #1878

[2021-08-20T14:05:24Z ERROR wgpu_hal::vulkan::instance] VALIDATION [VUID-vkResetCommandPool-commandPool-00040 (0xb53e2331)] Validation Error: [ VUID-vkResetCommandPool-commandPool-00040 ] Object 0: handle = 0x55b56e35f690, name = _Transit, type = VK_OBJECT_TYPE_COMMAND_BUFFER; | MessageID = 0xb53e2331 | Attempt to reset command pool with VkCommandBuffer 0x55b56e35f690[_Transit] which is in use. The Vulkan spec states: All VkCommandBuffer objects allocated from commandPool must not be in the pending state (https://vulkan.lunarg.com/doc/view/1.2.182.0/linux/1.2-extensions/vkspec.html#VUID-vkResetCommandPool-commandPool-00040)

kvark commented 3 years ago

For anyone who can reproduce this, do you have a dual-GPU configuration with NVidia by any chance?

kvark commented 3 years ago

I wonder if it's related to https://github.com/gfx-rs/wgpu/pull/1898

tfgast commented 3 years ago

I do have a dual-GPU with NVidia.

bllanos commented 3 years ago

I only have Intel integrated graphics.

kocsis1david commented 3 years ago

I only have an integrated GPU, and I can reproduce it by changing the hello-compute example: I only need to submit an empty command buffer before execute_gpu_inner. The validation errors appear after 5 seconds, which is the CLEANUP_WAIT_MS.

mitchmindtree commented 3 years ago

I think I might have run into this while updating conrod_wgpu's wgpu dep from 0.9 to 0.10. Specifically, the hang occurs at the end of the first RedrawRequested event during the Drop implementation for SurfaceFrame.

Here's the backtrace from gdb at the moment of hanging:

(gdb) backtrace
#0  0x00007ffff7d1fb07 in ioctl () from /nix/store/gk42f59363p82rg2wv2mfy71jn5w4q4c-glibc-2.32-48/lib/libc.so.6
#1  0x00007fffe2a956c0 in anv_gem_syncobj_timeline_wait ()
   from /nix/store/85hbpjblyvgg9k9vvirqk69r8qb1k5dl-mesa-21.1.4-drivers/lib/libvulkan_intel.so
#2  0x00007fffe2ad2d19 in anv_QueuePresentKHR ()
   from /nix/store/85hbpjblyvgg9k9vvirqk69r8qb1k5dl-mesa-21.1.4-drivers/lib/libvulkan_intel.so
#3  0x0000555555feed00 in ash::vk::extensions::KhrSwapchainFn::queue_present_khr (self=0x55555710f280, queue=...,
    p_present_info=0x7ffffffe6f60)
    at /home/mindtree/.cargo/registry/src/github.com-1ecc6299db9ec823/ash-0.33.3+1.2.191/src/vk/extensions.rs:566
#4  0x0000555555fe2acd in ash::extensions::khr::swapchain::Swapchain::queue_present (self=0x55555710f278, queue=...,
    create_info=0x7ffffffe6f60)
    at /home/mindtree/.cargo/registry/src/github.com-1ecc6299db9ec823/ash-0.33.3+1.2.191/src/extensions/khr/swapchain.rs:91
#5  0x0000555555eeceac in wgpu_hal::vulkan::{{impl}}::present (self=0x55555710f270, surface=0x555556e2e840,
    texture=...)
    at /home/mindtree/.cargo/registry/src/github.com-1ecc6299db9ec823/wgpu-hal-0.10.4/src/vulkan/mod.rs:531
#6  0x0000555555b474e7 in wgpu_core::hub::Global<wgpu_core::hub::IdentityManagerFactory>::surface_present<wgpu_core::hub::IdentityManagerFactory,wgpu_hal::vulkan::Api> (self=0x555556e29640, surface_id=...)
    at /home/mindtree/.cargo/registry/src/github.com-1ecc6299db9ec823/wgpu-core-0.10.2/src/present.rs:243
#7  0x0000555555c3169b in wgpu::backend::direct::{{impl}}::surface_present (self=0x555556e29640,
    texture=0x7ffffffe8040, detail=0x7ffffffe8058)
    at /home/mindtree/.cargo/registry/src/github.com-1ecc6299db9ec823/wgpu-0.10.1/src/backend/direct.rs:929
#8  0x0000555555cac720 in wgpu::{{impl}}::drop (self=0x7ffffffe8038)
    at /home/mindtree/.cargo/registry/src/github.com-1ecc6299db9ec823/wgpu-0.10.1/src/lib.rs:3069
#9  0x0000555555c600e7 in core::ptr::drop_in_place<wgpu::SurfaceTexture> ()
    at /nix/store/r218w4jqf2yl6whglfpq0kz61yjn1jhz-rust-default-1.53.0/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:192
#10 0x0000555555767f8b in core::ptr::drop_in_place<wgpu::SurfaceFrame> ()
    at /nix/store/r218w4jqf2yl6whglfpq0kz61yjn1jhz-rust-default-1.53.0/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:192
#11 0x0000555555772f31 in all_winit_wgpu::main::{{closure}} (event=..., control_flow=0x7ffffffe8b40)
    at backends/conrod_wgpu/examples/all_winit_wgpu.rs:266
#12 0x00005555557b921e in winit::platform_impl::platform::sticky_exit_callback<(),closure-0> (evt=...,
    target=0x555556d86970, control_flow=0x7ffffffe8b40, callback=0x7ffffffe9338)
    at /home/mindtree/.cargo/registry/src/github.com-1ecc6299db9ec823/winit-0.25.0/src/platform_impl/linux/mod.rs:746
#13 0x000055555578ec84 in winit::platform_impl::platform::x11::EventLoop<()>::run_return<(),closure-0> (
    self=0x7ffffffe9ef0, callback=...)
    at /home/mindtree/.cargo/registry/src/github.com-1ecc6299db9ec823/winit-0.25.0/src/platform_impl/linux/x11/mod.rs:307
#14 0x000055555578fbf3 in winit::platform_impl::platform::x11::EventLoop<()>::run<(),closure-0> (self=...,
    callback=...)
    at /home/mindtree/.cargo/registry/src/github.com-1ecc6299db9ec823/winit-0.25.0/src/platform_impl/linux/x11/mod.rs:385
#15 0x00005555557b9086 in winit::platform_impl::platform::EventLoop<()>::run<(),closure-0> (self=..., callback=...)
    at /home/mindtree/.cargo/registry/src/github.com-1ecc6299db9ec823/winit-0.25.0/src/platform_impl/linux/mod.rs:662
#16 0x0000555555780fac in winit::event_loop::EventLoop<()>::run<(),closure-0> (self=..., event_handler=...)
    at /home/mindtree/.cargo/registry/src/github.com-1ecc6299db9ec823/winit-0.25.0/src/event_loop.rs:154
#17 0x00005555557718ae in all_winit_wgpu::main () at backends/conrod_wgpu/examples/all_winit_wgpu.rs:111

Here's the WIP PR of the update: https://github.com/PistonDevelopers/conrod/pull/1436. No major changes other than replacing the old GLSL and pre-compiled SPIR-V shaders with WGSL (translated from the old GLSL shaders using the current naga-cli).

I tried running with validation layers enabled like so:

VK_LAYER_KHRONOS_validation=1 RUST_BACKTRACE=1 cargo run --example all_winit_wgpu

though I received no extra output. I'm not 100% sure I have these installed, though! Does anyone know offhand if there's an easy way to check on NixOS?
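One distribution-independent way to check is to look for the layer's JSON manifest in the directories the Vulkan loader searches. The sketch below is an assumption-laden helper, not an official tool: `find_layer_manifests` and `layer_dirs_from_env` are hypothetical names, it only inspects the directories listed in `VK_LAYER_PATH` (the variable used in the SDK setup earlier in this thread), and it does a plain substring match instead of parsing the JSON. As far as I know, the loader enables explicit layers through `VK_INSTANCE_LAYERS` rather than through a variable named after the layer, so the command above may not have requested the layer at all.

```rust
use std::fs;
use std::path::PathBuf;

/// Returns the `.json` files directly under `dirs` whose text mentions `layer_name`.
/// Unreadable directories and files are skipped silently.
fn find_layer_manifests(dirs: &[PathBuf], layer_name: &str) -> Vec<PathBuf> {
    let mut found = Vec::new();
    for dir in dirs {
        let entries = match fs::read_dir(dir) {
            Ok(entries) => entries,
            Err(_) => continue,
        };
        for entry in entries.flatten() {
            let path = entry.path();
            let is_json = path.extension().map_or(false, |ext| ext == "json");
            if !is_json {
                continue;
            }
            if let Ok(text) = fs::read_to_string(&path) {
                if text.contains(layer_name) {
                    found.push(path);
                }
            }
        }
    }
    found
}

/// Directories listed in `VK_LAYER_PATH`, or an empty list if it is unset.
fn layer_dirs_from_env() -> Vec<PathBuf> {
    std::env::var_os("VK_LAYER_PATH")
        .map(|value| std::env::split_paths(&value).collect())
        .unwrap_or_default()
}

fn main() {
    let dirs = layer_dirs_from_env();
    let hits = find_layer_manifests(&dirs, "VK_LAYER_KHRONOS_validation");
    if hits.is_empty() {
        println!("no validation layer manifest found under VK_LAYER_PATH");
    }
    for path in hits {
        println!("found manifest: {}", path.display());
    }
}
```

An empty result only means the manifest is not under `VK_LAYER_PATH`; the loader also searches system directories that this sketch does not cover, and those differ on NixOS.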

I also tried each of the different present modes in case something other than FIFO worked, but no luck there.

I'm on NixOS + Gnome + Wayland + Intel Xe Graphics (only integrated).

kvark commented 3 years ago

This definitely looks related to #1673. @mitchmindtree could you run wgpu-rs examples from master on your system?

mitchmindtree commented 3 years ago

Yes, the examples on master appear to work well. I also tried with the commit that published wgpu 0.10.1 (what I'm updating to in conrod_wgpu), and they still seemed to work then too. I wonder what we're doing to trigger this in the conrod_wgpu example...

kvark commented 3 years ago

Could you be making multiple submissions per frame? Try making a single submission, just as an experiment.

mitchmindtree commented 3 years ago

Yep that seems to be it! There's an extra submission before the event loop runs to load the single image that's used in the example - if I remove that submission and let the command be submitted along with the rest of the first frame's command buffer (so that there's only one submission), the example seems to run perfectly.
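The difference between the two variants can be sketched as follows. This uses a mock queue, not wgpu itself: `MockQueue` and `CommandBuffer` are stand-ins invented for illustration, with `submit` accepting any iterator of command buffers in the same way the `queue.submit(Some(...))` and `queue.submit(std::iter::once(...))` calls earlier in this thread do. The labels "image upload" and "first frame" are likewise just placeholders.

```rust
/// Stand-in for a finished command buffer (in wgpu, the result of `encoder.finish()`).
struct CommandBuffer(&'static str);

/// Mock queue that records which buffers arrived in which submission.
#[derive(Default)]
struct MockQueue {
    submissions: Vec<Vec<&'static str>>,
}

impl MockQueue {
    /// Accepts any iterator of command buffers and records it as one submission.
    fn submit<I: IntoIterator<Item = CommandBuffer>>(&mut self, buffers: I) {
        self.submissions
            .push(buffers.into_iter().map(|buffer| buffer.0).collect());
    }
}

/// The pattern that triggered the hang: an upload submitted on its own,
/// followed by a second submission for the frame.
fn submit_separately(queue: &mut MockQueue) {
    queue.submit(Some(CommandBuffer("image upload")));
    queue.submit(Some(CommandBuffer("first frame")));
}

/// The experiment that avoided it: hold the upload back and submit it
/// together with the frame's command buffer.
fn submit_together(queue: &mut MockQueue) {
    let pending = vec![CommandBuffer("image upload"), CommandBuffer("first frame")];
    queue.submit(pending);
}

fn main() {
    let mut separate = MockQueue::default();
    submit_separately(&mut separate);
    println!("separate: {} submissions", separate.submissions.len());

    let mut together = MockQueue::default();
    submit_together(&mut together);
    println!("together: {} submissions", together.submissions.len());
}
```

Batching preserves the order of the command buffers within the single submission, so the upload still executes before the frame's commands.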

kvark commented 3 years ago

I have Intel Xe Graphics, and I see this issue myself.

kvark commented 2 years ago

Considering this fixed by #2212

Edit: to clarify, #2212 is a workaround for systems that haven't picked up the fix for https://gitlab.freedesktop.org/mesa/mesa/-/issues/5508, and it's not working very well. There is still some race condition in the driver, but at least it shows a few frames.

gfx-rs / wgpu

wgpu may lock up in ioctl on Linux/Vulkan/Intel #1672