Not sure whether this is the same issue, but I've recently tracked down something similar in the same environment (Linux (Wayland)/Vulkan/Intel). My code is based on the Vulkan tutorial, so you can just take e.g. this repo (compile in release mode due to asset loading).
In my case (and also with the linked repo) the issue is always quickly reproducible by having a YouTube video running in the background (in another Sway workspace) and switching window focus/workspaces a couple of times. The crux, though, is that the hang only happens if I set FRAMES_IN_FLIGHT to 1; it never happens with 2. Since https://github.com/gfx-rs/wgpu/issues/932 is open, I'll assume that wgpu doesn't have multiple frames in flight, so this may be the same root cause. Unfortunately I have no idea where to go from here since it's probably a driver bug...
Backtrace on my machine (the hang is inside vkQueuePresentKHR()):
[#0] 0x7f6943be559b → ioctl()
[#1] 0x7f6942ead924 → cmp eax, 0xffffffff
[#2] 0x56359223412d → ash::extensions::khr::swapchain::Swapchain::queue_present()
[#3] 0x563591feeb92 → vulkan_tutorial_ash::VulkanApp::draw_frame()
[#4] 0x563591ffd7c0 → _ZN19vulkan_tutorial_ash4main28_$u7b$$u7b$closure$u7d$$u7d$17ha17eb207c13ce696E.llvm.10835609363145980811()
[#5] 0x563592050dc8 → winit::platform_impl::platform::wayland::event_loop::EventLoop<T>::run()
[#6] 0x563591ffa1cc → winit::platform_impl::platform::EventLoop<T>::run()
[#7] 0x563591fe54ea → winit::event_loop::EventLoop<T>::run()
[#8] 0x563591fe5c7f → vulkan_tutorial_ash::main()
On the current master branch (11d31d537706f69b982465840a25244e469f471a), if I duplicate the section of code in the hello-compute example that creates a command encoder and submits work to the device:
let mut encoder =
    device.create_command_encoder(&wgpu::CommandEncoderDescriptor { label: None });
{
    let mut cpass = encoder.begin_compute_pass(&wgpu::ComputePassDescriptor { label: None });
    cpass.set_pipeline(&compute_pipeline);
    cpass.set_bind_group(0, &bind_group, &[]);
    cpass.insert_debug_marker("compute collatz iterations");
    cpass.dispatch(numbers.len() as u32, 1, 1); // Number of cells to run, the (x,y,z) size of item being processed
}
// Adds the copy operation to the command encoder.
// Will copy data from storage buffer on GPU to staging buffer on CPU.
encoder.copy_buffer_to_buffer(&storage_buffer, 0, &staging_buffer, 0, size);
// Submits command encoder for processing
queue.submit(Some(encoder.finish()));
such that it appears twice in sequence in the file, then the example hangs.
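For illustration only, here is a minimal standalone sketch of the same pattern of two back-to-back submissions on a headless device, written against roughly the wgpu 0.10-era API used in this thread and assuming the pollster crate for blocking on the async setup. This is not the actual hello-compute code, just the shape of the repro:

// Hypothetical repro sketch, not the real example: two consecutive empty
// submissions on a headless device. On an affected system, this double-submission
// pattern is what seems to trigger the hang described above.
fn main() {
    pollster::block_on(run());
}

async fn run() {
    let instance = wgpu::Instance::new(wgpu::Backends::PRIMARY);
    let adapter = instance
        .request_adapter(&wgpu::RequestAdapterOptions::default())
        .await
        .expect("no suitable adapter");
    let (device, queue) = adapter
        .request_device(&wgpu::DeviceDescriptor::default(), None)
        .await
        .expect("failed to create device");

    // First submission (wgpu may also submit its own internal command buffers).
    let encoder = device.create_command_encoder(&wgpu::CommandEncoderDescriptor { label: None });
    queue.submit(Some(encoder.finish()));

    // Second submission immediately afterwards; this is the duplicated step.
    let encoder = device.create_command_encoder(&wgpu::CommandEncoderDescriptor { label: None });
    queue.submit(Some(encoder.finish()));

    // Wait for the GPU to finish before the device is dropped.
    device.poll(wgpu::Maintain::Wait);
}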
In my own project, I found that all tests where I tried to run two (as opposed to one) compute passes on the device would hang at the end of the test, when wgpu resources are being dropped. It seemed like there was a deadlock inside vkDestroyDevice.
I don't know whether this is the same issue or a different issue.
My environment is Linux (Ubuntu 21.04, with Wayland and kernel 5.11.0-25-generic), with the following device adapter info:
AdapterInfo { name: "Intel(R) UHD Graphics (CML GT2)", vendor: 32902, device: 39882, device_type: IntegratedGpu, backend: Vulkan }
I did some further investigation into my comment above.
First of all, the problem does not occur on the v0.9 branch (commit 0084d68c6079d008502fbb5887c14713fd7a6c8d), so I think it may be a bug in wgpu-hal.
Attached are trace-level logs generated on the master branch (commit 7798534d428100145f10de1fac4bd35c38737806), either for the unmodified version of the hello-compute example (does not hang), or for the version with the modifications I mentioned above (hangs):
The interesting parts of the logs seem to be near the end. In the example that hangs:
Validation Error: [ VUID-vkResetCommandPool-commandPool-00040 ] Object 0: handle = 0x563071254d10, type = VK_OBJECT_TYPE_COMMAND_BUFFER; | MessageID = 0xb53e2331 | Attempt to reset command pool with VkCommandBuffer 0x563071254d10[] which is in use. The Vulkan spec states: All VkCommandBuffer objects allocated from commandPool must not be in the pending state (https://www.khronos.org/registry/vulkan/specs/1.2-extensions/html/vkspec.html#VUID-vkResetCommandPool-commandPool-00040)
The command buffers that cannot be freed are those created during the second compute pass, as far as I can tell from using a debugger. There are a few details I do not understand:
- There are two identical submissions in main.rs, but the first one seems to submit 3 command buffers, whereas the second submits 2 command buffers.
- When the command buffers are reset (reset_all), the input list of command buffers seems to contain 2 command buffers (for each pass).
Unfortunately I am not sufficiently familiar with the codebase to know the source of the problem.
I realise my problem is probably this issue: https://github.com/gfx-rs/wgpu/issues/1689
Thank you for the investigation. Your issue is certainly easier to reproduce.
Ok, I don't know how to approach this. I tried on 3 different machines: Linux/NV, Linux/AMD, and Windows/Intel, all on Vulkan, with no luck reproducing any bad behavior of the example as given. @bllanos could you try updating your Vulkan driver and the validation layers?
Just as a datapoint: I can reproduce @bllanos's issue on Linux/Intel (Vulkan, AdapterInfo { name: "Intel(R) HD Graphics 520 (SKL GT2)", vendor: 32902, device: 6422, device_type: IntegratedGpu, backend: Vulkan }).
I am not sure how to update the Vulkan validation layers beyond the version provided by the default Ubuntu repositories (vulkan-validationlayers:amd64 1.2.162.0-1). If it is important, I could look into it in more detail.
I updated package mesa-vulkan-drivers:amd64 from version 21.0.3-0ubuntu0.2 to version 21.3~git2108120600.513fb5~oibaf~h using the following PPA: https://launchpad.net/~oibaf/+archive/ubuntu/graphics-drivers. I believe the latest Mesa release is 21.2.0 (https://archive.mesa3d.org//).
The issue still occurs on the latest master branch (f0520f8c5416362f291a3e5a3cbc547918d2b98d) with the updated driver.
@bllanos you can update by getting the SDK here: https://vulkan.lunarg.com/.
I am seeing a similar hang in ioctl, although I cannot reproduce @bllanos's version with hello-compute, so I'm not sure if I'm seeing the same bug. I'm also on Linux with the Mesa driver. Here's a minimized version:
use winit::{
    event::{Event, WindowEvent},
    event_loop::ControlFlow,
};

struct Framework {
    device: wgpu::Device,
    queue: wgpu::Queue,
    sc_desc: wgpu::SurfaceConfiguration,
    surface: wgpu::Surface,
}

impl Framework {
    async fn new(window: &winit::window::Window) -> Framework {
        let backend = wgpu::util::backend_bits_from_env().unwrap_or(wgpu::Backends::PRIMARY);
        let instance = wgpu::Instance::new(backend);
        let size = window.inner_size();
        let surface = unsafe { instance.create_surface(window) };
        let adapter = wgpu::util::initialize_adapter_from_env_or_default(&instance, backend)
            .await
            .expect("No suitable GPU adapters found on the system!");
        let features = wgpu::Features::default();
        let trace_dir = std::env::var("WGPU_TRACE");
        let limits = adapter.limits();
        let (device, queue) = adapter
            .request_device(
                &wgpu::DeviceDescriptor {
                    features,
                    limits,
                    label: None,
                },
                trace_dir.ok().as_ref().map(std::path::Path::new),
            )
            .await
            .expect("Unable to find a suitable GPU adapter!");
        let format = surface.get_preferred_format(&adapter).unwrap();
        let sc_desc = wgpu::SurfaceConfiguration {
            usage: wgpu::TextureUsages::RENDER_ATTACHMENT,
            format,
            width: size.width,
            height: size.height,
            present_mode: wgpu::PresentMode::Mailbox,
        };
        // Removing this command encoder fixes the hang
        let init_encoder =
            device.create_command_encoder(&wgpu::CommandEncoderDescriptor { label: None });
        queue.submit(Some(init_encoder.finish()));
        Framework {
            device,
            queue,
            sc_desc,
            surface,
        }
    }

    fn handle_event<T>(
        &mut self,
        window: &winit::window::Window,
        event: Event<'_, T>,
        control_flow: &mut ControlFlow,
    ) {
        match event {
            Event::MainEventsCleared => {
                window.request_redraw();
            }
            Event::WindowEvent {
                event: WindowEvent::Resized(size),
                ..
            } => {
                self.sc_desc.width = if size.width == 0 { 1 } else { size.width };
                self.sc_desc.height = if size.height == 0 { 1 } else { size.height };
                self.surface.configure(&self.device, &self.sc_desc)
            }
            Event::WindowEvent {
                event: WindowEvent::CloseRequested,
                ..
            } => *control_flow = ControlFlow::Exit,
            Event::RedrawRequested(_) => {
                let surface = &self.surface;
                let frame = match surface.get_current_frame() {
                    Ok(frame) => frame,
                    Err(_) => {
                        self.surface.configure(&self.device, &self.sc_desc);
                        surface
                            .get_current_frame()
                            .expect("Failed to acquire next surface texture!")
                    }
                };
                let encoder = self
                    .device
                    .create_command_encoder(&wgpu::CommandEncoderDescriptor { label: None });
                self.queue.submit(std::iter::once(encoder.finish()));
                dbg!("Hangs after here");
            }
            _ => {}
        }
        dbg!("Hangs before here");
    }
}

fn main() {
    let event_loop = winit::event_loop::EventLoop::new();
    let mut builder = winit::window::WindowBuilder::new();
    builder = builder.with_title("test");
    #[cfg(windows_OFF)]
    {
        use winit::platform::windows::WindowBuilderExtWindows;
        builder = builder.with_no_redirection_bitmap(true);
    }
    let window = builder.build(&event_loop).unwrap();
    let mut framework = pollster::block_on(Framework::new(&window));
    framework
        .surface
        .configure(&framework.device, &framework.sc_desc);
    event_loop.run(move |event, _, control_flow| {
        *control_flow = if cfg!(feature = "metal-auto-capture") {
            ControlFlow::Exit
        } else {
            ControlFlow::Poll
        };
        framework.handle_event(&window, event, control_flow);
    });
}
I can reproduce @tfgast's issue (and its workaround, when I comment out the encoder as mentioned in the code sample). @tfgast "my" issue seems more like #1689, but I started the conversation on this thread before I came to that conclusion.
@cwfitzgerald I set up the LunarG SDK as described at https://vulkan.lunarg.com/doc/sdk/1.2.182.0/linux/getting_started.html (skipping the step on copying files to system directories, which I think is not necessary based on the instructions given by ash's author here). I have the following environment variables in my shell startup file:
export VULKAN_SDK=$HOME...vulkan/1.2.182.0/x86_64
export PATH="$VULKAN_SDK/bin:$PATH"
export LD_LIBRARY_PATH="$VULKAN_SDK/lib:$LD_LIBRARY_PATH"
export VK_LAYER_PATH="$VULKAN_SDK/etc/vulkan/explicit_layer.d:$VK_LAYER_PATH"
I re-tested my issue on commit 8f02b73655aff641361822a8ac0347fc47622b49, running
RUST_LOG=trace cargo run --example hello-compute --features trace &> hello_compute.log
to generate the log file in the attachment. I ran cargo clean beforehand, just in case.
hello_compute.zip
The attachment includes the example's logging output, trace files, the Rust code I was running (mentioned above), and the output of vkvia and vulkaninfo.
Looks similar to #1878
[2021-08-20T14:05:24Z ERROR wgpu_hal::vulkan::instance] VALIDATION [VUID-vkResetCommandPool-commandPool-00040 (0xb53e2331)] Validation Error: [ VUID-vkResetCommandPool-commandPool-00040 ] Object 0: handle = 0x55b56e35f690, name = _Transit, type = VK_OBJECT_TYPE_COMMAND_BUFFER; | MessageID = 0xb53e2331 | Attempt to reset command pool with VkCommandBuffer 0x55b56e35f690[_Transit] which is in use. The Vulkan spec states: All VkCommandBuffer objects allocated from commandPool must not be in the pending state (https://vulkan.lunarg.com/doc/view/1.2.182.0/linux/1.2-extensions/vkspec.html#VUID-vkResetCommandPool-commandPool-00040)
For anyone who can reproduce this, do you have a dual-GPU configuration with NVidia by any chance?
I wonder if it's related to https://github.com/gfx-rs/wgpu/pull/1898
I do have a dual-GPU with NVidia.
I only have Intel integrated graphics.
I only have an integrated GPU, and it's possible to reproduce it by changing the hello-compute example: I only need to submit an empty command buffer before execute_gpu_inner. The validation errors happen after 5 seconds, which is the CLEANUP_WAIT_MS.
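In case it helps others reproduce, a rough sketch of what that modification looks like, assuming the device and queue already created by the example; the helper name below is made up for illustration:

// Hypothetical helper showing the change described above: submit an empty
// command buffer before running the example's real work (execute_gpu_inner).
fn submit_empty_command_buffer(device: &wgpu::Device, queue: &wgpu::Queue) {
    let encoder = device.create_command_encoder(&wgpu::CommandEncoderDescriptor { label: None });
    queue.submit(Some(encoder.finish()));
}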
I think I might have run into this while updating conrod_wgpu's wgpu dep from 0.9 to 0.10. Specifically, the hang occurs at the end of the first RedrawRequested event during the Drop implementation for SurfaceFrame.
Here's the backtrace from gdb at the moment of hanging:
(gdb) backtrace
#0 0x00007ffff7d1fb07 in ioctl () from /nix/store/gk42f59363p82rg2wv2mfy71jn5w4q4c-glibc-2.32-48/lib/libc.so.6
#1 0x00007fffe2a956c0 in anv_gem_syncobj_timeline_wait ()
from /nix/store/85hbpjblyvgg9k9vvirqk69r8qb1k5dl-mesa-21.1.4-drivers/lib/libvulkan_intel.so
#2 0x00007fffe2ad2d19 in anv_QueuePresentKHR ()
from /nix/store/85hbpjblyvgg9k9vvirqk69r8qb1k5dl-mesa-21.1.4-drivers/lib/libvulkan_intel.so
#3 0x0000555555feed00 in ash::vk::extensions::KhrSwapchainFn::queue_present_khr (self=0x55555710f280, queue=...,
p_present_info=0x7ffffffe6f60)
at /home/mindtree/.cargo/registry/src/github.com-1ecc6299db9ec823/ash-0.33.3+1.2.191/src/vk/extensions.rs:566
#4 0x0000555555fe2acd in ash::extensions::khr::swapchain::Swapchain::queue_present (self=0x55555710f278, queue=...,
create_info=0x7ffffffe6f60)
at /home/mindtree/.cargo/registry/src/github.com-1ecc6299db9ec823/ash-0.33.3+1.2.191/src/extensions/khr/swapchain.rs:91
#5 0x0000555555eeceac in wgpu_hal::vulkan::{{impl}}::present (self=0x55555710f270, surface=0x555556e2e840,
texture=...)
at /home/mindtree/.cargo/registry/src/github.com-1ecc6299db9ec823/wgpu-hal-0.10.4/src/vulkan/mod.rs:531
#6 0x0000555555b474e7 in wgpu_core::hub::Global<wgpu_core::hub::IdentityManagerFactory>::surface_present<wgpu_core::hub::IdentityManagerFactory,wgpu_hal::vulkan::Api> (self=0x555556e29640, surface_id=...)
at /home/mindtree/.cargo/registry/src/github.com-1ecc6299db9ec823/wgpu-core-0.10.2/src/present.rs:243
#7 0x0000555555c3169b in wgpu::backend::direct::{{impl}}::surface_present (self=0x555556e29640,
texture=0x7ffffffe8040, detail=0x7ffffffe8058)
at /home/mindtree/.cargo/registry/src/github.com-1ecc6299db9ec823/wgpu-0.10.1/src/backend/direct.rs:929
#8 0x0000555555cac720 in wgpu::{{impl}}::drop (self=0x7ffffffe8038)
at /home/mindtree/.cargo/registry/src/github.com-1ecc6299db9ec823/wgpu-0.10.1/src/lib.rs:3069
#9 0x0000555555c600e7 in core::ptr::drop_in_place<wgpu::SurfaceTexture> ()
at /nix/store/r218w4jqf2yl6whglfpq0kz61yjn1jhz-rust-default-1.53.0/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:192
#10 0x0000555555767f8b in core::ptr::drop_in_place<wgpu::SurfaceFrame> ()
at /nix/store/r218w4jqf2yl6whglfpq0kz61yjn1jhz-rust-default-1.53.0/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:192
#11 0x0000555555772f31 in all_winit_wgpu::main::{{closure}} (event=..., control_flow=0x7ffffffe8b40)
at backends/conrod_wgpu/examples/all_winit_wgpu.rs:266
#12 0x00005555557b921e in winit::platform_impl::platform::sticky_exit_callback<(),closure-0> (evt=...,
target=0x555556d86970, control_flow=0x7ffffffe8b40, callback=0x7ffffffe9338)
at /home/mindtree/.cargo/registry/src/github.com-1ecc6299db9ec823/winit-0.25.0/src/platform_impl/linux/mod.rs:746
#13 0x000055555578ec84 in winit::platform_impl::platform::x11::EventLoop<()>::run_return<(),closure-0> (
self=0x7ffffffe9ef0, callback=...)
at /home/mindtree/.cargo/registry/src/github.com-1ecc6299db9ec823/winit-0.25.0/src/platform_impl/linux/x11/mod.rs:307
#14 0x000055555578fbf3 in winit::platform_impl::platform::x11::EventLoop<()>::run<(),closure-0> (self=...,
callback=...)
at /home/mindtree/.cargo/registry/src/github.com-1ecc6299db9ec823/winit-0.25.0/src/platform_impl/linux/x11/mod.rs:385
#15 0x00005555557b9086 in winit::platform_impl::platform::EventLoop<()>::run<(),closure-0> (self=..., callback=...)
at /home/mindtree/.cargo/registry/src/github.com-1ecc6299db9ec823/winit-0.25.0/src/platform_impl/linux/mod.rs:662
#16 0x0000555555780fac in winit::event_loop::EventLoop<()>::run<(),closure-0> (self=..., event_handler=...)
at /home/mindtree/.cargo/registry/src/github.com-1ecc6299db9ec823/winit-0.25.0/src/event_loop.rs:154
#17 0x00005555557718ae in all_winit_wgpu::main () at backends/conrod_wgpu/examples/all_winit_wgpu.rs:111
Here's the WIP PR of the update: https://github.com/PistonDevelopers/conrod/pull/1436. No major changes other than replacing the old GLSL and pre-compiled SPIR-V shaders with WGSL (translated from the old GLSL shaders using the current naga-cli).
I tried running with validation layers enabled like so:
VK_LAYER_KHRONOS_validation=1 RUST_BACKTRACE=1 cargo run --example all_winit_wgpu
though I received no extra output. I'm not 100% sure I have these installed though! Anyone know offhand if there's an easy way to check on NixOS?
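(For what it's worth, one rough way to check which layers the Vulkan loader can actually see, independent of the distro, is to enumerate them with the ash crate. A minimal sketch, assuming a recent ash version and not part of wgpu or the example above; VK_LAYER_KHRONOS_validation should appear in the output if the validation layers are installed:)

// Hypothetical check: list the instance layers the Vulkan loader reports.
use std::ffi::CStr;

fn main() {
    let entry = unsafe { ash::Entry::load() }.expect("failed to load the Vulkan loader");
    let layers = unsafe { entry.enumerate_instance_layer_properties() }
        .expect("failed to enumerate instance layers");
    for layer in layers {
        // layer_name is a fixed-size, NUL-terminated C string.
        let name = unsafe { CStr::from_ptr(layer.layer_name.as_ptr()) };
        println!("{}", name.to_string_lossy());
    }
}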
I also tried each of the different present modes in case something other than FIFO worked, but no luck there.
I'm on NixOS + Gnome + Wayland + Intel Xe Graphics (only integrated).
This definitely looks related to #1673. @mitchmindtree could you run wgpu-rs examples from master on your system?
Yes, the examples on master appear to work well. I also tried with the commit that published wgpu 0.10.1 (what I'm updating to in conrod_wgpu) and it seems they still worked then too. I wonder what we're doing to trigger this in the conrod_wgpu example...
Could you be making multiple submissions per frame? Try making a single submission, just for experiment.
Yep that seems to be it! There's an extra submission before the event loop runs to load the single image that's used in the example - if I remove that submission and let the command be submitted along with the rest of the first frame's command buffer (so that there's only one submission), the example seems to run perfectly.
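In case it's useful to anyone else hitting this, the shape of the fix is roughly the following (the function and closure parameters below are illustrative, not conrod's actual code): record the one-off upload into the same encoder as the frame's commands so there is only one queue.submit per frame.

// Hypothetical sketch of "one submission per frame": fold the one-off image
// upload into the same encoder as the frame's render commands.
fn render_first_frame(
    device: &wgpu::Device,
    queue: &wgpu::Queue,
    record_upload: impl FnOnce(&mut wgpu::CommandEncoder),
    record_frame: impl FnOnce(&mut wgpu::CommandEncoder),
) {
    let mut encoder =
        device.create_command_encoder(&wgpu::CommandEncoderDescriptor { label: None });
    record_upload(&mut encoder); // e.g. the image upload that used to be its own submit
    record_frame(&mut encoder); // the usual per-frame commands
    queue.submit(Some(encoder.finish())); // single submission
}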
I have Intel Xe Graphics, and I see this issue myself.
Considering this fixed by #2212.
Edit: to clarify, #2212 is a workaround for systems that haven't updated to a Mesa version with the fix for https://gitlab.freedesktop.org/mesa/mesa/-/issues/5508. And it's not working very well: there is still some race condition in the driver, but at least it shows a few frames.
Description
Under certain scenarios, we may see a hang in ioctl.

Repro steps
Unknown.

Expected vs observed behavior
No hangs.

Extra materials
Looks like this is described in https://www.reddit.com/r/vulkan/comments/b37762/command_queue_grows_indefinitely_on_intel_gpus/
Edit: actually, no, we aren't expecting vkAcquireNextImageKHR to block. We are explicitly blocking on the fence, which was passed to it, instead.

Platform
wgpu master