Closed buzmeg closed 5 years ago
Interesting, so it looks like we keep losing the surface (or at least the Metal backend thinks we do) and re-create the swapchain. I've been only running in 10.13 and 10.14 during development, and I wonder if there is a big difference with regards to CAMetalLayer
between 10.12 and then. See #2541 being resolved by just updating the OS.
I think that doing an OS upgrade is just papering over a race condition in the code by reordering operations.
The issue is down in window.rs in this bit of code:
println!("Frame: {:?}", frame.drawable);
match frame.drawable.take() {
Some(drawable) => {
frame.available = true;
Ok(drawable)
}
None => {
frame.linked = false;
Err(())
}
}
frame.drawable gets coughed up as None, and that sets off a whole chain of events.
I'm not convinced that None is always an error. (ie. can it be None because the drawable is currently queued for rendering but hasn't been unlinked yet?). I'm also a little concerned by the fact that the code presents three frames in quick succession for no obvious reason (no resize, no mouse event, etc) -- it presents frame 1, frame 0, frame 1, and frame 0 at which point it hiccups and requests a new swap chain.
Unfortunately, this is getting far to deep for my beginner level knowledge of Rust.
That's pretty good analysis! Let me read up the code once more and describe what the logic is supposed to be.
On Jan 10, 2019, at 19:36, buzmeg notifications@github.com wrote:
I think that doing an OS upgrade is just papering over a race condition in the code by reordering operations.
The issue is down in window.rs in this bit of code:
println!("Frame: {:?}", frame.drawable); match frame.drawable.take() { Some(drawable) => { frame.available = true; Ok(drawable) } None => { frame.linked = false; Err(()) } }
frame.drawable gets coughed up as None, and that sets off a whole chain of events.
I'm not convinced that None is always an error. (ie. can it be None because the drawable is currently queued for rendering but hasn't been unlinked yet?). I'm also a little concerned by the fact that the code presents three frames in quick succession for no obvious reason (no resize, no mouse event, etc) -- it presents frame 1, frame 0, frame 1, and frame 0 at which point it hiccups and requests a new swap chain.
Unfortunately, this is getting far to deep for my beginner level knowledge of Rust.
— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.
@buzmeg
I think that doing an OS upgrade is just papering over a race condition in the code by reordering operations.
and to clarify, I'm not suggesting an upgrade. I'm just excusing myself for not seeing this because I was running 10.13+ for a long time.
I'm not convinced that None is always an error. (ie. can it be None because the drawable is currently queued for rendering but hasn't been unlinked yet?)
The reason for it being implemented this way is that we need to return CAMetalDrawable
as soon as possible to CoreAnimation. The frames are not persistent at their API level (unfortunately! that's a major pain for us), but internally they have a persistent-ish queue of frames, size of which we can more or less control.
So what Metal wants us to do is: get new frame, render it, return it, etc. But what we are actually doing - pretending that the frames are known, and when the next frame is returned by CoreAnimation, we compare it to all the ones we have to figure out what index it is. This is hacky, but it's the only solution I could see, and it seems to work in most cases (Dota2, VkQuake, Filament, etc).
If we don't invalidate the drawable on present()
, then when do we ever invalidate it?..
@buzmeg please re-test with/after #2619
Is this related to the swapchain issue in wgpu?
@seivan could be. We'll know for sure when this issue is resolved (pending confirmation).
Just tested today. This problem is still extant.
@buzmeg sorry about not addressing this. Rolling back an OS version on Mac is rather trouble-some. I hope to get my hands on a device with 10.12 installed to investigate. If anyone has that and can look into it, that would be great!
Mstange has 10.12
On Thu, Mar 21, 2019, 10:46 PM Dzmitry Malyshau notifications@github.com wrote:
@buzmeg https://github.com/buzmeg sorry about not addressing this. Rolling back an OS version on Mac is rather trouble-some. I hope to get my hands on a device with 10.12 installed to investigate. If anyone has that and can look into it, that would be great!
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/gfx-rs/gfx/issues/2563#issuecomment-475475137, or mute the thread https://github.com/notifications/unsubscribe-auth/AAUTbQWmHRCUnhRx1bO56zzljs8eaZ2Fks5vZEQIgaJpZM4Z5Fi- .
Just wanted to report that this example doesn't even compile on macOS 10.11:
$ cargo run --bin quad --features metal
Compiling hal-examples v0.1.0 (/Users/oliver/github/gfx/examples)
error: linking with `cc` failed: exit code: 1
|
= note: Undefined symbols for architecture x86_64:
"_kdebug_signpost", referenced from:
gfx_backend_metal::native::Signpost::place::h3b23ab3b2732ed2b in libgfx_backend_metal-0929485cdb8a249d.rlib(gfx_backend_metal-0929485cdb8a249d.2q8alb1fnsqucsuc.rcgu.o)
"_kdebug_signpost_start", referenced from:
gfx_backend_metal::native::Signpost::new::h2e5d7e1612625c03 in libgfx_backend_metal-0929485cdb8a249d.rlib(gfx_backend_metal-0929485cdb8a249d.2q8alb1fnsqucsuc.rcgu.o)
"_kdebug_signpost_end", referenced from:
_$LT$gfx_backend_metal..native..Signpost$u20$as$u20$core..ops..drop..Drop$GT$::drop::hdb91ab91b1828916 in libgfx_backend_metal-0929485cdb8a249d.rlib(gfx_backend_metal-0929485cdb8a249d.2q8alb1fnsqucsuc.rcgu.o)
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
error: aborting due to previous error
error: Could not compile `hal-examples`.
I tried commenting out those functions in gfx/src/backend/metal/src/native.rs
to see what happens, and then it links ok but panics when run:
AdapterInfo { name: "Intel Iris Pro Graphics", vendor: 0, device: 0, device_type: IntegratedGpu }
Memory types: [MemoryType { properties: DEVICE_LOCAL, heap_index: 0 }, MemoryType { properties: CPU_VISIBLE | COHERENT, heap_index: 1 }, MemoryType { properties: DEVICE_LOCAL | CPU_VISIBLE, heap_index: 1 }, MemoryType { properties: DEVICE_LOCAL | CPU_VISIBLE | CPU_CACHED, heap_index: 1 }]
formats: Some([Bgra8Unorm, Bgra8Srgb, Rgba16Sfloat])
SwapchainConfig { present_mode: Fifo, composite_alpha: OPAQUE, format: Bgra8Srgb, extent: Extent2D { width: 2048, height: 1536 }, image_count: 2, image_layers: 1, image_usage: COLOR_ATTACHMENT }
thread 'main' panicked at 'platform does not support specialization', src/backend/metal/src/device.rs:80:9
note: Run with `RUST_BACKTRACE=1` environment variable to display a backtrace.```
@datadog23 thank you! Signposts should be easy to disable on OSX prior to 10.12. They are not required to function. Is that something you'd like to help with? You could start with looking at https://github.com/gfx-rs/gfx/blob/1ede66913f839d534deab59648c42f58a1c392dd/src/backend/metal/src/native.rs#L897
@datadog23 Also to work around specialization constants (at least temporarily), I think you should be able to change:
specialization: hal::spec_const_list![0.8f32],
(https://github.com/gfx-rs/gfx/blob/master/examples/quad/main.rs#L546)
to
specialization: pso::Specialization::default(),
If this doesn't work you might need to adjust the quad vertex shader too.
@kvark I don't know what the purpose of the signposts is, they seem to be used for swapchain debugging in window.rs
? And I don't know how to conditionally not compile them based on OS version.
@grovesNL Changing line 546 gets me the same symptoms as the original issue above, a blank white window with this in the terminal:
AdapterInfo { name: "Intel Iris Pro Graphics", vendor: 0, device: 0, device_type: IntegratedGpu }
Memory types: [MemoryType { properties: DEVICE_LOCAL, heap_index: 0 }, MemoryType { properties: CPU_VISIBLE | COHERENT, heap_index: 1 }, MemoryType { properties: DEVICE_LOCAL | CPU_VISIBLE, heap_index: 1 }, MemoryType { properties: DEVICE_LOCAL | CPU_VISIBLE | CPU_CACHED, heap_index: 1 }]
formats: Some([Bgra8Unorm, Bgra8Srgb, Rgba16Sfloat])
SwapchainConfig { present_mode: Fifo, composite_alpha: OPAQUE, format: Bgra8Srgb, extent: Extent2D { width: 2048, height: 1536 }, image_count: 2, image_layers: 1, image_usage: COLOR_ATTACHMENT }
resized to LogicalSize { width: 1024.0, height: 768.0 }
SwapchainConfig { present_mode: Fifo, composite_alpha: OPAQUE, format: Bgra8Srgb, extent: Extent2D { width: 2048, height: 1536 }, image_count: 2, image_layers: 1, image_usage: COLOR_ATTACHMENT }
..the last line repeating forever.
I don't even understand why a 'specialization constant' was used (assuming it's the same as this?), there's only a single float in the vertex shader and it's a 2D scaling factor. If it needs to change why not pass a uniform, instead of having to recompile the shader? I'll have another look when I understand more about Vulkan.
I think we'd ideally be able to put the externs for signposts and anywhere they're used behind cfg
based on OS version. I'm not sure of the options here though.
Changing line 546 gets me the same symptoms as the original issue above
Thanks for the update, that's useful to know.
I don't even understand why a 'specialization constant' was used
Yeah it's just used to demonstrate how to use specialization constants in gfx-hal. In this case there's no real reason the value couldn't be hardcoded into the shader or provided as a uniform instead.
For signposts, the easiest way to proceed would be covering all of them in a new feature (#[cfg(feature = "signpost")]
that is not default but enabled on CI.
I don't know how to add optional features like that: although cargo --help
run in the examples dir says it can take a space-separated list, all the features it currently supports are mutually exclusive and it only acts on the first one. The --cfg
flag @grovesNL links to only seems to work with rustc
, and I don't know which Cargo.toml
's [features]
section to edit, or how.
Line 25 in backend/metal/build.rs
seems to be detecting the macOS version number somehow, but no idea how to use that to set a "no-signposts" feature either.
ETA: I've just found "The Cargo Book", that might help!
@datadog23 you can check auto-capture feature for an example.
I don't know how to submit edits.
How would one enable the signpost (or auto-capture) feature, btw? cargo run --bin quad --features metal signpost
doesn't seem to reenable the things I've cfg'd out, nor --features="metal signpost"
@datadog23 you can file a pill request here: click "Fork" on Github, check out the fork, make you change in a branch, then push into your fork branch. A button will appear here to create a PR from it.
How would one enable the signpost (or auto-capture) feature, btw?
I think this should work:
cargo run --bin quad --features metal,gfx-backend-metal/signpost
Just pulled gfx again. Still not working with same SwapchainConfig message. As a side note, the vulkan version fails on OS X:
$ cargo run --bin quad --features=vulkan
Compiling gfx-backend-vulkan v0.2.0 (/Users/andrewl/rust/gfx/src/backend/vulkan)
error[E0463]: can't find crate for `x11`
--> src/backend/vulkan/src/lib.rs:22:1
|
22 | extern crate x11;
| ^^^^^^^^^^^^^^^^^ can't find crate
Thanks.
As a side note, the vulkan version fails on OS X
Maybe because OSX doesn't have Vulkan?
I was puzzled by this too -- the MoltenVK examples all run, so why doesn't gfx-hal just fall back to that? I was expecting invoking --features vulkan
on macOS would do the obvious thing, not complain about not being run on linux.
Um, then, perchance you want to ask the LunarG guys why they have a Vulkan SDK available for download for OS X here: https://vulkan.lunarg.com/sdk/home
I will open a separate issue rather than continuing to clutter this one as gfx still won't run the demos on my OS X machine either with Metal or Vulkan.
Separate issue opened: https://github.com/gfx-rs/gfx/issues/2796
Vulkan is not natively available on Apple systems. LunarG SDK is just a piece of software that's not even tightly controlled by the Vulkan working group.
That is to say, if users want to run their applications via that SDK, they are free to do so. I don't see agood reason why we'd need to support it in the examples though.
On Jun 3, 2019, at 05:21, buzmeg notifications@github.com wrote:
Um, then, perchance you want to ask the LunarG guys why they have a Vulkan SDK available for download for OS X here: https://vulkan.lunarg.com/sdk/home
I will open a separate issue rather than continuing to clutter this one as gfx still won't run the demos on my OS X machine either with Metal or Vulkan.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.
Ran into this issue as well on master
(as of 1095b2b1170f9e4342d6f8a73a9eda96d542e0a6
)
MacOS: 10.14.5 Graphics: Intel Iris Plus Graphics 650 1536 MB 2017 MBP
❯ git rev-parse HEAD [13:40:29]
1095b2b1170f9e4342d6f8a73a9eda96d542e0a6
❯ cargo run --bin quad --features metal [13:40:32]
Finished dev [unoptimized + debuginfo] target(s) in 0.25s
Running `/Users/mstone/dev/gfx/target/debug/quad`
AdapterInfo { name: "Intel(R) Iris(TM) Plus Graphics 650", vendor: 0, device: 0, device_type: IntegratedGpu }
Memory types: [MemoryType { properties: DEVICE_LOCAL, heap_index: 0 }, MemoryType { properties: CPU_VISIBLE | COHERENT, heap_index: 1 }, MemoryType { properties: DEVICE_LOCAL | CPU_VISIBLE, heap_index: 1 }, MemoryType { properties: DEVICE_LOCAL | CPU_VISIBLE | CPU_CACHED, heap_index: 1 }]
formats: Some([Bgra8Unorm, Bgra8Srgb, Rgba16Sfloat])
SwapchainConfig { present_mode: Fifo, composite_alpha: OPAQUE, format: Bgra8Srgb, extent: Extent2D { width: 2048, height: 1536 }, image_count: 2, image_layers: 1, image_usage: COLOR_ATTACHMENT }
resized to LogicalSize { width: 1024.0, height: 768.0 }
SwapchainConfig { present_mode: Fifo, composite_alpha: OPAQUE, format: Bgra8Srgb, extent: Extent2D { width: 2048, height: 1536 }, image_count: 2, image_layers: 1, image_usage: COLOR_ATTACHMENT }
SwapchainConfig { present_mode: Fifo, composite_alpha: OPAQUE, format: Bgra8Srgb, extent: Extent2D { width: 2048, height: 1536 }, image_count: 2, image_layers: 1, image_usage: COLOR_ATTACHMENT }
SwapchainConfig { present_mode: Fifo, composite_alpha: OPAQUE, format: Bgra8Srgb, extent: Extent2D { width: 2048, height: 1536 }, image_count: 2, image_layers: 1, image_usage: COLOR_ATTACHMENT }
SwapchainConfig { present_mode: Fifo, composite_alpha: OPAQUE, format: Bgra8Srgb, extent: Extent2D { width: 2048, height: 1536 }, image_count: 2, image_layers: 1, image_usage: COLOR_ATTACHMENT }
SwapchainConfig
messages repeats indefinitely.
I originally found this on my own renderer and thought I'd gotten something wrong. Tried it on the quad example as a sanity check. Resulted in the same issue and then I found this GitHub issue.
Thanks for the info @bigmstone ! We don't have a fix in the process, but we are looking at the issue. Would you want to try running any of wgpu-rs examples to confirm if the same issue is observed there?
@kvark Yeah, I can check -- My renderer is a side project for me so I won't be able to dedicate a normal amount of time to it. I will look into the root cause and see if I can drum something up when I'm working on this. (Mostly nights/weekends)
Re: wgpu-rs examples -- they appear to be working as expected. Frame-rate of shadow
does not look like it's recreating the swapchain every 3rd frames, and I don't see any output. Haven't examined the code, but I'll look at that tonight.
Okay, I had a bit of a chance to look at this over the last couple of nights. Here's where I'm at/what I'm seeing.
Disclaimer: I'm a bit ignorant to pretty much all things graphics...so...I might be getting all this wrong. 😅
So I traced back through a typical failure from the logs. When we submit a queue we don't return a Result
which doesn't exactly match up to vkQueueSubmit
. This is where the error first manifests itself. Might consider changing gfx_hal::queue::CommandQueue
's submit
method to return a Result
to catch the error sooner?
@buzmeg observed take_drawable
failing, but I believe this is operating as designed. If there's no drawable to take then it should bail out. I think None
is always an error because the fence should prevent the race described. I haven't fully traced that, so take with a grain of salt, but my next observation I think further solidifies this for me.
I opened with the initial error happening during queue submission. During queue submission the trace is:
at window.rs::gfx_backend_metal::window::SurfaceInner::next_frame::_$u7b$$u7b$closure$u7d$$u7d$::h655ca9c47527b2ae:127
at autorelease.rs::objc::rc::autorelease::autoreleasepool::hb806d60b31173f4f:29
at window.rs::gfx_backend_metal::window::SurfaceInner::next_frame::hb99c61ac413b95af:90
* at window.rs::gfx_backend_metal::window::SwapchainImage::wait_until_ready::h384367e4c220bed9:275
at command.rs::gfx_backend_metal::command::CommandQueue::wait::h4ead9b298d75fb5f:1894
at command.rs::_$LT$gfx_backend_metal..command..CommandQueue$u20$as$u20$gfx_hal..queue..RawCommandQueue$LT$gfx_backend_metal..Backend$GT$$GT$::submit::h4f406aebd81e160d:1923
at mod.rs::gfx_hal::queue::CommandQueue$LT$B$C$C$GT$::submit::h6590d0441c153524:138
at main.rs::quad::main::hf1c854eb28eb897a:799
at rt.rs::std::rt::lang_start::_$u7b$$u7b$closure$u7d$$u7d$::hf4d59476fdad3709:64
at rt.rs::do_call<closure,i32> [inlined] {{closure}}:49
at panicking.rs::do_call<closure,i32>:293
at lib.rs::__rust_maybe_catch_panic:87
at panicking.rs::lang_start_internal [inlined] try<i32,closure>:272
at panic.rs::lang_start_internal [inlined] catch_unwind<closure,i32>:388
at rt.rs::lang_start_internal:48
at rt.rs::std::rt::lang_start::hdac7e6390caeec42:64
During next_frame
we're iterating over the fames in the swapchain looking for a texture received from the call to Metal's texture
based off the drawable we get on the line above. If we break there we can inspect the swapchain's frames and the texture we received from the texture
call to the drawable.
(&[gfx_backend_metal::window::Frame]) frames = &[
Frame {
texture: Texture(&0x10280da00)
},
Frame {
texture: Texture(&0x10303c400)
},
Frame {
texture: Texture(&0x102816a00)
},
Frame {
texture: Texture(&0x102815800)
}]
(metal::drawable::DrawableRef *) drawable = &0x101d5b980
Note: I've removed everything from the frame except the frame's texture to make it easier to parse.
As you can tell the texture isn't anywhere in the swapchain's frames, and since they don't match we (silently) fail and move on to presentation. Since we never had a chance to set the drawable of the frame take_drawable
is going to fail which results in swapchain recreation and the missed render.
Interestingly when I inspect this with Xcode's profiler (just found this, and wowah it's pretty cool) I see this strange empty
call to Metal just before every swapchain recreation.
So the question I'm asking myself now is, "Why are these textures different?" So I'll continue from there unless someone has any insights on if I'm mistaken or know the issue given my observations.
This is some great investigation @bigmstone ! The logic of the metal backend is by far not simple, and you were able to get most of the details correctly.
I see this strange empty call to Metal just before every swapchain recreation.
We do a dummy present on the first grabbed image of the new swapchain here. Is that the one you are seeing?
So the question I'm asking myself now is, "Why are these textures different?"
The main problem with our approach today is that the way the internal CAImageQueue
is recycling images appears to be highly driver/os dependent. I tried to reverse-engineer it with Ghidra but didn't find all the answers...
I wish we had a more robust approach here. The only known alternative is what MoltenVK is doing: dererred-recording of all the command buffers and resolving frame textures at the time a command buffer is conceptually submitted - this is where it starts being translated into the native metal command buffer... I don't want to go this path, since I consider immediate recording to be one of our strongest advantage and a differentiating factor. Some more info on immediate vs deferred is on our blog.
If you have any ideas on how we can make this better, or sketch out a completely different scheme, that would help tremendously.
I think we'd be best off sketching out a completely different scheme. My first instinct is to modify create_swapchain
so that it no longer returns the swapchain images for you, but instead only returns the swapchain. acquire_image
would be modified to return the actual swapchain image for you. For the Metal backend, this would essentially be a pass-through to nextDrawable
. For the Vulkan backend, this would involve calling vkGetSwapchainImagesKHR
and vkAcquireNextImageKHR
.
A problem with this is that we need to know the swapchain images in advance for us to create the framebuffers in Vulkan (as for each swapchain image we require a framebuffer where it's used as an attachment). I'm not sure what the best approach here would be. Could gfx-hal manage framebuffers for you automatically? This would involve adding new parameters to swapchain creation for information about the render pass and attachments for each framebuffer, and perhaps a variant could be used for attachments that refer to the current swapchain image.
Does any of this make sense?
@aleksijuvani it does make sense, and this would indeed be (rather trivially) the least common denominator API between the backends. Main problem is Vulkan Portability, which is a big important use case for us. Metal model doesn't map to it very well (as we can see here...).
One way forward could be for gfx-hal to provide both was of managing the swapchain. Just like Vulkan swapchain API is just an extension, we could have multiple variants supported (Vulkan and Metal styles). There are still some details missing about how the Metal model would work, given that the rest of the API is Vulkan-like.
Another possibility that I just thought of is to have gfx-hal only provide the Metal swapchain model, while the awkwardness of Vulkan -> Metal translation in this area is simply moved from our Metal backend to gfx-portability. Having gfx-portability run on Vulkan would mean a lot of extra work then, but nobody needs that path anyway. Again, there are still unknowns about how gfx-portability DX backends would work with this, but at lest we'll have full control of what happens implicitly in the backends, so that we can stitch it smoothly with gfx-portability.
cc @grovesNL @JohnColanduoni
I think this discussion went far away from the original issue. That one is addressed by a combination of #2619, #3049, and the new swapchain model. Please file new issues for unrelated stuff!
$ cargo run --bin quad --features=metal
Produces a grey window and nothing else
Short info header:
$git log commit 6c3c1d335778fba3333d395f486a8a032805cec6 Merge: f8b74e2 86c5d35 Author: bors[bot] bors[bot]@users.noreply.github.com Date: Wed Jan 9 17:45:47 2019 +0000
examples$ cargo run --bin quad --features=metal Finished dev [unoptimized + debuginfo] target(s) in 0.20s
Running
/Users/andrewl/rust/gfx/target/debug/quad
AdapterInfo { name: "Intel(R) Iris(TM) Graphics 6100", vendor: 0, device: 0, device_type: DiscreteGpu } Memory types: [MemoryType { properties: DEVICE_LOCAL, heap_index: 0 }, MemoryType { properties: COHERENT | CPU_VISIBLE, heap_index: 1 }, MemoryType { properties: DEVICE_LOCAL | CPU_VISIBLE, heap_index: 1 }, MemoryType { properties: DEVICE_LOCAL | CPU_VISIBLE | CPU_CACHED, heap_index: 1 }] formats: Some([Bgra8Unorm, Bgra8Srgb, Rgba16Float]) SwapchainConfig { present_mode: Fifo, composite_alpha: Inherit, format: Bgra8Srgb, extent: Extent2D { width: 1024, height: 768 }, image_count: 2, image_layers: 1, image_usage: COLOR_ATTACHMENT } SwapchainConfig { present_mode: Fifo, composite_alpha: Inherit, format: Bgra8Srgb, extent: Extent2D { width: 1024, height: 768 }, image_count: 2, image_layers: 1, image_usage: COLOR_ATTACHMENT } SwapchainConfig { present_mode: Fifo, composite_alpha: Inherit, format: Bgra8Srgb, extent: Extent2D { width: 1024, height: 768 }, image_count: 2, image_layers: 1, image_usage: COLOR_ATTACHMENT } SwapchainConfig { present_mode: Fifo, composite_alpha: Inherit, format: Bgra8Srgb, extent: Extent2D { width: 1024, height: 768 }, image_count: 2, image_layers: 1, image_usage: COLOR_ATTACHMENT } SwapchainConfig { present_mode: Fifo, composite_alpha: Inherit, format: Bgra8Srgb, extent: Extent2D { width: 1024, height: 768 }, image_count: 2, image_layers: 1, image_usage: COLOR_ATTACHMENT }[lines keep repeating infinitely]