kpreid opened this issue 1 year ago
How would we go about implementing this on vk/dx12/metal backends? A dedicated OS-thread for each pipeline?
The best solution would probably require wgpu to use a thread pool, since spawning OS threads for individual jobs like that might have some decent overhead.
If so I'd definitely prefer if wgpu had a way to disable its own thread pool. My understanding is that this is an API that is mostly useful for browser usage, since WebGPU in a browser doesn't yet have any way to do multithreading. In a native context I should just be able to call create_*_pipeline() myself from my own thread pool.
How is this function implemented in the browser? Do they use thread pools for this?
I wrote the original issue simply as “here is a discrepancy with the spec” without much thought, but here are some further thoughts:
My use of wgpu targets web and desktop (as is probably desirable for many games). Therefore, I want to be able to have good performance (no event loop hiccups) on web, which means that I must use async. Conveniently, the web environment inherently already “has an async executor”, and wgpu-on-web targets that executor (unavoidably, insofar as it has anything async at all).
This together with @PJB3005's point suggests that the solution here and for similar situations might be for wgpu to allow plugging in an executor in some fashion; to be handed a “spawn_blocking()” function that it can use when some operation is supposed to be async but this is not natively supported. That way, wgpu would not need to manage its own thread pool.
Certainly this could be done by a layer on top of wgpu, but it is annoying to have to design and maintain additional code just to get what WebGPU specifies to be possible. Also, in particular, if wgpu had an executor (or thread pool) available to it, then it could make map_async() much more straightforward to use; right now it is awkward to use in a truly async fashion without potentially missing wakeups.
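The wakeup hazard can be sidestepped by bridging the callback into a channel. Here is a minimal sketch of that pattern; mock_map_async below is a stand-in for a callback-based call shaped like map_async(), not wgpu's real signature:

```rust
use std::sync::mpsc;
use std::thread;

// Stand-in for a callback-based API shaped like wgpu's `map_async`: the
// callback fires on some other thread once the operation completes.
fn mock_map_async(callback: impl FnOnce(Result<(), ()>) + Send + 'static) {
    thread::spawn(move || callback(Ok(())));
}

// Bridge the callback into a channel. A completion that fires before we
// start waiting just sits in the channel, so no wakeup can be missed.
fn map_blocking() -> Result<(), ()> {
    let (tx, rx) = mpsc::channel();
    mock_map_async(move |result| {
        // Ignore send errors: the receiver side may have given up waiting.
        let _ = tx.send(result);
    });
    rx.recv().expect("callback was dropped without being invoked")
}
```

If wgpu had an executor available internally, it could hand back a future built on this same bridge instead of leaving the bookkeeping to each caller.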
These would be very useful for us: we bulk-create a lot of API objects, and it's a significant cost on some machines to do that concurrently, especially with the WebGL backend. I've been tempted to try to write a hypothetical "DeviceExt::create_render_pipeline_bulk()" method, but async would solve it much better.
We definitely don't want to utilize or make threads on our own. In line with the other async functions, a simple native implementation of this API would be to call the standard create_*_pipeline and then return an already-ready future.
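That “already-ready future” shape can be sketched as follows; Device and RenderPipeline here are hypothetical stand-ins for illustration, not wgpu's real types:

```rust
use std::future::{ready, Ready};

// Hypothetical stand-ins for the real wgpu types, for illustration only.
pub struct RenderPipeline { pub id: u32 }
pub struct Device;

impl Device {
    // The existing blocking entry point; the real one compiles shaders.
    pub fn create_render_pipeline(&self) -> RenderPipeline {
        RenderPipeline { id: 1 }
    }

    // Trivial native implementation of the async variant: do the work
    // synchronously, then return a future that is already resolved.
    pub fn create_render_pipeline_async(&self) -> Ready<RenderPipeline> {
        ready(self.create_render_pipeline())
    }
}
```

This keeps the native and web signatures identical while letting the native backend stay single-threaded by default.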
This is a hard decision to make, as it's very hard to paper over the differences between native and web.
So we've been musing about this same problem in the webgpu.h standardization meetings and we have come up with a possible solution that we're asking for feedback on. The rough C solution is here but I will translate this to the rust api:
type Task = Box<dyn FnOnce() + Send>;
type TaskCallback = Box<dyn Fn(Task) + Send + Sync>;
// Maybe this should actually live on the device
struct InstanceDescriptor {
...
// Callback which will be called when the implementation wants to do work on another thread.
// If this is not provided, the implementation will not spawn work onto any other threads.
//
// The callback will be called with the task that the runtime wants to do on a thread.
// This task should be spawned onto a threadpool, immediately invoked inside the callback, or otherwise
// made to execute.
//
// It should be assumed that the work spawned through this callback will take substantial time (1ms+) and be pure compute.
task_executor: Option<TaskCallback>,
...
}
impl Device {
// On webgpu will call createRenderPipeline.
//
// On native will:
// - If allow_async is false, create the render pipeline inside the call.
// - If allow_async is true, the implementation is allowed (but not required) to spawn a
//   job on the task callback to do the work of compilation, if such a callback exists. This leads to
//   less predictable performance but increased overall performance, as compilation is parallelized.
fn create_render_pipeline(&self, desc: RenderPipelineDescriptor, allow_async: bool) -> RenderPipeline;
// On webgpu will call createRenderPipelineAsync.
//
// On native will:
// - Spawn a job on the instance's `task_executor`, if it exists, to generate the pipeline. Otherwise:
// - Create the render pipeline inside the call.
async fn create_render_pipeline_async(&self, desc: RenderPipelineDescriptor) -> RenderPipeline;
}
This API should allow people to use arbitrary integrations:
let desc = InstanceDescriptor {
...
task_executor: Some(Box::new(|task| { tokio::task::spawn_blocking(task); })),
};
let desc = InstanceDescriptor {
...
task_executor: Some(Box::new(|task| rayon::spawn(task))),
};
let my_fancy_threadpool_spawner: Arc<T> = ...;
let desc = InstanceDescriptor {
...
task_executor: Some(Box::new(move |task| my_fancy_threadpool_spawner.spawn(task))),
};
Looking forward to people's thoughts on this. This kind of design will also open the door to other possible optimizations like this.
Why do we have to worry about this at all? Why can't the user just put the Device in an Arc (or use scoped threads or whatever), call Device::create_compute_pipeline there, and be done with it? In other words - we have this thread-safe API, so the whole point should be that the user can deal with that stuff themselves. @kpreid, can't you just write your own create_compute_pipeline_async? It's not like wgpu is any better at threading than you are.
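For the record, that userspace version is easy enough to sketch. The types below are hypothetical stand-ins; the point is that the real Device is Send + Sync, which is what makes this approach possible on native:

```rust
use std::sync::{mpsc, Arc};
use std::thread;

// Hypothetical stand-ins for wgpu's thread-safe types.
pub struct ComputePipeline { pub id: u32 }
pub struct Device;

impl Device {
    pub fn create_compute_pipeline(&self, id: u32) -> ComputePipeline {
        ComputePipeline { id } // the real call would block on compilation
    }
}

// Userspace "async" creation: share the device via Arc, do the blocking
// call on another thread, and hand back a receiver to wait on.
pub fn create_compute_pipeline_async(
    device: Arc<Device>,
    id: u32,
) -> mpsc::Receiver<ComputePipeline> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let _ = tx.send(device.create_compute_pipeline(id));
    });
    rx
}
```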
There are a couple of considerations here that are pushing towards having internal handling:
The question of "why do we care at all" is still a good one.
It pretty much boils down to the WASM implementations. For Jim's userspace solution to work on WASM:
Additionally, on WASM, the extra thread needed to initiate that pipeline creation is useless - the actual parallelization is happening in the GPU process, so an extra JS thread is wasted overhead. And JS threads are very expensive compared to native threads. So it works fine if you have that thread already, but it's detrimental if you didn't actually need it.
Hence I think it's best for the default experience in native to match the experience in JS (or other remoting implementations) closely where possible.
Okay - I understand what I wasn't getting before.
The WebGPU create*PipelineAsync functions expose parallelism not otherwise available to web content: at least in today's browsers, content cannot just fork off another thread, call GPUDevice.create*Pipeline (sans Async), and send the pipeline back to the caller once it's usable - but create*PipelineAsync essentially gets you the same effect.
That means that this new parallelism can only be offered to programs using wgpu on the web if wgpu's web backend calls those functions.
That means that wgpu's own API must include those functions.
Since wgpu wants to provide a common API for web and native, wgpu's native API must now support these functions (even though native-only code should always prefer to just take care of the matter itself, as I suggested above).
Connor's proposal assumes that native wgpu is just going to fork off a thread, and lets wgpu's user set the policy for doing so, in a way that should only impact setup code.
Additionally, on WASM, the extra thread needed to initiate that pipeline creation is useless - the actual parallelization is happening in the GPU process, so an extra JS thread is wasted overhead.
Right - a separate thread in the content process invites the use of a separate thread in the GPU process, but doesn't require it, so it's useless.
I like the general idea. How would a user of the API know when a task is done?
The user would need to have their own signalling as part of the function provided to the hook.
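For example, the hook itself never reports completion, but the user's wrapper around each task can. The make_executor helper and the channel below are one possible shape for that signalling, not part of the proposal:

```rust
use std::sync::mpsc;
use std::thread;

type Task = Box<dyn FnOnce() + Send>;

// A user-supplied executor hook that wraps every task it is handed so that
// a message is sent on completion; the channel is the user's "done" signal.
fn make_executor(done_tx: mpsc::Sender<()>) -> impl Fn(Task) {
    move |task: Task| {
        let done_tx = done_tx.clone();
        thread::spawn(move || {
            task(); // run whatever work the implementation handed us
            let _ = done_tx.send(()); // then signal the user's own channel
        });
    }
}
```

The same wrapping works unchanged if the tasks are forwarded to rayon or a tokio blocking pool instead of raw threads.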
Just to add a reason for why this is needed on native: Metal is weird and will block on creating pipelines unless you pass in a callback at pipeline creation time. (Bevy's async pipeline compilation ran into this with wgpu's existing create_*_pipeline(), and it ended up being quite a bit slower to try to create Metal pipelines asynchronously/in parallel.)
VK_KHR_deferred_host_operations may be interesting as API design prior art and/or as a backend on which this might be implemented.
The WebGPU specification includes
GPUDevice.createComputePipelineAsync()
GPUDevice.createRenderPipelineAsync()
and their use is recommended to avoid blocking upon pipeline creation. There are currently no corresponding functions in wgpu; presumably there should be.