Open georgemorgan opened 2 years ago
Just a random guess: have you verified that this doesn't run into the OS timeout given the huge workload (60k workgroups with 8M iterations per workgroup)? Commenting out the write operation probably allows the driver to DCE the loop in the shader.
Hmm, yeah that could totally be the problem. That would explain the visual hitch I get each time I run it. That may be the OS resetting the card. How would I get around that? Run fewer workgroups? I want to ensure the card is at 100% util if I can; I figured the driver / OS would preempt the shader execution to have the card do other work instead of just totally resetting it.
If you just want to run it locally, there is probably a way to manually disable the timeout. In general, you can try splitting the work over multiple dispatches and ideally also split the workload done per shader; I guess the 8M loop iterations are the more troublesome part in this case.
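A rough sketch of the splitting idea above, in plain Rust with no wgpu calls. `plan_dispatches` and `DispatchChunk` are hypothetical names, not part of wgpu, and the figures in the usage note (60,000 workgroups, 8,000,000 iterations) are the round numbers quoted in this thread. The helper only works out which workgroup range and iteration range each smaller dispatch would cover. Passing those ranges to the shader, and keeping the running `result` in the buffer between dispatches, is assumed and not shown.

```rust
/// One slice of the full workload: a range of workgroups and a range of loop iterations.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct DispatchChunk {
    first_workgroup: u32,
    workgroup_count: u32,
    first_iteration: u32,
    iteration_count: u32,
}

/// Split `total_workgroups` x `total_iterations` into chunks of at most
/// `max_workgroups` x `max_iterations` each.
/// Hypothetical helper for illustration; not a wgpu API.
fn plan_dispatches(
    total_workgroups: u32,
    total_iterations: u32,
    max_workgroups: u32,
    max_iterations: u32,
) -> Vec<DispatchChunk> {
    assert!(max_workgroups > 0 && max_iterations > 0);
    let mut chunks = Vec::new();
    let mut first_iteration = 0;
    // Outer loop: walk the iteration range in slices of at most `max_iterations`.
    while first_iteration < total_iterations {
        let iteration_count = max_iterations.min(total_iterations - first_iteration);
        let mut first_workgroup = 0;
        // Inner loop: walk the workgroup range in slices of at most `max_workgroups`.
        while first_workgroup < total_workgroups {
            let workgroup_count = max_workgroups.min(total_workgroups - first_workgroup);
            chunks.push(DispatchChunk {
                first_workgroup,
                workgroup_count,
                first_iteration,
                iteration_count,
            });
            first_workgroup += workgroup_count;
        }
        first_iteration += iteration_count;
    }
    chunks
}
```

For example, `plan_dispatches(60_000, 8_000_000, 15_000, 1_000_000)` yields 32 chunks (4 workgroup slices times 8 iteration slices), each of which could be submitted as its own, much shorter dispatch. The chunk sizes here are arbitrary; what stays under the timeout depends on the hardware and driver.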
I ran it on the master branch of wgpu on an M1 Mac, and it crashed on the Vulkan backend too. It works on the Metal backend, but the output is wrong:
... 59953, 59954, 59955, 59956, 59957, 59958, 59959, 59960, 59961, 59962, 59963, 59964, 59965, 59966, 59967, 59968, 59969, 59970, 59971, 59972, 59973, 59974, 59975, 59976, 59977, 59978, 59979, 59980, 59981, 59982, 59983, 59984, 59985, 59986, 59987, 59988, 59989, 59990, 59991, 59992, 59993, 59994, 59995, 59996, 59997, 59998, 59999]
If I slightly change the shader code from
result = max(result, a);
lots_of_data[i] = result;
to:
lots_of_data[i] = max(result, a);
both backends work fine and output the correct results:
... 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123]
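Rather than eyeballing 60,000 printed values, the readback data could be checked with a small helper like the one below. This is only a sketch: `first_mismatch` is a hypothetical name, and it assumes the buffer has already been copied back to the CPU as a `u32` slice.

```rust
/// Return the index and value of the first element that differs from `expected`,
/// or `None` if every element matches.
/// Hypothetical helper for checking readback data; not part of wgpu.
fn first_mismatch(data: &[u32], expected: u32) -> Option<(usize, u32)> {
    data.iter()
        .copied()
        .enumerate()
        .find(|&(_, value)| value != expected)
}
```

With the expected result from this issue, `first_mismatch(&output, 123)` returning `None` would mean every element came back as 123, while `Some((index, value))` points at the first element the shader did not update.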
Description
In my compute shader, I do a read from a STORAGE buffer, followed by an operation, followed by a write back to that buffer.
On Linux with an Nvidia card using the Vulkan backend, this causes a "parent device is lost" error to be thrown, indicating the GPU has crashed. On macOS using Metal, the shader simply returns no data and the WindowServer process uses 100% GPU until I reboot my machine.
Repro steps
Check out this commit:
https://github.com/georgemorgan/wgpu/commit/66391306790c3ade21d49cb2d944965755f8e094
Then run:
cargo run --example hello-compute
Expected vs observed behavior
Expected behavior is that the compute shader returns 60000 u32 values, each equal to 123. Observed behavior is that it returns the initial data in the buffer (1-59999), indicating that no work was done, and the GPU crashes.
Comment out the line
lots_of_data[i] = result;
in the shader and run it again. The GPU will not crash, and will return the expected 60k-element array of 123.
Platform