Compute shader crash when doing sequential read and write from/to STORAGE buffer larger than 8MB

georgemorgan commented 2 years ago

Description

In my compute shader, I do a read from a STORAGE buffer, followed by an operation, followed by a write back to that buffer.

        var a: u32 = lots_of_data[i];
        result = max(result, a);
        lots_of_data[i] = result;

On Linux with an Nvidia card using the Vulkan backend, this causes a "parent device is lost" error to be thrown, indicating the GPU has crashed. On macOS using Metal, the shader simply returns no data and the WindowServer process uses 100% GPU until I reboot my machine.

Repro steps

Check out this commit, then run cargo run --example hello-compute.

https://github.com/georgemorgan/wgpu/commit/66391306790c3ade21d49cb2d944965755f8e094

Expected vs observed behavior

The expected behavior is that the compute shader returns 60000 u32 values, each equal to 123. The observed behavior is that it returns the initial data in the buffer (1-59999), indicating that no work was done, and the GPU crashes.

Comment out the line lots_of_data[i] = result; in the shader and run it again. The GPU will not crash, and will return the expected 60k element array of 123.

Platform

Adapter 0:
    Backend:   Metal
    Name:      "Apple M1 Pro"
    VendorID:  0
    DeviceID:  0
    Type:      IntegratedGpu
    Compliant: true
    Features:
        DEPTH_CLIP_CONTROL:                                             true
        TEXTURE_COMPRESSION_BC:                                         true
        INDIRECT_FIRST_INSTANCE:                                        true
        TIMESTAMP_QUERY:                                                false
        PIPELINE_STATISTICS_QUERY:                                      false
        MAPPABLE_PRIMARY_BUFFERS:                                       true
        TEXTURE_BINDING_ARRAY:                                          true
        BUFFER_BINDING_ARRAY:                                           false
        STORAGE_RESOURCE_BINDING_ARRAY:                                 true
        SAMPLED_TEXTURE_AND_STORAGE_BUFFER_ARRAY_NON_UNIFORM_INDEXING:  true
        UNIFORM_BUFFER_AND_STORAGE_TEXTURE_ARRAY_NON_UNIFORM_INDEXING:  true
        PARTIALLY_BOUND_BINDING_ARRAY:                                  false
        UNSIZED_BINDING_ARRAY:                                          false
        MULTI_DRAW_INDIRECT:                                            false
        MULTI_DRAW_INDIRECT_COUNT:                                      false
        PUSH_CONSTANTS:                                                 true
        ADDRESS_MODE_CLAMP_TO_BORDER:                                   true
        POLYGON_MODE_LINE:                                              true
        POLYGON_MODE_POINT:                                             false
        TEXTURE_COMPRESSION_ETC2:                                       true
        TEXTURE_COMPRESSION_ASTC_LDR:                                   true
        TEXTURE_ADAPTER_SPECIFIC_FORMAT_FEATURES:                       true
        SHADER_FLOAT64:                                                 false
        VERTEX_ATTRIBUTE_64BIT:                                         false
        CONSERVATIVE_RASTERIZATION:                                     false
        VERTEX_WRITABLE_STORAGE:                                        true
        CLEAR_TEXTURE:                                                  true
        SPIRV_SHADER_PASSTHROUGH:                                       false
        SHADER_PRIMITIVE_INDEX:                                         false
        MULTIVIEW:                                                      false
        TEXTURE_FORMAT_16BIT_NORM:                                      true
        ADDRESS_MODE_CLAMP_TO_ZERO:                                     true
        TEXTURE_COMPRESSION_ASTC_HDR:                                   true
    Limits:
        Max Texture Dimension 1d:                        16384
        Max Texture Dimension 2d:                        16384
        Max Texture Dimension 3d:                        2048
        Max Texture Array Layers:                        2048
        Max Bind Groups:                                 8
        Max Dynamic Uniform Buffers Per Pipeline Layout: 8
        Max Dynamic Storage Buffers Per Pipeline Layout: 4
        Max Sampled Textures Per Shader Stage:           16
        Max Samplers Per Shader Stage:                   1024
        Max Storage Buffers Per Shader Stage:            8
        Max Storage Textures Per Shader Stage:           8
        Max Uniform Buffers Per Shader Stage:            12
        Max Uniform Buffer Binding Size:                 4294967295
        Max Storage Buffer Binding Size:                 4294967295
        Max Vertex Buffers:                              8
        Max Vertex Attributes:                           16
        Max Vertex Buffer Array Stride:                  2048
        Max Push Constant Size:                          4096
        Min Uniform Buffer Offset Alignment:             256
        Min Storage Buffer Offset Alignment:             256
        Max Inter-Stage Shader Component:                128
        Max Compute Workgroup Storage Size:              65536
        Max Compute Invocations Per Workgroup:           1024
        Max Compute Workgroup Size X:                    256
        Max Compute Workgroup Size Y:                    256
        Max Compute Workgroup Size Z:                    64
        Max Compute Workgroups Per Dimension:            65535
    Downlevel Properties:
        Shader Model:                        Sm5
        COMPUTE_SHADERS:                     true
        FRAGMENT_WRITABLE_STORAGE:           true
        INDIRECT_EXECUTION:                  true
        BASE_VERTEX:                         true
        READ_ONLY_DEPTH_STENCIL:             true
        NON_POWER_OF_TWO_MIPMAPPED_TEXTURES: true
        CUBE_ARRAY_TEXTURES:                 true
        COMPARISON_SAMPLERS:                 true
        INDEPENDENT_BLEND:                   true
        VERTEX_STORAGE:                      true
        ANISOTROPIC_FILTERING:               true
        FRAGMENT_STORAGE:                    true
        MULTISAMPLED_SHADING:                true
        DEPTH_TEXTURE_AND_BUFFER_COPIES:     true

msiglreith commented 2 years ago

Just a random guess: have you verified that this doesn't run into the OS timeout given the huge workload (60k workgroups with 8M iterations per workgroup)? Commenting out the write operation probably allows the driver to DCE the loop in the shader.
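To put the size of that workload in numbers, here is a rough back-of-the-envelope sketch. It assumes round figures (60,000 workgroups, 8,000,000 iterations each, one invocation per workgroup); the function names and the throughput figure are purely hypothetical and were not measured on any GPU.

```rust
/// Total loop iterations for the whole dispatch.
/// Assumes one invocation per workgroup (an assumption, not taken from the shader).
fn total_iterations(workgroups: u64, iterations_per_workgroup: u64) -> u64 {
    workgroups * iterations_per_workgroup
}

/// Very rough runtime estimate in seconds.
/// `iterations_per_second` is a hypothetical throughput figure the caller supplies.
fn estimated_seconds(total: u64, iterations_per_second: f64) -> f64 {
    total as f64 / iterations_per_second
}
```

With these round numbers, total_iterations(60_000, 8_000_000) is 480 billion iterations in a single dispatch. Even at an optimistic, made-up rate of 100 billion iterations per second that is several seconds of uninterrupted GPU work, which could plausibly exceed a watchdog timeout; the actual limit depends on the OS and driver.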

georgemorgan commented 2 years ago

> Just a random guess: have you verified that this doesn't run into the OS timeout given the huge workload (60k workgroups with 8M iterations per workgroup)? Commenting out the write operation probably allows the driver to DCE the loop in the shader.

Hmm, yeah that could totally be the problem. That would explain the visual hitch I get each time I run it. That may be the OS resetting the card. How would I get around that? Run fewer workgroups? I want to ensure the card is at 100% util if I can; I figured the driver / OS would preempt the shader execution to have the card do other work instead of just totally resetting it.

msiglreith commented 2 years ago

If you just want to run it locally, there is probably a way to manually disable the timeout. In general, you can try splitting the work over multiple dispatches and ideally also splitting the workload done per shader invocation. I guess the 8M loop iterations are the more troublesome part in this case.
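A minimal sketch of the "multiple dispatches" idea, kept to the CPU-side planning only. The names DispatchChunk and plan_dispatches are hypothetical and not part of wgpu; the sketch just computes which slice of the workgroup range each smaller dispatch would cover.

```rust
/// One slice of the overall workload: the first workgroup index it covers
/// and how many workgroups it contains. (Hypothetical helper type.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct DispatchChunk {
    first_workgroup: u32,
    workgroup_count: u32,
}

/// Split `total_workgroups` into consecutive chunks of at most
/// `max_per_dispatch` workgroups each. The last chunk may be smaller.
fn plan_dispatches(total_workgroups: u32, max_per_dispatch: u32) -> Vec<DispatchChunk> {
    assert!(max_per_dispatch > 0, "chunk size must be non-zero");
    let mut chunks = Vec::new();
    let mut first = 0u32;
    while first < total_workgroups {
        // Clamp the final chunk so the plan never exceeds the total.
        let count = max_per_dispatch.min(total_workgroups - first);
        chunks.push(DispatchChunk {
            first_workgroup: first,
            workgroup_count: count,
        });
        first += count;
    }
    chunks
}
```

Each chunk would then map to one smaller dispatch, with first_workgroup handed to the shader as an offset (for example through a uniform) so it knows which part of the buffer to process. That wiring is an assumption and is not shown here. Note that this only shortens each dispatch; the work done per invocation (the long loop) would still need to be reduced separately.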

jinleili commented 2 years ago

I ran this on the master branch of wgpu on an M1 Mac, and it crashed on the vk backend too. It works on the Metal backend, but the output is wrong:

... 59953, 59954, 59955, 59956, 59957, 59958, 59959, 59960, 59961, 59962, 59963, 59964, 59965, 59966, 59967, 59968, 59969, 59970, 59971, 59972, 59973, 59974, 59975, 59976, 59977, 59978, 59979, 59980, 59981, 59982, 59983, 59984, 59985, 59986, 59987, 59988, 59989, 59990, 59991, 59992, 59993, 59994, 59995, 59996, 59997, 59998, 59999]

If I slightly change the shader code from

result = max(result, a);
lots_of_data[i] = result;

to:

lots_of_data[i] = max(result, a);

both backends work fine and output the correct results:

... 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123]

gfx-rs / wgpu

Compute shader crash when doing sequential read and write from/to STORAGE buffer larger than 8MB #2554