floooh / sokol

minimal cross-platform standalone C headers
https://floooh.github.io/sokol-html5
zlib License

sokol_gfx: Issue with async buffer initialization #222

Open floatms opened 4 years ago

floatms commented 4 years ago

Hi! I just wanted to get some clarification on how sg_alloc_buffer and sg_init_buffer are supposed to be used. I've been generating mesh data on separate threads and calling sg_make_buffer inside the sapp frame callback. This works fine, but I noticed that the sg_make_buffer calls started to take a lot of CPU time (the mesh data is quite large), seemingly scaling with the total buffer count.

So I tried calling sg_alloc_buffer on the main thread and then sg_init_buffer inside the mesh data worker threads. But I either get a silent crash or this assert gets triggered: Assertion failed: (((HRESULT)(hr)) >= 0) && buf->d3d11_buf, file sokol_gfx.h, line 6734

I had guessed that sg_init_buffer is safe to call from different threads, but that does not seem to be the case?

I didn't extract a minimal repro right now but if this is supposed to work then I can try. My setup consists of sokol_app and sokol_gfx with the D3D11 backend and win32 threads.

floooh commented 4 years ago

> So I tried calling sg_alloc_buffer on the main thread and then sg_init_buffer inside the mesh data worker threads.

sokol_app.h and sokol_gfx.h expect that all functions are called from the same thread, calling functions from different threads will most likely result in data corruption.

The loadpng-sapp sample is an example of how asynchronous resource creation is intended to be used:

First a handle is allocated via sg_alloc_texture(), and an asynchronous IO operation is started which loads the texture data (this is actually happening in a separate thread on non-WASM platforms):

https://github.com/floooh/sokol-samples/blob/91b9b4e48d21a9b6906ebcebe38f815a63d6e942/sapp/loadpng-sapp.c#L66-L71

https://github.com/floooh/sokol-samples/blob/91b9b4e48d21a9b6906ebcebe38f815a63d6e942/sapp/loadpng-sapp.c#L149-L160

When the texture data has finished loading, the texture is initialized via sg_init_texture(), but the IO callback where this happens is running on the main thread too:

https://github.com/floooh/sokol-samples/blob/91b9b4e48d21a9b6906ebcebe38f815a63d6e942/sapp/loadpng-sapp.c#L180-L190

So basically, you can load and prepare the mesh data on a separate thread, but when this is done you need to pass the data back to the thread where sokol-gfx is running and create the resource there.
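The hand-off described above can be sketched as a small single-producer/single-consumer queue: a worker thread pushes finished mesh data, and the thread that owns sokol-gfx pops it in the frame callback and creates the resource there. This is only a sketch under assumptions: `mesh_result_t`, `mesh_queue_t` and the `mesh_queue_*` functions are hypothetical names and not part of sokol, `buffer_id` stands in for a handle allocated on the main thread, and one queue per worker thread is assumed (the queue is not safe for several producers).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MESH_QUEUE_CAPACITY 64u /* must be a power of two */

/* Hypothetical record for one finished mesh; not a sokol-gfx type. */
typedef struct {
    uint32_t buffer_id; /* stand-in for a handle allocated on the main thread */
    const void* data;   /* mesh data prepared by the worker thread */
    size_t size;        /* size of data in bytes */
} mesh_result_t;

typedef struct {
    mesh_result_t items[MESH_QUEUE_CAPACITY];
    atomic_size_t head; /* advanced only by the consumer (main thread) */
    atomic_size_t tail; /* advanced only by the producer (worker thread) */
} mesh_queue_t;

void mesh_queue_init(mesh_queue_t* q) {
    atomic_init(&q->head, 0);
    atomic_init(&q->tail, 0);
}

/* Call from the worker thread. Returns false if the queue is full. */
bool mesh_queue_push(mesh_queue_t* q, mesh_result_t item) {
    size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&q->head, memory_order_acquire);
    if (tail - head == MESH_QUEUE_CAPACITY) {
        return false;
    }
    q->items[tail & (MESH_QUEUE_CAPACITY - 1)] = item;
    /* release: the item must be visible before the new tail is */
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    return true;
}

/* Call from the main thread. Returns false if the queue is empty. */
bool mesh_queue_pop(mesh_queue_t* q, mesh_result_t* out) {
    size_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (head == tail) {
        return false;
    }
    *out = q->items[head & (MESH_QUEUE_CAPACITY - 1)];
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return true;
}

/* Call from the main thread, e.g. once per frame: pops at most max_count
   finished meshes into out and returns how many were popped. The caller
   would then create the GPU resources for those entries on this thread. */
size_t mesh_queue_drain(mesh_queue_t* q, mesh_result_t* out, size_t max_count) {
    assert(q != NULL && out != NULL);
    size_t count = 0;
    while (count < max_count && mesh_queue_pop(q, &out[count])) {
        count++;
    }
    return count;
}
```

In the frame callback, the entries returned by `mesh_queue_drain` would be passed to the sokol-gfx resource creation calls, so that all sokol-gfx functions still run on one thread. The `max_count` argument also gives a simple per-frame cap on how many resources are created.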

floooh commented 4 years ago

PS: I guess this won't solve the problem with sg_make_buffer() being slow though because you already prepared the data on a separate thread. It's surprising that D3D11's CreateBuffer function would be so slow that it is noticeable (I had expected that shader and pipeline creation on the main thread would be the bigger problem).

Also the D3D11 buffer creation code in sokol_gfx.h doesn't have any loops which would cause it to get slower with the total number of buffers, so I guess it must be something in D3D itself.

https://github.com/floooh/sokol/blob/d4b3a599b95d7892dc6de3f6dedc0888f9ced1d1/sokol_gfx.h#L6702-L6737

How big are the buffers you're creating, and how many are you creating?

And did you test with NDEBUG defined to make sure the D3D validation is disabled? (see here: https://github.com/floooh/sokol/blob/d4b3a599b95d7892dc6de3f6dedc0888f9ced1d1/sokol_app.h#L3540-L3542).

Visual Studio should handle NDEBUG on/off automatically when switching between Debug and Release mode, but I'm not sure about other compilers or build systems (fips handles it too for all platforms and compilers).
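For build setups that do not set NDEBUG automatically, the usual "debug unless NDEBUG is defined" convention can be sketched as follows. `EXAMPLE_DEBUG` and `example_validation_enabled` are hypothetical names used only for illustration; they are not taken from the sokol headers.

```c
#include <assert.h> /* assert() itself is also compiled out when NDEBUG is defined */
#include <stdbool.h>

/* Hypothetical macro: debug features are on unless NDEBUG is defined. */
#if !defined(NDEBUG)
    #define EXAMPLE_DEBUG (1)
#else
    #define EXAMPLE_DEBUG (0)
#endif

/* Reports whether debug-only code paths (such as API validation)
   would be compiled in for this build. */
bool example_validation_enabled(void) {
    return EXAMPLE_DEBUG != 0;
}
```

With clang or gcc, passing `-DNDEBUG` on the command line defines the macro for the whole translation unit, so a hand-written build script needs to add it explicitly for release builds.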

floatms commented 4 years ago

Okay so I've gathered some data. I'm creating a total of 1024 buffers. The max vertex count I got was 54840 and an average of 30065. This is because I'm generating the meshes procedurally (16x16x256 sized chunks of voxel terrain per buffer). The positions buffer is 1 float per vertex (3 world position integer offsets quantized). There is also a color buffer which is 4 floats per vertex (not optimized yet). The index buffer is 1.5x the vertex count (uint32 index-type).
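As a rough size check, the figures above can be turned into bytes per chunk. This is a sketch that assumes 4-byte floats and 4-byte indices and rounds the index count down; `chunk_sizes_t` and `chunk_sizes` are hypothetical helper names. With these assumptions an average chunk (30065 vertices) comes to 781,688 bytes and the largest one (54840 vertices) to 1,425,840 bytes. If the 1024 figure counts chunks rather than individual buffers (an assumption, the post does not say), the average case adds up to roughly 800 MB in total.

```c
#include <assert.h>
#include <stdint.h>

enum {
    BYTES_PER_FLOAT = 4, /* assumed 32-bit float */
    BYTES_PER_INDEX = 4  /* uint32 indices */
};

typedef struct {
    uint64_t position_bytes; /* 1 float per vertex (quantized position) */
    uint64_t color_bytes;    /* 4 floats per vertex */
    uint64_t index_bytes;    /* 1.5 indices per vertex */
    uint64_t total_bytes;
} chunk_sizes_t;

/* Computes the buffer sizes for one chunk from its vertex count. */
chunk_sizes_t chunk_sizes(uint64_t vertex_count) {
    assert(vertex_count <= UINT64_MAX / 16); /* guard against overflow */
    chunk_sizes_t s;
    s.position_bytes = vertex_count * BYTES_PER_FLOAT;
    s.color_bytes = vertex_count * 4 * BYTES_PER_FLOAT;
    s.index_bytes = (vertex_count * 3 / 2) * BYTES_PER_INDEX;
    s.total_bytes = s.position_bytes + s.color_bytes + s.index_bytes;
    return s;
}
```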

When turning on NDEBUG I'm getting some slight improvements, though it's hard to tell. Here are two gists with the timings when compiling without and with NDEBUG defined. I'm capping the make_buffer calls at 10 per frame and adding up the times of each call (with sokol_time). The 0.0 entries at the beginning are frames where no mesh data is ready yet. The timings are in milliseconds.

Without NDEBUG: https://gist.github.com/floatms/ac4cc0d5b78faf1cec100746d44202ba

With NDEBUG: https://gist.github.com/floatms/2b31b9d130b09e2445ff1e78d43eb54b

At the bottom, just before all the buffers are created (streaks of 0.0), you can tell that when NDEBUG is defined, the timings are a lot more stable and overall do not increase much compared to the first few frames. When NDEBUG is not defined, on the other hand, there are a lot more outliers and the timings seem to trend upward.

So NDEBUG helps but I feel like the time some of these calls take is still way too high.

> Visual Studio should handle NDEBUG on/off automatically when switching between Debug and Release mode, but I'm not sure about other compilers or build systems (fips handles it too for all platforms and compilers).

I'm actually just using a batch-file + clang. So pretty low-tech.

floatms commented 4 years ago

Got a bit ahead of myself there.

> sokol_app.h and sokol_gfx.h expect that all functions are called from the same thread, calling functions from different threads will most likely result in data corruption.

Okay thanks, so this clears things up. Is this a limitation of the backends or internal to sokol in the case of sg_make_buffer?

> I guess this won't solve the problem with sg_make_buffer() being slow though because you already prepared the data on a separate thread.

You're right, preparing the data is plenty fast now due to multi-threading (still room for optimization though) but the draw-state setup became the bottleneck, surprisingly.

floatms commented 4 years ago

Quick update. This issue has gotten a bit off-topic but I believe that this problem might be worth investigating, as it may affect some use cases with the D3D11 backend. I made a gist with a "minimal" repro which demonstrates the effect: https://gist.github.com/floatms/f715f53b3ff48c88a4545343da1305da

This is the sapp_quad example but with much more mesh data. The buffers are not even filled/rendered completely but the stall during buffer initialization is still there. Init times are measured with sokol_time and printf'd. There are some #defines at the top of the file that can be changed (they all have descriptive comments). Some examples:

I found this link while researching the issue: https://www.reddit.com/r/GraphicsProgramming/comments/7zm72t/any_way_to_reduce_stall_when_allocating_huge/ The comments are not quite clear on the issue, but some suggest that it might have to do with exceeding the GPU memory limits. RenderDoc tells me that there's just short of 1 GB of GPU memory in use when buffer sizes are set to average and without quantization, but that is not near the limit at all.

I had wanted to profile with MSVC but for some stupid reason it refuses to display profile reports without a recent version of Internet Explorer installed...which I can't provide because Windows rejects the needed updates :)

So this is all I got. The obvious way to avoid this issue is to reduce buffer count and size but that is a quite frustrating solution.

floooh commented 4 years ago

Hi, sorry for not having looked at the issue for so long, and thanks for all the investigative work. Sooner or later I need to properly tackle the asynchronous resource creation issue in sokol-gfx, I'll just keep this issue open as a reminder and place to add more requirements / findings :)

floatms commented 4 years ago

Hey, no problem. I just wanted to get all of this out there in case it is helpful to the next person.

Since then I found out that changing this line: https://github.com/floooh/sokol/blob/d4b3a599b95d7892dc6de3f6dedc0888f9ced1d1/sokol_app.h#L3539 to not include the D3D11_CREATE_DEVICE_SINGLETHREADED flag is all that was needed to make multi-threaded resource creation work (for buffers at least). The MSDN docs mention that this can potentially decrease performance, so it would probably make sense to offer a platform-specific flag that can be activated to turn on async support.
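A minimal sketch of what such an opt-in flag could look like when the device creation flags are assembled. Everything here is hypothetical: `ex_d3d11_options_t` and `ex_d3d11_create_flags` are not part of sokol_app.h, and the `EX_` constants are local stand-ins that mirror the values of D3D11_CREATE_DEVICE_SINGLETHREADED (0x1) and D3D11_CREATE_DEVICE_DEBUG (0x2) from d3d11.h so that the example compiles without the Windows SDK.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins for the D3D11_CREATE_DEVICE_FLAG values in d3d11.h. */
enum {
    EX_D3D11_CREATE_DEVICE_SINGLETHREADED = 0x1,
    EX_D3D11_CREATE_DEVICE_DEBUG = 0x2
};

/* Hypothetical options; not an existing sokol_app.h setting. */
typedef struct {
    bool multithreaded_resource_creation; /* opt in to async resource creation */
    bool debug_layer;                     /* e.g. on when NDEBUG is not defined */
} ex_d3d11_options_t;

/* Builds the flags value that would be passed to device creation. The
   single-threaded flag is kept by default and dropped only on request. */
uint32_t ex_d3d11_create_flags(ex_d3d11_options_t opts) {
    uint32_t flags = 0;
    if (!opts.multithreaded_resource_creation) {
        flags |= EX_D3D11_CREATE_DEVICE_SINGLETHREADED;
    }
    if (opts.debug_layer) {
        flags |= EX_D3D11_CREATE_DEVICE_DEBUG;
    }
    /* only the two known bits may be set */
    assert((flags & ~(uint32_t)0x3) == 0);
    return flags;
}
```

Dropping the flag only affects what D3D11 itself tolerates. As stated earlier in the thread, sokol_gfx.h still expects all of its functions to be called from one thread, so this does not by itself make the sokol-gfx API thread-safe.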

Unfortunately I can't really help with the other back-ends.