Draw calls performance bottleneck

bvssvni commented 8 years ago

This issue is meant to help people understand what causes a performance bottleneck in piston-graphics, and what the plan is to fix it.

For each draw call, the CPU needs to send data to the GPU. The GPU is often very fast at rendering. If the CPU cannot feed it quickly enough, the GPU sits idle and waits for more input.

GPUs are designed for handling massive amounts of data with limited variation. What the GPU does is controlled through programs written in a shader language. For OpenGL the shader language is GLSL.

When you render a rectangle, this is what happens:

1. Transformed triangles are created on the CPU in chunks and sent to the graphics backend.
2. The backend writes the received data to dynamic buffers.
3. The graphics driver tells the GPU to render using the updated buffers.
4. The GPU renders using a precompiled shader and paints pixels in the frame buffer.
5. The frame buffer is swapped with the current one to update the display.

Steps 1-4 happen repeatedly, once per chunk, when drawing many objects in each frame.
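As a rough sketch of why this adds up (not the actual backend code), here is a mock backend that only counts work. `CountingBackend`, `draw_triangles` and `draw_rectangles` are made-up names for illustration, not piston-graphics API.

```rust
/// Hypothetical stand-in for a graphics backend; it only counts work.
#[derive(Default)]
pub struct CountingBackend {
    pub buffer_writes: usize,
    pub draw_calls: usize,
}

impl CountingBackend {
    /// Steps 2-4: write the chunk to a dynamic buffer, then issue a draw call.
    pub fn draw_triangles(&mut self, vertices: &[[f32; 2]]) {
        debug_assert!(vertices.len() % 3 == 0);
        self.buffer_writes += 1; // step 2
        self.draw_calls += 1; // steps 3 and 4
    }
}

/// Step 1: triangulate an axis-aligned rectangle into two triangles on the CPU.
pub fn rectangle_triangles(x: f32, y: f32, w: f32, h: f32) -> [[f32; 2]; 6] {
    [
        [x, y],
        [x + w, y],
        [x, y + h],
        [x + w, y],
        [x + w, y + h],
        [x, y + h],
    ]
}

/// Draws `count` rectangles one at a time and returns the number of draw calls.
pub fn draw_rectangles(count: usize) -> usize {
    let mut backend = CountingBackend::default();
    for i in 0..count {
        let tris = rectangle_triangles(i as f32 * 10.0, 0.0, 8.0, 8.0);
        backend.draw_triangles(&tris);
    }
    backend.draw_calls
}
```

With one chunk per shape, the number of draw calls grows linearly with the number of shapes, even though each call carries only six vertices.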

In the Gfx backend the draw commands are collected upfront and given to the driver at the same time. However, from the graphics driver side, the instructions seem similar to the ones generated by the OpenGL backend (except for changes made to the draw state).

The 1st step happens on the CPU as a deliberate design choice in piston-graphics. Reasons to triangulate on the CPU:

- Take advantage of f64 precision in matrix transformations, which is not widely supported on the GPU
- Non-linear transforms, such as least squares deforming
- Easy to implement a new backend, with potential for one using software rasterization
- Great flexibility for the generic libraries depending on piston-graphics
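To illustrate the first point, here is a minimal sketch of transforming a rectangle in f64 on the CPU and only converting to f32 when producing vertices. The `Matrix2d` alias and the function names are defined locally for this example and are assumptions, not the library's exact code.

```rust
/// 2x3 affine matrix in row-major order: [[a, b, tx], [c, d, ty]].
/// Defined locally for illustration.
pub type Matrix2d = [[f64; 3]; 2];

/// Applies the affine transform to a point, entirely in f64.
pub fn transform_point(m: &Matrix2d, p: [f64; 2]) -> [f64; 2] {
    [
        m[0][0] * p[0] + m[0][1] * p[1] + m[0][2],
        m[1][0] * p[0] + m[1][1] * p[1] + m[1][2],
    ]
}

/// Triangulates `rect` ([x, y, w, h]) into two transformed triangles.
/// Precision is reduced to f32 only at the very end.
pub fn triangulate_rectangle(m: &Matrix2d, rect: [f64; 4]) -> [[f32; 2]; 6] {
    let [x, y, w, h] = rect;
    let corners = [[x, y], [x + w, y], [x, y + h], [x + w, y + h]];
    // Corner indices of the two triangles covering the rectangle.
    let order = [0, 1, 2, 1, 3, 2];
    let mut out = [[0.0f32; 2]; 6];
    for (slot, &i) in out.iter_mut().zip(order.iter()) {
        let p = transform_point(m, corners[i]);
        *slot = [p[0] as f32, p[1] as f32];
    }
    out
}
```

Since the backend only ever sees finished f32 triangles, it does not need to know anything about the transform that produced them.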

Some questions one might ask:

- Is the 1st step or 2nd step the primary cause of the bottleneck?
- Is it possible to reduce the overhead without making design tradeoffs?

Before making changes to the design, one might consider using the strengths it offers to fix the problem. The largest overhead seems to be the number of draw calls, so we should look for ways to reduce that number first. This happens in the 2nd step, not the 1st!

Batch, batch, batch!

The key insight here is that since piston-graphics triangulates on the CPU, we could pack multiple shapes into the same buffer in the backend. This leads to fewer draw calls when:

- The draw state is the same between calls
- The same shader program is used (colored vs textured)
- The same texture is used
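A minimal sketch of that idea, assuming the three conditions can be captured in a comparable key. `BatchKey`, `Batcher` and the integer ids are hypothetical names for this example; a real backend would upload the buffer and issue the draw call where the counter is incremented.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Shader {
    Colored,
    Textured,
}

/// Everything that must match for two shapes to share one draw call.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct BatchKey {
    pub draw_state: u32,      // hypothetical id standing in for a real draw state
    pub shader: Shader,
    pub texture: Option<u32>, // hypothetical texture id
}

#[derive(Default)]
pub struct Batcher {
    key: Option<BatchKey>,
    vertices: Vec<[f32; 2]>,
    pub draw_calls: usize,
}

impl Batcher {
    /// Appends triangles to the current batch, flushing first if the key changed.
    pub fn push(&mut self, key: BatchKey, triangles: &[[f32; 2]]) {
        if self.key != Some(key) {
            self.flush();
            self.key = Some(key);
        }
        self.vertices.extend_from_slice(triangles);
    }

    /// Issues one draw call for everything collected so far.
    pub fn flush(&mut self) {
        if !self.vertices.is_empty() {
            self.draw_calls += 1; // a real backend would upload and draw here
            self.vertices.clear();
        }
    }
}
```

Pushing 100 colored shapes, then 5 textured ones, then 100 more colored shapes and flushing at the end of the frame gives 3 draw calls in this sketch instead of 205.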

One downside is that many backend instances lead to higher memory usage. Based on experience so far, most applications only use one instance, so I do not think this is a problem.

For example, in Conrod a lot of solid colored shapes are rendered, then some textured shapes (text), then more solid colored shapes, and so on. Currently the CharacterCache backends rasterize each character with FreeType into a separate texture. This means we can reduce the number of draw calls for solid shapes, but not for text.

In the case of text, we could try two different approaches:

1. Pack glyphs in a single image and update a texture
2. Since glyphs are often of similar size, consider using texture arrays
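For the first approach, one simple way to place glyphs in a single image is a shelf packer: fill a row from left to right and start a new row when the glyph no longer fits. This is only a sketch of one possible packing scheme, with a hypothetical `ShelfPacker` type; it is not what any existing glyph cache does.

```rust
/// Places rectangles row by row ("shelves") inside a fixed-size atlas.
pub struct ShelfPacker {
    width: u32,
    height: u32,
    cursor_x: u32,
    shelf_y: u32,
    shelf_height: u32,
}

impl ShelfPacker {
    pub fn new(width: u32, height: u32) -> Self {
        ShelfPacker { width, height, cursor_x: 0, shelf_y: 0, shelf_height: 0 }
    }

    /// Returns the top-left offset for a `w` x `h` glyph, or `None` if it does not fit.
    pub fn insert(&mut self, w: u32, h: u32) -> Option<[u32; 2]> {
        if w > self.width {
            return None;
        }
        if self.cursor_x + w > self.width {
            // The current shelf is full: start a new one below it.
            self.shelf_y += self.shelf_height;
            self.cursor_x = 0;
            self.shelf_height = 0;
        }
        if self.shelf_y + h > self.height {
            return None;
        }
        let offset = [self.cursor_x, self.shelf_y];
        self.cursor_x += w;
        self.shelf_height = self.shelf_height.max(h);
        Some(offset)
    }
}
```

Because glyphs of one font size have similar heights, little space is wasted per shelf, and all glyphs end up in one texture that a batching backend can draw from without switching.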

Number one seems sensible to test first because it would benefit from the same reduction of draw calls. However, it requires some changes:

- Character should take &'a T to the texture, separating offset and size from texture storage internally in the glyph cache
- Change CharacterCache::character to return Character<'a, T>
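Roughly, the changed types could look like the sketch below. The field names, the `GlyphCache` struct and the simplified `character` signature are assumptions made for illustration; the real trait also deals with font size and rasterizing on demand.

```rust
use std::collections::HashMap;

/// Hypothetical glyph record: borrows the shared atlas texture instead of owning one.
pub struct Character<'a, T> {
    pub texture: &'a T,
    pub atlas_offset: [u32; 2],
    pub atlas_size: [u32; 2],
    pub advance: f64,
}

/// Per-glyph data stored separately from the texture storage.
#[derive(Clone, Copy)]
struct Entry {
    offset: [u32; 2],
    size: [u32; 2],
    advance: f64,
}

/// Hypothetical cache that owns one atlas texture of type `T`.
pub struct GlyphCache<T> {
    atlas: T,
    entries: HashMap<char, Entry>,
}

impl<T> GlyphCache<T> {
    pub fn new(atlas: T) -> Self {
        GlyphCache { atlas, entries: HashMap::new() }
    }

    /// Records where a glyph lives inside the atlas.
    pub fn insert(&mut self, ch: char, offset: [u32; 2], size: [u32; 2], advance: f64) {
        self.entries.insert(ch, Entry { offset, size, advance });
    }

    /// Every returned `Character` points at the same atlas texture.
    pub fn character<'a>(&'a self, ch: char) -> Option<Character<'a, T>> {
        self.entries.get(&ch).map(|e| Character {
            texture: &self.atlas,
            atlas_offset: e.offset,
            atlas_size: e.size,
            advance: e.advance,
        })
    }
}
```

Since all characters share one texture reference, consecutive glyphs satisfy the "same texture" condition and can go into the same draw call.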

Alternative: Retained API

By organizing graphic primitives into a tree structure, one can traverse it and optimize the draw calls.
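A toy sketch of what "traverse and optimize" could mean: flatten the tree, then sort shapes by texture so that shapes sharing a texture become adjacent. The `Node` type and function names are hypothetical, and the reordering is only valid under the assumption that shapes do not overlap (or that depth is handled some other way).

```rust
/// Hypothetical retained scene tree.
pub enum Node {
    Group(Vec<Node>),
    Shape { texture: Option<u32> },
}

/// Depth-first traversal collecting the texture key of every shape in draw order.
pub fn flatten(node: &Node, out: &mut Vec<Option<u32>>) {
    match node {
        Node::Group(children) => {
            for child in children {
                flatten(child, out);
            }
        }
        Node::Shape { texture } => out.push(*texture),
    }
}

/// One draw call per run of consecutive shapes with the same texture key.
pub fn count_draw_calls(keys: &[Option<u32>]) -> usize {
    let mut calls = 0;
    let mut last: Option<Option<u32>> = None;
    for &key in keys {
        if last != Some(key) {
            calls += 1;
            last = Some(key);
        }
    }
    calls
}

/// Returns draw calls in tree order and after sorting by texture.
pub fn draw_calls_before_and_after(root: &Node) -> (usize, usize) {
    let mut keys = Vec::new();
    flatten(root, &mut keys);
    let before = count_draw_calls(&keys);
    keys.sort(); // untextured shapes first, then grouped by texture id
    let after = count_draw_calls(&keys);
    (before, after)
}
```

How much this gains depends entirely on the input and on whether reordering is allowed at all.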

While this would be very interesting to work on, there are some major obstacles/unknowns:

- One might want to use one tree structure for both 2D and 3D
- Judging from previous projects, such as Ogre3D, writing a tree structure for 3D is complicated and might take years to mature, and would benefit from new language features in Rust to be extensible
- It is not obvious why a retained API should be much faster than an immediate design writing directly into large buffers, because it depends on the method of sorting/optimization and input data
- It seems to be a better idea to design a retained API based on project experience, rather than to fix a single optimization problem with 2D

Summary of plan

1. Change design of Character and CharacterCache
2. Write to larger buffers in the backends

I believe this plan requires the least effort and the fewest breaking changes. We keep the same overall design of piston-graphics and its existing benefits.

mitchmindtree commented 8 years ago

Sounds great :+1:

bvssvni commented 8 years ago

This also requires changing the shaders from using a uniform color to one color per vertex. Triangles from different shapes get packed into the same buffer, so each vertex must carry its own color.
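For example, the vertex data could look something like this sketch (`ColoredVertex` and `push_shape` are hypothetical names, not the backend's actual layout):

```rust
/// Hypothetical vertex layout with the color stored per vertex
/// instead of in a shader uniform.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct ColoredVertex {
    pub pos: [f32; 2],
    pub color: [f32; 4],
}

/// Appends one shape's triangles to a shared buffer, tagging each vertex with the shape's color.
pub fn push_shape(buffer: &mut Vec<ColoredVertex>, positions: &[[f32; 2]], color: [f32; 4]) {
    buffer.extend(positions.iter().map(|&pos| ColoredVertex { pos, color }));
}
```

A red triangle and a blue triangle pushed into the same buffer can then be drawn with a single call, at the cost of four extra floats per vertex.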

crumblingstatue commented 8 years ago

I absolutely love https://love2d.org/wiki/SpriteBatch. It would be nice to have a similar feature in Piston. It's kind of off-putting when your Rust game runs slower than Lua because of the drawing overhead.

bvssvni commented 8 years ago

@crumblingstatue Can you open a new issue about it? Thanks!

crumblingstatue commented 8 years ago

> @crumblingstatue Can you open a new issue about it? Thanks!

Alright, I opened #1041.

ishitatsuyuki commented 7 years ago

I've found the text renderer horrible. The minimal overhead is about 23 calls/frame (rusttype's gpu_cache example).

However, Piston doesn't batch it at all, and does many context switches like enabling and disabling scissors. This resulted in 1000 calls/frame (and due to the Text implementation, it can increase further with more characters).

This is roughly a 40x to 50x slowdown. Not really affordable.

bvssvni commented 7 years ago

@ishitatsuyuki Yeah, text rendering is really bad right now.

KongouDesu commented 6 years ago

What's the current state of this issue, especially in regard to text rendering?

bvssvni commented 6 years ago

Texture rendering is now significantly faster for the OpenGL backend, but the glyph cache implementation must be changed to take advantage of this optimization.

PistonDevelopers / graphics

Draw calls performance bottleneck #1026
