In order to perform a full GPU transform of vertex data, the following setup would be required in the vertex shader:
```glsl
// Camera-constant data
uniform vec3 camPos;        // Position of the camera in world coordinates
uniform float camFocusDist; // FocusDist of the camera
uniform mat2 camRotation;   // Transformation matrix of the camera's Z rotation

// Object-local data
attribute vec3 vertexLocal; // Object-local vertex position
uniform vec3 objPos;        // Position of the object in world coordinates
uniform float objRotation;  // Object-local rotation

// Draft of the main operations to perform
void main()
{
	// Determine object scale based on camera properties and relative object position
	float objScale = camFocusDist / (objPos.z - camPos.z);

	// Transform local vertex coords to include rotation and scale
	float rotateSin = sin(objRotation);
	float rotateCos = cos(objRotation);
	vec3 localPos = vec3(
		(vertexLocal.x * rotateCos - vertexLocal.y * rotateSin) * objScale,
		(vertexLocal.x * rotateSin + vertexLocal.y * rotateCos) * objScale,
		vertexLocal.z);

	// Determine vertex world position
	vec3 worldPos = localPos + objPos;

	// Transform vertex to view coordinates and account for camera rotation
	vec3 viewPos = worldPos - camPos;
	viewPos = vec3(camRotation * viewPos.xy, viewPos.z);

	// Do OpenGL ortho projection
	gl_Position = gl_ProjectionMatrix * vec4(viewPos, 1.0);
}
```
While Camera-local data is constant after setting up the Camera's RenderPass, object-specific data changes on average every four vertices. Without a very efficient way to store them, this will be the main culprit.
Problems:

- Calling glUniform a few times after every four vertices absolutely kills batching.
  - Investigate OpenGL Uniform buffers and similar concepts. If possible, limit this to OpenGL ES 2.0 supported features. (A rough sketch of the uniform buffer path follows below.)
  - According to docs.gl, Uniform buffers are unavailable in OpenGL 2.1 and ES 2.0 and are first supported in OpenGL 3.0 and ES 3.0. Not supporting OpenGL 2.1 might not be a problem given its age, but OpenGL ES 3.0 seems like a "big" requirement...?
- Duality currently isn't very efficient in storing uniform data material-wise, especially not the kind of uniform data that changes per-object. Creating a new BatchInfo for every object is not a viable option. There needs to be a way to specify "temporary" uniform data per AddVertices call.
- It needs to be super-fast. Seriously. If this should have a chance to become the new default for sprites (99% of objects), this needs to be lightspeed.

These problems require further consideration before this issue can be solved.

/cc @BraveSirAndrew with a vague feeling that he might have a solid opinion or experience with this kind of thing.
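For reference, here's a minimal sketch of the uniform buffer path mentioned above, assuming a GL 3.x / ES 3.0 context; `program`, `objectData` and the `ObjectData` block name are made up for illustration:

```c
/* Sketch only: per-object data in a uniform buffer object (UBO).
   The vertex shader would declare a matching block, e.g.
       uniform ObjectData { vec3 objPos; float objRotation; };  */
GLuint ubo;
glGenBuffers(1, &ubo);
glBindBuffer(GL_UNIFORM_BUFFER, ubo);
glBufferData(GL_UNIFORM_BUFFER, sizeof(objectData), &objectData, GL_DYNAMIC_DRAW);

/* Attach both the shader's block and the buffer to binding point 0. */
GLuint blockIndex = glGetUniformBlockIndex(program, "ObjectData");
glUniformBlockBinding(program, blockIndex, 0);
glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo);
```

Note that even with this, one buffer update per object would still be required, so it mostly replaces several glUniform calls with one buffer call rather than eliminating the per-object work.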
One way to solve these problems would be to store all object-local data in vertex attributes. This would heavily increase data load, but at the same time solve the batching problem and circumvent the API problem. As an optimization, object-local rotation could be performed on the CPU like it is implemented now.
The default vertex format here would then be:
```
Vector3 LocalPosition; // 12 bytes
Vector3 ObjPosition;   // 12 bytes
Vector2 TexCoord;      //  8 bytes
ColorRgba Color;       //  4 bytes

// Total:  36 bytes per vertex
// Before: 24 bytes per vertex
```
That would be 12 bytes larger than before. Also, 36 bytes is quite a bit for this kind of simple 2D data. Potentially, it could be optimized to this:
```
Vector2 LocalPosition; //  8 bytes
Vector3 ObjPosition;   // 12 bytes
Vector2h TexCoord;     //  4 bytes
ColorRgba Color;       //  4 bytes

// Total:      36 bytes per vertex
// Compressed: 28 bytes per vertex
// Before:     24 bytes per vertex
```
Another problem with this approach, and especially with the above vertex shader, is the fact that all existing transformations need to be expressed within its configuration. Both screen overlay and world rendering need to be able to take the same rendering path, because all Materials should be equally usable in both modes - without maintaining two versions of everything and picking "the right one".
An updated version of the above shader (including the removed object rotation and the potential uniform to attribute change) could look like this:
```glsl
// Camera-constant data
uniform vec3 camPos;        // Position of the camera in world coordinates
uniform float camFocusDist; // FocusDist of the camera
uniform mat2 camRotation;   // Transformation matrix of the camera's Z rotation
uniform bool camOnScreen;   // If true, screen transformation is used

// Object-local data
attribute vec3 vertexLocal; // Object-local vertex position
attribute vec3 objPos;      // Position of the object in world coordinates

// Draft of the main operations to perform
void main()
{
	vec3 viewPos;
	if (!camOnScreen)
	{
		// Determine object scale based on camera properties and relative object position
		float objScale = camFocusDist / (objPos.z - camPos.z);

		// Transform local vertex coords to include local scale
		vec3 localPos = vec3(vertexLocal.xy * objScale, vertexLocal.z);

		// Determine vertex world position
		vec3 worldPos = localPos + objPos;

		// Transform vertex to view coordinates and account for camera rotation
		viewPos = worldPos - camPos;
		viewPos = vec3(camRotation * viewPos.xy, viewPos.z);
	}
	else
	{
		// In on-screen mode, just forward the raw positions into view space
		viewPos = objPos + vertexLocal;
	}

	// Do OpenGL ortho projection
	gl_Position = gl_ProjectionMatrix * vec4(viewPos, 1.0);
}
```
Note that, in on-screen mode, none of the camera-related uniforms are used at all.
Adding to the above (solved) problem, the same shader would also need to be configurable to support flat / non-parallax rendering:
```glsl
// Camera-constant data
uniform vec3 camPos;        // Position of the camera in world coordinates
uniform float camFocusDist; // FocusDist of the camera
uniform mat2 camRotation;   // Transformation matrix of the camera's Z rotation
uniform bool camParallax;   // If true, 2D parallax projection is applied by the camera

// Object-local data
attribute vec3 objPos;      // Position of the object in world coordinates

// Vertex-local data
attribute vec3 vertexLocal; // Object-local vertex position

// Draft of the main operations to perform
void main()
{
	vec3 localPos = vertexLocal;

	// Apply parallax 2D projection
	if (camParallax)
	{
		// Determine object scale based on camera properties and relative object position
		float objScale = camFocusDist / (objPos.z - camPos.z);

		// Transform local vertex coords to include local scale
		localPos.xy *= objScale;
	}

	// Determine vertex world position
	vec3 worldPos = localPos + objPos;

	// Transform vertex to view coordinates and account for camera rotation
	vec3 viewPos = worldPos - camPos;
	viewPos = vec3(camRotation * viewPos.xy, viewPos.z);

	// Do OpenGL ortho projection
	gl_Position = gl_ProjectionMatrix * vec4(viewPos, 1.0);
}
```
In this setup, all projection / rendering modes are supported:

- Parallax world rendering: the default, with the camParallax uniform set to true.
- Flat / non-parallax world rendering: set the camParallax uniform to false.
- Screen overlay rendering: set the camParallax uniform to false and specify camPos to be (0, 0, 0), which matches the behavior of the old ftransform-based shader.
So, up to this point, the main issue of both shader- and Duality API considerations seems to be how to transfer object data to the shader in a generic, re-usable way, and how to do so most efficiently.
With regard to the memory bandwidth issue when storing object data in vertex attributes, here's a comparison to put it into perspective:
Duality 2D Vertex Format:
```
Vector2 LocalPosition; //  8 bytes
Vector3 ObjPosition;   // 12 bytes
Vector2h TexCoord;     //  4 bytes
ColorRgba Color;       //  4 bytes

// Total:      36 bytes per vertex
// Compressed: 28 bytes per vertex
// Before:     24 bytes per vertex
```
Somewhat Minimal 3D Game Vertex Format:
```
Vector3 Position;  // 12 bytes
Vector3h Normal;   //  6 bytes
Vector3h Tangent;  //  6 bytes
Vector2h TexCoord; //  4 bytes

// Total:      44 bytes per vertex
// Compressed: 28 bytes per vertex
```
It doesn't seem like that big of a deal in comparison. This might be the way to go here.
Edit: Assuming a game scene with 10000 visible sprites, that would be 40000 vertices per frame. Even when assuming the uncompressed variant with 36 bytes per vertex, that is 40000 vertices × 36 bytes × 60 FPS ≈ 86.4 million bytes, i.e. only around 82 MB per second of bandwidth. Even the old PCI Express 2.0 has a total bandwidth between 500 MB and 8 GB per second, so the above vertex size seems totally manageable. Not sure about mobile platforms though - any insight appreciated.
Hi Adam
I think that the correct way to handle the per-object data in this case would be to use separate streams of data. You could leave the existing vertex formats alone and add another stream of vertex data for object position, rotation, and scale. You can set a divisor on streams in OpenGL so you could say that the GL should only update its index into this second stream for every four vertices processed. That way you're reducing the extra load to only (12 bytes for position + 4 bytes for rotation + 4 bytes for scale) * 10000 = 200k on top of your normal data for 10000 sprites, which is nothing at all! I wouldn't even worry about that on mobile platforms.
> You can set a divisor on streams in OpenGL so you could say that the GL should only update its index into this second stream for every four vertices processed.
This is exactly the kind of thing that I was looking for - a hardcoded one-object-has-four-vertices solution probably won't suffice as a general-purpose method, but if there was a way to just specify an index per vertex, which could then be used to look up some object data from a buffer, this would certainly reduce data load and provide an opportunity for specifying even more complex per-object data.
I'm still doing some research on this, but do you happen to know what keyword I should be looking for?
Edit: Actually, when modifying this to provide "per-primitive data", telling OpenGL to update its index every X vertices would be kind of a general-purpose solution. All it would take would be to extend the AddVertices and IDrawBatch / internal DrawBatch<T> API to include a second per-primitive stream, and all the rest could be done by the backend.
Not sure how that would affect vertex upload performance though, since every batch would then require a binding swap and two consecutive uploads - I suppose this shouldn't have a noticeable effect.
Edit: After looking a bit into this, multiple sources tell me that specifying vertex data per-primitive or specifying distinct index buffers for different attributes isn't really possible unless using GL3.x buffer textures with negative performance implications. If nothing else turns up, I guess I'm back at the initial solution of specifying object data per vertex. :|
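For reference, that dismissed buffer texture route would look roughly like this - a sketch with hypothetical names, GL 3.x only, where object data is stored in a buffer that the vertex shader indexes via texelFetch:

```c
/* Sketch of the GL 3.x buffer texture route: object data lives in a buffer
   the vertex shader can index per vertex. The shader side (#version 130+)
   would look roughly like:
       uniform samplerBuffer objData;
       in float objIndex;
       ...
       vec4 objPosRot = texelFetch(objData, int(objIndex));  */
GLuint dataTex;
glGenTextures(1, &dataTex);
glBindTexture(GL_TEXTURE_BUFFER, dataTex);

/* Expose the object data buffer as a 1D stream of RGBA32F texels:
   xyz = object position, w = rotation. objDataBuffer is hypothetical. */
glTexBuffer(GL_TEXTURE_BUFFER, GL_RGBA32F, objDataBuffer);
```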
Edit: Found the divisor command (glVertexAttribDivisor). It's only available in OpenGL 3.3 and ES 3.0. OpenGL 3.3 is fine for desktop machines, but ES 3.0 worries me a little: using it as a base requirement would rule out most mobile devices. Fallback code in the backend could upload the same vertex data N times, but spamming OpenGL calls like that probably isn't a great idea, so that fallback isn't very attractive. Another option would be to expand the vertex data on the CPU before submitting it, which isn't great either, especially since that would only happen on devices that aren't very powerful in the first place.
Edit: It also seems that the divisor feature is only available when doing instanced rendering, not on a regular / continuous stream of vertices (?), which might be an issue.
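For illustration, this is roughly what the instanced, divisor-based path would look like - a sketch assuming GL 3.3 / ES 3.0, with hypothetical buffer handles and attribute locations, and each sprite drawn as one quad instance:

```c
/* Stream 0: the four corners of a unit quad, advanced once per vertex. */
glBindBuffer(GL_ARRAY_BUFFER, quadVertexBuffer);
glEnableVertexAttribArray(locVertexLocal);
glVertexAttribPointer(locVertexLocal, 3, GL_FLOAT, GL_FALSE, 0, 0);
glVertexAttribDivisor(locVertexLocal, 0); /* default: advance per vertex */

/* Stream 1: one (position, rotation) record per sprite, advanced per instance. */
glBindBuffer(GL_ARRAY_BUFFER, spriteDataBuffer);
glEnableVertexAttribArray(locObjPos);
glEnableVertexAttribArray(locObjRot);
glVertexAttribPointer(locObjPos, 3, GL_FLOAT, GL_FALSE, 16, (void*)0);
glVertexAttribPointer(locObjRot, 1, GL_FLOAT, GL_FALSE, 16, (void*)12);
glVertexAttribDivisor(locObjPos, 1); /* advance once per instance */
glVertexAttribDivisor(locObjRot, 1);

/* One instance per sprite: the quad vertices repeat, the divisor streams advance. */
glDrawArraysInstanced(GL_TRIANGLE_STRIP, 0, 4, spriteCount);
```

This also matches the caveat above: the divisor only advances per instance, so a continuous, non-instanced vertex stream can't use it directly.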
When taking into account advanced shaders such as lighting, they also require information about an object's local rotation, so they can interpret its normal map accordingly.
In these cases, an advanced vertex format could be used, but incidentally, object rotation was also part of the initial vertex format draft. So, maybe it does have its place there, as would object-local rotation in the shader:
```
// Per-Object / Per-Primitive data
Vector3 ObjPosition; // 12 bytes
Half ObjRotation;    //  2 bytes

// Actual Per-Vertex data
Vector2 LocalPosition; // 8 bytes
Vector2h TexCoord;     // 4 bytes
ColorRgba Color;       // 4 bytes

// Total:      40 bytes per vertex
// Compressed: 30 bytes per vertex
// Before:     24 bytes per vertex
```
Updated shader:
```glsl
// Camera-constant data
uniform vec3 camPos;        // Position of the camera in world coordinates
uniform float camFocusDist; // FocusDist of the camera
uniform mat2 camRotation;   // Transformation matrix of the camera's Z rotation
uniform bool camParallax;   // If true, 2D parallax projection is applied by the camera

// Object-local data
attribute vec3 objPos;      // Position of the object in world coordinates
attribute float objRot;     // Rotation of the object in degrees (to better use Half Float precision)

// Vertex-local data
attribute vec3 vertexLocal; // Object-local vertex position

// Draft of the main operations to perform
void main()
{
	vec3 localPos = vertexLocal;

	// Apply parallax 2D projection
	if (camParallax)
	{
		// Determine object scale based on camera properties and relative object position
		float objScale = camFocusDist / (objPos.z - camPos.z);

		// Transform local vertex coords according to parallax scale
		localPos.xy *= objScale;
	}

	// Apply local object rotation to vertex coords
	float objRotRadians = radians(objRot);
	float rotSin = sin(objRotRadians);
	float rotCos = cos(objRotRadians);
	localPos = vec3(
		localPos.x * rotCos - localPos.y * rotSin,
		localPos.x * rotSin + localPos.y * rotCos,
		localPos.z);

	// Determine vertex world position
	vec3 worldPos = localPos + objPos;

	// Transform vertex to view coordinates and account for camera rotation
	vec3 viewPos = worldPos - camPos;
	viewPos = vec3(camRotation * viewPos.xy, viewPos.z);

	// Do OpenGL ortho projection
	gl_Position = gl_ProjectionMatrix * vec4(viewPos, 1.0);
}
```
With the vertex format growing again despite compression efforts, storing per-object / per-primitive data beside vertex data like this should be considered really, really carefully. Continuing to look out for alternatives.
Can a TexCoord really be compressed using half floats? A half float offers a precision of about three decimal digits between zero and one. However, with a sprite sheet larger than 1024², the precision required to address each individual texel exceeds that. In 2D games, some of which will require pixel-perfect rendering, this is not viable. Therefore, TexCoord needs to use a higher precision.
With that change, the only attribute left compressed is the object rotation, which saves just two bytes. Might as well use full precision and store rotations directly in radians then, with the added benefit of clarity, and of not having to introduce Half Float types to DualityPrimitives or require OpenGL support for them.
```
// Per-Object / Per-Primitive data
Vector3 ObjPosition; // 12 bytes
float ObjRotation;   //  4 bytes

// Actual Per-Vertex data
Vector2 LocalPosition; // 8 bytes
Vector2 TexCoord;      // 8 bytes
ColorRgba Color;       // 4 bytes

// Total:  36 bytes per vertex
// Before: 24 bytes per vertex
```
Maybe I've just grown accustomed to this data growth, but 36 bytes per vertex doesn't seem that bad at this point. Feedback by graphics programmers appreciated.
All this vertex format extension stuff doesn't sound that great. Let's take a step back: object-local transformation (position, rotation, scale) can stay on the CPU as it is implemented now. With vertices already submitted in world coordinates by each ICmpRenderer, all that's left to transform is everything relative to the Camera: position, parallax scale and rotation. So, since additional per-object information is no longer required, here's the updated shader:
```glsl
// Camera-constant data
uniform vec3 camPos;      // Position of the camera in world coordinates
uniform float camZoom;    // Zoom factor of the camera
uniform mat2 camRotation; // Transformation matrix of the camera's Z rotation
uniform bool camParallax; // If true, 2D parallax projection is applied by the camera

// Vertex data
attribute vec3 vertexWorldPos; // The world position of the vertex
attribute float vertexZOffset; // Optional: The (sorting) Z offset that shouldn't affect parallax scale

// Draft of the main operations to perform
void main()
{
	vec3 viewPos;

	// This could be moved to a Duality-builtin vertex shader function which
	// transforms a world coordinate into a view coordinate.
	{
		// Apply parallax 2D projection
		float parallaxScale;
		if (camParallax)
		{
			// Determine scale based on camera properties and relative vertex position
			parallaxScale = camZoom / (vertexWorldPos.z - camPos.z);
		}
		else
		{
			// Apply a global scale factor
			parallaxScale = camZoom;
		}

		// Transform vertex to view coordinates and account for parallax scale,
		// camera rotation and Z offset
		viewPos = vertexWorldPos - camPos;
		viewPos.xy *= parallaxScale;
		viewPos = vec3(camRotation * viewPos.xy, viewPos.z + vertexZOffset);
	}

	// Do OpenGL ortho projection
	gl_Position = gl_ProjectionMatrix * vec4(viewPos, 1.0);
}
```
The Z offset in the above shader would be an optional vertex attribute, so non-parallax depth sorting offsets can still be added. If not specified in the vertex stream, its value would naturally fall back to zero.
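In the backend, that fallback could look like this - a sketch, with locVertexZOffset being a hypothetical attribute location:

```c
/* Sketch: if a vertex format doesn't provide the optional Z offset attribute,
   disable the array and let a constant zero stand in for all vertices. */
glDisableVertexAttribArray(locVertexZOffset);
glVertexAttrib1f(locVertexZOffset, 0.0f);
```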
Note that IVertexData and DrawBatch<T> will need to be adjusted to account for the fact that the Z offset is now a distinct attribute, and no longer included in the Pos.Z coordinate. The Canvas class might need to be adjusted as well.
The new default vertex format, which specifies these attributes:
```
Vector3 Position; // 12 bytes
Vector2 TexCoord; //  8 bytes
ColorRgba Color;  //  4 bytes
float Offset;     //  4 bytes [Optional]

// Total:  28 bytes per vertex
// Before: 24 bytes per vertex
```
As an additional improvement, Duality shaders could be updated to feature builtin functions (besides the already existing builtin uniforms), which could provide a standard vertex transformation. This would add some more flexibility to change the exact transformation code later while still keeping old shader code working.
Implications of the above draft:

- No more PreprocessCoords: improved usability. Just specify an object's vertices in world coordinates and be done with it.
- Less CPU-side transform work in ICmpRenderer Components, more work done by the GPU, which doesn't really mind anyway here.

Usability++ Performance++ Cleanliness+
It should be possible to test the new transform and shader as a heads-up without changing anything in the core:

- Use a custom shader that implements the new transformation in place of the Minimal vertex shader.
- Skip PreprocessCoords and submit vertices in world space instead.
Progress on implementation:

- Created the develop-3.0-cam-vertex-transform branch to work on this.
- Added shader source preprocessing to AbstractShader: a #version directive is inserted at the top, and #line directives keep compiler error messages useful.
- Implemented the ShaderSourceBuilder utility class in a first iteration, which merges shader source with various chunks of shared code, and used it as part of preprocessing to merge builtin code with the actual shader source.
- Removed the Minimal and sample shaders, as they are no longer needed and in fact now caused errors.
- Added a DepthOffset vertex attribute to all vertex formats in Duality and the sample projects. Renderers now write their depth offset to the DepthOffset attribute instead of adding it to their position. Also adjusted Canvas accordingly.
- Renamed ModelViewMatrix to ViewMatrix in most occurrences, to reflect what Duality is actually doing.
- Renamed RenderMatrix to RenderMode and its fields to World and Screen.
- Renamed PerspectiveMode to ProjectionMode and its fields to Orthographic and Perspective.
- Renamed DrawDevice.IsCoordInView to IsSphereVisible and added a draft implementation - it didn't do proper culling at first, but the IsSphereInView implementation has since been fixed and tested in all projection and render modes.
- Removed PreprocessCoords from the DrawDevice API entirely; renderers skip it and submit vertices in world space instead.
- Cleaned up Canvas code a bit.
- Updated NearZ values in all samples to use the new default of 50.

Remaining ToDo:

- Builtin shaders still use ftransform(). They can be replaced with the new transformation code over the course of implementing this issue.
- Now that PreprocessCoords is gone, investigate whether other methods are no longer necessary as well.
- Investigate in the *= clampedNear / focusDist equation whether the near dist really needs to be part of that, or is actually destructive in cases where it's not 1.0f. Tests seem to indicate that it's correct, but need to verify.
- Optimize IsSphereInView / object culling if necessary - measurements so far show the cost of IsSphereInView being somewhat negligible compared to other factors.
Right now, parts of the vertex transformation in rendering happen on the CPU, using PreprocessCoords or manually. This approach has several problems, but it also solves one. If there is a way to solve that same problem using a GPU vertex transform approach, there's no reason not to move all vertex transform calculations to the GPU for better shader support and performance. Customized solutions could still be implemented using custom shaders.