In order to perform a full GPU transform of vertex data, the following setup would be required in the vertex shader:
```glsl
// Camera-constant data
uniform vec3 camPos;        // Position of the camera in world coordinates
uniform float camFocusDist; // FocusDist of the camera
uniform mat2 camRotation;   // Transformation matrix of the camera's Z rotation

// Object-local data
attribute vec3 vertexLocal; // Object-local vertex position
uniform vec3 objPos;        // Position of the object in world coordinates
uniform float objRotation;  // Object-local rotation

// Draft of the main operations to perform
void main()
{
	// Determine object scale based on camera properties and relative object position
	float objScale = camFocusDist / (objPos.z - camPos.z);

	// Transform local vertex coords to include rotation and scale
	float rotateSin = sin(objRotation);
	float rotateCos = cos(objRotation);
	vec3 localPos = vec3(
		(vertexLocal.x * rotateCos - vertexLocal.y * rotateSin) * objScale,
		(vertexLocal.x * rotateSin + vertexLocal.y * rotateCos) * objScale,
		vertexLocal.z);

	// Determine vertex world position
	vec3 worldPos = localPos + objPos;

	// Transform vertex to view coordinates and account for camera rotation
	vec3 viewPos = worldPos - camPos;
	viewPos = vec3(camRotation * viewPos.xy, viewPos.z);

	// Do OpenGL ortho projection
	gl_Position = gl_ProjectionMatrix * vec4(viewPos, 1.0);
}
```
While Camera-local data is constant after setting up the Camera's RenderPass, object-specific data changes on average every four vertices. Without a very efficient way to store them, this will be the main culprit.
Problems:

- Calling glUniform a few times after every four vertices absolutely kills batching.
  - Investigate OpenGL Uniform buffers and similar concepts. If possible, limit this to OpenGL ES 2.0 supported features. (A rough sketch of the uniform buffer path follows below.)
  - According to docs.gl, Uniform buffers are unavailable in OpenGL 2.1 and ES 2.0 and are first supported in OpenGL 3.0 and ES 3.0. Not supporting OpenGL 2.1 might not be a problem given its age, but OpenGL ES 3.0 seems like a "big" requirement...?
- Duality currently isn't very efficient in storing uniform data material-wise, especially not the kind of uniform data that changes per-object. Creating a new BatchInfo for every object is not a viable option. There needs to be a way to specify "temporary" uniform data per AddVertices call.
- It needs to be super-fast. Seriously. If this should have a chance to become the new default for sprites (99% of objects), this needs to be lightspeed.

These problems require further consideration before this issue can be solved.

/cc @BraveSirAndrew with a vague feeling that he might have a solid opinion or experience with this kind of thing.
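For reference, here's a minimal sketch of the uniform buffer path mentioned above, assuming a GL 3.x / ES 3.0 context; `program`, `objectData` and the `ObjectData` block name are made up for illustration:

```c
/* Sketch only: per-object data in a uniform buffer object (UBO).
   The vertex shader would declare a matching block, e.g.
       uniform ObjectData { vec3 objPos; float objRotation; };  */
GLuint ubo;
glGenBuffers(1, &ubo);
glBindBuffer(GL_UNIFORM_BUFFER, ubo);
glBufferData(GL_UNIFORM_BUFFER, sizeof(objectData), &objectData, GL_DYNAMIC_DRAW);

/* Attach both the shader's block and the buffer to binding point 0. */
GLuint blockIndex = glGetUniformBlockIndex(program, "ObjectData");
glUniformBlockBinding(program, blockIndex, 0);
glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo);
```

Note that even with this, one buffer update per object would still be required, so it mostly replaces several glUniform calls with one buffer call rather than eliminating the per-object work.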
One way to solve these problems would be to store all object-local data in vertex attributes. This would heavily increase data load, but at the same time solve the batching problem and circumvent the API problem. As an optimization, object-local rotation could be performed on the CPU like it is implemented now.
The default vertex format here would then be:
```
Vector3 LocalPosition; // 12 bytes
Vector3 ObjPosition;   // 12 bytes
Vector2 TexCoord;      //  8 bytes
ColorRgba Color;       //  4 bytes

// Total:  36 bytes per vertex
// Before: 24 bytes per vertex
```
That would be 12 bytes larger than before. Also, 36 bytes is quite a bit for this kind of simple 2D data. Potentially, it could be optimized to this:
```
Vector2 LocalPosition; //  8 bytes
Vector3 ObjPosition;   // 12 bytes
Vector2h TexCoord;     //  4 bytes
ColorRgba Color;       //  4 bytes

// Total:      36 bytes per vertex
// Compressed: 28 bytes per vertex
// Before:     24 bytes per vertex
```
Another problem with this approach, and especially with the above vertex shader, is the fact that all existing transformations need to be expressed within its configuration. Both screen overlay and world rendering need to be able to take the same rendering path, because all Materials should be equally usable in both modes - without maintaining two versions of everything and picking "the right one".
An updated version of the above shader (including the removed object rotation and the potential uniform to attribute change) could look like this:
```glsl
// Camera-constant data
uniform vec3 camPos;        // Position of the camera in world coordinates
uniform float camFocusDist; // FocusDist of the camera
uniform mat2 camRotation;   // Transformation matrix of the camera's Z rotation
uniform bool camOnScreen;   // If true, screen transformation is used

// Object-local data
attribute vec3 vertexLocal; // Object-local vertex position
attribute vec3 objPos;      // Position of the object in world coordinates

// Draft of the main operations to perform
void main()
{
	vec3 viewPos;
	if (!camOnScreen)
	{
		// Determine object scale based on camera properties and relative object position
		float objScale = camFocusDist / (objPos.z - camPos.z);

		// Transform local vertex coords to include local scale
		vec3 localPos = vec3(vertexLocal.xy * objScale, vertexLocal.z);

		// Determine vertex world position
		vec3 worldPos = localPos + objPos;

		// Transform vertex to view coordinates and account for camera rotation
		viewPos = worldPos - camPos;
		viewPos = vec3(camRotation * viewPos.xy, viewPos.z);
	}
	else
	{
		// In on-screen mode, just forward the raw positions into view space
		viewPos = objPos + vertexLocal;
	}

	// Do OpenGL ortho projection
	gl_Position = gl_ProjectionMatrix * vec4(viewPos, 1.0);
}
```
Note that, in on-screen mode, none of the camera-related uniforms are used at all.
Adding to the above (solved) problem, the same shader would also need to be configurable to support flat / non-parallax rendering:
```glsl
// Camera-constant data
uniform vec3 camPos;        // Position of the camera in world coordinates
uniform float camFocusDist; // FocusDist of the camera
uniform mat2 camRotation;   // Transformation matrix of the camera's Z rotation
uniform bool camParallax;   // If true, 2D parallax projection is applied by the camera

// Object-local data
attribute vec3 objPos;      // Position of the object in world coordinates

// Vertex-local data
attribute vec3 vertexLocal; // Object-local vertex position

// Draft of the main operations to perform
void main()
{
	vec3 localPos = vertexLocal;

	// Apply parallax 2D projection
	if (camParallax)
	{
		// Determine object scale based on camera properties and relative object position
		float objScale = camFocusDist / (objPos.z - camPos.z);

		// Transform local vertex coords to include local scale
		localPos.xy *= objScale;
	}

	// Determine vertex world position
	vec3 worldPos = localPos + objPos;

	// Transform vertex to view coordinates and account for camera rotation
	vec3 viewPos = worldPos - camPos;
	viewPos = vec3(camRotation * viewPos.xy, viewPos.z);

	// Do OpenGL ortho projection
	gl_Position = gl_ProjectionMatrix * vec4(viewPos, 1.0);
}
```
In this setup, all projection / rendering modes are supported:

- Parallax world rendering: the default, with the camParallax uniform set to true.
- Flat / non-parallax world rendering: set the camParallax uniform to false.
- Screen overlay rendering: set the camParallax uniform to false and specify camPos to be (0, 0, 0), which matches the behavior of the old ftransform-based shader.
So, up to this point, the main issue of both shader- and Duality API considerations seems to be how to transfer object data to the shader in a generic, re-usable way, and how to do so most efficiently.
With regard to the memory bandwidth issue when storing object data in vertex attributes, here's a comparison to put it into perspective:
Duality 2D Vertex Format:
```
Vector2 LocalPosition; //  8 bytes
Vector3 ObjPosition;   // 12 bytes
Vector2h TexCoord;     //  4 bytes
ColorRgba Color;       //  4 bytes

// Total:      36 bytes per vertex
// Compressed: 28 bytes per vertex
// Before:     24 bytes per vertex
```
Somewhat Minimal 3D Game Vertex Format:
```
Vector3 Position;  // 12 bytes
Vector3h Normal;   //  6 bytes
Vector3h Tangent;  //  6 bytes
Vector2h TexCoord; //  4 bytes

// Total:      44 bytes per vertex
// Compressed: 28 bytes per vertex
```
It doesn't seem like that big of a deal in comparison. This might be the way to go here.
Edit: Assuming a game scene with 10000 visible sprites, that would be 40000 vertices per frame. Even when assuming the uncompressed variant with 36 bytes per vertex, that is 40000 vertices × 36 bytes × 60 FPS ≈ 86.4 million bytes, i.e. only around 82 MB per second of bandwidth. Even the old PCI Express 2.0 has a total bandwidth between 500 MB and 8 GB per second, so the above vertex size seems totally manageable. Not sure about mobile platforms though - any insight appreciated.
Hi Adam
I think that the correct way to handle the per-object data in this case would be to use separate streams of data. You could leave the existing vertex formats alone and add another stream of vertex data for object position, rotation, and scale. You can set a divisor on streams in OpenGL so you could say that the GL should only update its index into this second stream for every four vertices processed. That way you're reducing the extra load to only (12 bytes for position + 4 bytes for rotation + 4 bytes for scale) * 10000 = 200k on top of your normal data for 10000 sprites, which is nothing at all! I wouldn't even worry about that on mobile platforms.
> You can set a divisor on streams in OpenGL so you could say that the GL should only update its index into this second stream for every four vertices processed.
This is exactly the kind of thing that I was looking for - a hardcoded one-object-has-four-vertices solution probably won't suffice as a general-purpose method, but if there was a way to just specify an index per vertex, which could then be used to look up some object data from a buffer, this would certainly reduce data load and provide an opportunity for specifying even more complex per-object data.
I'm still doing some research on this, but do you happen to know what keyword I should be looking for?
Edit: Actually, when modifying this to provide "per-primitive data", telling OpenGL to update its index every X vertices would be kind of a general-purpose solution. All it would take would be to extend the AddVertices and IDrawBatch / internal DrawBatch<T> API to include a second per-primitive stream, and all the rest could be done by the backend.
Not sure how that would affect vertex upload performance though, since every batch would then require a binding swap and two consecutive uploads - I suppose this shouldn't have a noticeable effect.
Edit: After looking a bit into this, multiple sources tell me that specifying vertex data per-primitive or specifying distinct index buffers for different attributes isn't really possible unless using GL3.x buffer textures with negative performance implications. If nothing else turns up, I guess I'm back at the initial solution of specifying object data per vertex. :|
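For reference, that dismissed buffer texture route would look roughly like this - a sketch with hypothetical names, GL 3.x only, where object data is stored in a buffer that the vertex shader indexes via texelFetch:

```c
/* Sketch of the GL 3.x buffer texture route: object data lives in a buffer
   the vertex shader can index per vertex. The shader side (#version 130+)
   would look roughly like:
       uniform samplerBuffer objData;
       in float objIndex;
       ...
       vec4 objPosRot = texelFetch(objData, int(objIndex));  */
GLuint dataTex;
glGenTextures(1, &dataTex);
glBindTexture(GL_TEXTURE_BUFFER, dataTex);

/* Expose the object data buffer as a 1D stream of RGBA32F texels:
   xyz = object position, w = rotation. objDataBuffer is hypothetical. */
glTexBuffer(GL_TEXTURE_BUFFER, GL_RGBA32F, objDataBuffer);
```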
Edit: Found the divisor command (glVertexAttribDivisor). It's only available in OpenGL 3.3 and ES 3.0. OpenGL 3.3 is fine for desktop machines, but ES 3.0 worries me a little: using it as a base requirement would rule out most mobile devices. Fallback code in the backend could upload the same vertex data N times, but spamming OpenGL calls like that probably isn't a great idea, so that fallback isn't very attractive. Another option would be to expand the vertex data on the CPU before submitting it, which isn't great either, especially since that would only happen on devices that aren't very powerful in the first place.
Edit: It also seems that the divisor feature is only available when doing instanced rendering, not on a regular / continuous stream of vertices (?), which might be an issue.
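For illustration, this is roughly what the instanced, divisor-based path would look like - a sketch assuming GL 3.3 / ES 3.0, with hypothetical buffer handles and attribute locations, and each sprite drawn as one quad instance:

```c
/* Stream 0: the four corners of a unit quad, advanced once per vertex. */
glBindBuffer(GL_ARRAY_BUFFER, quadVertexBuffer);
glEnableVertexAttribArray(locVertexLocal);
glVertexAttribPointer(locVertexLocal, 3, GL_FLOAT, GL_FALSE, 0, 0);
glVertexAttribDivisor(locVertexLocal, 0); /* default: advance per vertex */

/* Stream 1: one (position, rotation) record per sprite, advanced per instance. */
glBindBuffer(GL_ARRAY_BUFFER, spriteDataBuffer);
glEnableVertexAttribArray(locObjPos);
glEnableVertexAttribArray(locObjRot);
glVertexAttribPointer(locObjPos, 3, GL_FLOAT, GL_FALSE, 16, (void*)0);
glVertexAttribPointer(locObjRot, 1, GL_FLOAT, GL_FALSE, 16, (void*)12);
glVertexAttribDivisor(locObjPos, 1); /* advance once per instance */
glVertexAttribDivisor(locObjRot, 1);

/* One instance per sprite: the quad vertices repeat, the divisor streams advance. */
glDrawArraysInstanced(GL_TRIANGLE_STRIP, 0, 4, spriteCount);
```

This also matches the caveat above: the divisor only advances per instance, so a continuous, non-instanced vertex stream can't use it directly.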
When taking into account advanced shaders such as lighting, they also require information about an object's local rotation, so they can interpret its normal map accordingly.
In these cases, an advanced vertex format could be used, but incidentally, object rotation was also part of the initial vertex format draft. So, maybe it does have its place there, as would object-local rotation in the shader:
```
// Per-Object / Per-Primitive data
Vector3 ObjPosition; // 12 bytes
Half ObjRotation;    //  2 bytes

// Actual Per-Vertex data
Vector2 LocalPosition; // 8 bytes
Vector2h TexCoord;     // 4 bytes
ColorRgba Color;       // 4 bytes

// Total:      40 bytes per vertex
// Compressed: 30 bytes per vertex
// Before:     24 bytes per vertex
```
Updated shader:
```glsl
// Camera-constant data
uniform vec3 camPos;        // Position of the camera in world coordinates
uniform float camFocusDist; // FocusDist of the camera
uniform mat2 camRotation;   // Transformation matrix of the camera's Z rotation
uniform bool camParallax;   // If true, 2D parallax projection is applied by the camera

// Object-local data
attribute vec3 objPos;      // Position of the object in world coordinates
attribute float objRot;     // Rotation of the object in degrees (to better use Half Float precision)

// Vertex-local data
attribute vec3 vertexLocal; // Object-local vertex position

// Draft of the main operations to perform
void main()
{
	vec3 localPos = vertexLocal;

	// Apply parallax 2D projection
	if (camParallax)
	{
		// Determine object scale based on camera properties and relative object position
		float objScale = camFocusDist / (objPos.z - camPos.z);

		// Transform local vertex coords according to parallax scale
		localPos.xy *= objScale;
	}

	// Apply local object rotation to vertex coords
	float objRotRadians = radians(objRot);
	float rotSin = sin(objRotRadians);
	float rotCos = cos(objRotRadians);
	localPos = vec3(
		localPos.x * rotCos - localPos.y * rotSin,
		localPos.x * rotSin + localPos.y * rotCos,
		localPos.z);

	// Determine vertex world position
	vec3 worldPos = localPos + objPos;

	// Transform vertex to view coordinates and account for camera rotation
	vec3 viewPos = worldPos - camPos;
	viewPos = vec3(camRotation * viewPos.xy, viewPos.z);

	// Do OpenGL ortho projection
	gl_Position = gl_ProjectionMatrix * vec4(viewPos, 1.0);
}
```
With the vertex format growing again despite compression efforts, storing per-object / per-primitive data beside vertex data like this should be considered really, really carefully. Continuing to look out for alternatives.
Can a TexCoord really be compressed using half floats? A half float offers a precision of about three decimal digits between zero and one. However, with a sprite sheet larger than 1024², the precision required to address each individual texel exceeds that. In 2D games, some of which will require pixel-perfect rendering, this is not viable. Therefore, TexCoord needs to use a higher precision.
With that change, the only attribute left compressed is the object rotation, which saves just two bytes. Might as well use full precision and store rotations directly in radians then, with the added benefit of clarity, and of not having to introduce Half Float types to DualityPrimitives or require OpenGL support for them.
```
// Per-Object / Per-Primitive data
Vector3 ObjPosition; // 12 bytes
float ObjRotation;   //  4 bytes

// Actual Per-Vertex data
Vector2 LocalPosition; // 8 bytes
Vector2 TexCoord;      // 8 bytes
ColorRgba Color;       // 4 bytes

// Total:  36 bytes per vertex
// Before: 24 bytes per vertex
```
Maybe I've just grown accustomed to this data growth, but 36 bytes per vertex doesn't seem that bad at this point. Feedback by graphics programmers appreciated.
All this vertex format extension stuff doesn't sound that great. Let's take a step back: object-local transformation (position, rotation, scale) can stay on the CPU as it is implemented now. With vertices already submitted in world coordinates by each ICmpRenderer, all that's left to transform is everything relative to the Camera: position, parallax scale and rotation. So, since additional per-object information is no longer required, here's the updated shader:
```glsl
// Camera-constant data
uniform vec3 camPos;      // Position of the camera in world coordinates
uniform float camZoom;    // Zoom factor of the camera
uniform mat2 camRotation; // Transformation matrix of the camera's Z rotation
uniform bool camParallax; // If true, 2D parallax projection is applied by the camera

// Vertex data
attribute vec3 vertexWorldPos; // The world position of the vertex
attribute float vertexZOffset; // Optional: The (sorting) Z offset that shouldn't affect parallax scale

// Draft of the main operations to perform
void main()
{
	vec3 viewPos;

	// This could be moved to a Duality-builtin vertex shader function which
	// transforms a world coordinate into a view coordinate.
	{
		// Apply parallax 2D projection
		float parallaxScale;
		if (camParallax)
		{
			// Determine scale based on camera properties and relative vertex position
			parallaxScale = camZoom / (vertexWorldPos.z - camPos.z);
		}
		else
		{
			// Apply a global scale factor
			parallaxScale = camZoom;
		}

		// Transform vertex to view coordinates and account for parallax scale,
		// camera rotation and Z offset
		viewPos = vertexWorldPos - camPos;
		viewPos.xy *= parallaxScale;
		viewPos = vec3(camRotation * viewPos.xy, viewPos.z + vertexZOffset);
	}

	// Do OpenGL ortho projection
	gl_Position = gl_ProjectionMatrix * vec4(viewPos, 1.0);
}
```
The Z offset in the above shader would be an optional vertex attribute, so non-parallax depth sorting offsets can still be added. If not specified in the vertex stream, its value would naturally fall back to zero.
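In the backend, that fallback could look like this - a sketch, with locVertexZOffset being a hypothetical attribute location:

```c
/* Sketch: if a vertex format doesn't provide the optional Z offset attribute,
   disable the array and let a constant zero stand in for all vertices. */
glDisableVertexAttribArray(locVertexZOffset);
glVertexAttrib1f(locVertexZOffset, 0.0f);
```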
Note that IVertexData and DrawBatch<T> will need to be adjusted to account for the fact that the Z offset is now a distinct attribute, and no longer included in the Pos.Z coordinate. The Canvas class might need to be adjusted as well.
The new default vertex format, which specifies these attributes:
```
Vector3 Position; // 12 bytes
Vector2 TexCoord; //  8 bytes
ColorRgba Color;  //  4 bytes
float Offset;     //  4 bytes [Optional]

// Total:  28 bytes per vertex
// Before: 24 bytes per vertex
```
As an additional improvement, Duality shaders could be updated to feature builtin functions (besides the already existing builtin uniforms), which could provide a standard vertex transformation. This would add some more flexibility to change the exact transformation code later while still keeping old shader code working.
Implications of the above draft:

- No more PreprocessCoords: improved usability. Just specify an object's vertices in world coordinates and be done with it.
- Less CPU-side transform work in ICmpRenderer Components, more work done by the GPU, which doesn't really mind anyway here.

Usability++ Performance++ Cleanliness+
It should be possible to test the new transform and shader as a heads-up without changing anything in the core:

- Use a custom shader that implements the new transformation in place of the Minimal vertex shader.
- Skip PreprocessCoords and submit vertices in world space instead.
Progress on implementation:

- Created the develop-3.0-cam-vertex-transform branch to work on this.
- Added shader source preprocessing to AbstractShader: a #version directive is inserted at the top, and #line directives keep compiler error messages useful.
- Implemented the ShaderSourceBuilder utility class in a first iteration, which merges shader source with various chunks of shared code, and used it as part of preprocessing to merge builtin code with the actual shader source.
- Removed the Minimal and sample shaders, as they are no longer needed and in fact now caused errors.
- Added a DepthOffset vertex attribute to all vertex formats in Duality and the sample projects. Renderers now write their depth offset to the DepthOffset attribute instead of adding it to their position. Also adjusted Canvas accordingly.
- Renamed ModelViewMatrix to ViewMatrix in most occurrences, to reflect what Duality is actually doing.
- Renamed RenderMatrix to RenderMode and its fields to World and Screen.
- Renamed PerspectiveMode to ProjectionMode and its fields to Orthographic and Perspective.
- Renamed DrawDevice.IsCoordInView to IsSphereVisible and added a draft implementation - it didn't do proper culling at first, but the IsSphereInView implementation has since been fixed and tested in all projection and render modes.
- Removed PreprocessCoords from the DrawDevice API entirely; renderers skip it and submit vertices in world space instead.
- Cleaned up Canvas code a bit.
- Updated NearZ values in all samples to use the new default of 50.

Remaining ToDo:

- Builtin shaders still use ftransform(). They can be replaced with the new transformation code over the course of implementing this issue.
- Now that PreprocessCoords is gone, investigate whether other methods are no longer necessary as well.
- Investigate in the *= clampedNear / focusDist equation whether the near dist really needs to be part of that, or is actually destructive in cases where it's not 1.0f. Tests seem to indicate that it's correct, but need to verify.
- Optimize IsSphereInView / object culling if necessary - measurements so far show the cost of IsSphereInView being somewhat negligible compared to other factors.
Right now, parts of the vertex transformation in rendering happen on the CPU, using PreprocessCoords or manually. This approach has several problems, but it also solves one. If there is a way to solve that same problem using a GPU vertex transform approach, there's no reason not to move all vertex transform calculations to the GPU for better shader support and performance. Customized solutions could still be implemented using custom shaders.