Non-VBO IQM model code is twice slower than MD5 model code

illwieckz commented 4 years ago

So, the fact that the game gets a significant performance drop on some older GPUs is neither just an asset issue nor a game-specific one. See https://github.com/Unvanquished/Unvanquished/issues/1207 for the issue on the Unvanquished side.

After some optimizations (see pull request #389), running Unvanquished on some old hardware (AMD Athlon 64 3200+ single core with ATI Radeon X1950 PRO), I discovered that the MD5 model code is twice as fast for the exact same model (the IQM model being the MD5 one after conversion).

[Two pairs of screenshots: IQM model, then MD5 model.]

Note: the MD5 model code seems not to have a VBO-based fast path for supported hardware. Edit: it does, it just lives in a dedicated function named Tess_SurfaceVBOMD5Mesh.

illwieckz commented 4 years ago

I believe I found why the MD5 code is faster than the IQM code. After a huge rewrite to make both code paths look almost the same, I noticed this difference:

MD5 code:

            float *lastWeight = boneWeight + surfaceVertex->numWeights;

            for ( ; boneWeight < lastWeight; boneWeight++,
                boneIndex++ )
            {
                TransformPoint( &bones[ *boneIndex ], *surfacePosition, tmp );
                VectorMA( position, *boneWeight, tmp, position );

                TransformNormalVector( &bones[ *boneIndex ], *surfaceNormal, tmp );
                VectorMA( normal, *boneWeight, tmp, normal );

                TransformNormalVector( &bones[ *boneIndex ], *surfaceTangent, tmp );
                VectorMA( tangent, *boneWeight, tmp, tangent );

                TransformNormalVector( &bones[ *boneIndex ], *surfaceBinormal, tmp );
                VectorMA( binormal, *boneWeight, tmp, binormal );
            }

IQM code:

            byte *lastBlendIndex = modelBlendIndex + 4;

            for ( ; modelBlendIndex < lastBlendIndex; modelBlendIndex++,
                modelBlendWeight++ )
            {
                float weight = *modelBlendWeight * weightFactor;

                TransformPoint( &bones[ *modelBlendIndex ], modelPosition, tmp );
                VectorMA( position, weight, tmp, position );

                TransformNormalVector( &bones[ *modelBlendIndex ], modelNormal, tmp );
                VectorMA( normal, weight, tmp, normal );

                TransformNormalVector( &bones[ *modelBlendIndex ], modelTangent, tmp );
                VectorMA( tangent, weight, tmp, tangent );

                TransformNormalVector( &bones[ *modelBlendIndex ], modelBitangent, tmp );
                VectorMA( binormal, weight, tmp, binormal );
            }

You'll notice those loops do the same thing, except that in the MD5 case the number of iterations is variable (surfaceVertex->numWeights), while in the IQM case it is fixed (4).

I added some debug logging to get the value of surfaceVertex->numWeights and I saw this…

Warn: num weights: 2     
Warn: num weights: 1     
Warn: num weights: 1     
Warn: num weights: 1     
Warn: num weights: 1     
Warn: num weights: 1     
Warn: num weights: 2     
Warn: num weights: 1     
Warn: num weights: 2     
Warn: num weights: 2     
Warn: num weights: 2     
Warn: num weights: 1     
Warn: num weights: 1     
Warn: num weights: 1     
Warn: num weights: 1     
Warn: num weights: 1     
Warn: num weights: 1     
Warn: num weights: 1     
Warn: num weights: 1     
Warn: num weights: 1     
Warn: num weights: 1     
Warn: num weights: 1     
Warn: num weights: 1     
Warn: num weights: 2     
Warn: num weights: 2     
Warn: num weights: 2     
Warn: num weights: 2     
Warn: num weights: 2     
Warn: num weights: 2     
Warn: num weights: 2     
Warn: num weights: 2     
Warn: num weights: 2     
Warn: num weights: 3     
Warn: num weights: 2     
Warn: num weights: 3     
Warn: num weights: 2     
Warn: num weights: 3     
Warn: num weights: 1     
Warn: num weights: 1     
Warn: num weights: 1     
Warn: num weights: 1     
Warn: num weights: 2     
Warn: num weights: 2     
Warn: num weights: 2     
Warn: num weights: 2     
Warn: num weights: 2     
Warn: num weights: 1     
Warn: num weights: 2     
Warn: num weights: 2     
Warn: num weights: 3     
Warn: num weights: 1     
Warn: num weights: 1     
Warn: num weights: 1     
Warn: num weights: 1     
Warn: num weights: 1     
Warn: num weights: 1     
Warn: num weights: 2     
Warn: num weights: 2     
Warn: num weights: 2     
Warn: num weights: 2     
Warn: num weights: 1

In both the IQM and MD5 implementations, this loop lives inside another loop that iterates once per vertex. This model has 3863 vertices, so in the IQM case the inner loop runs 3863×4 = 15452 times. I added a counter to the MD5 code: with this model the inner loop runs 6294 times, a bit less than half as many… which can explain why the MD5 code is twice as fast.

illwieckz commented 4 years ago

Hi @lsalzman, I hope I won't bother you too much. Would you know any trick to help us recover MD5 performance with IQM?

Dæmon Engine is a free open source engine based on idTech3, written to run the free open source game Unvanquished but also meant to be reusable by other projects. I was looking for a way to optimize IQM model rendering on hardware where the maximum number of bones is too low to use the other, GPU-accelerated code path.

I discovered that the same model renders twice as fast when loaded as MD5 as when loaded as IQM. It looks like I found where the time is spent: with the MD5 model the weight data length seems to be dynamic, while with the IQM model it is fixed, so the code iterates a lot more. I'm looking for a way to properly skip some iterations, for example.

See the previous posts in this thread for details.

lsalzman commented 4 years ago

There is nothing inherent in IQM that should make it any harder to accelerate. For simplicity for GPU acceleration there are 4 weights per joint, but some of the weights can actually be 0 if they are unnecessary, so that you can easily compute a maximum number of influences per joint for an entire mesh.

illwieckz commented 4 years ago

Hi, thank you for your answer. Does that mean we can skip some computation anytime we see a weight being zero?

lsalzman commented 4 years ago

A 0 weight has no influence on the joint, because the influence is multiplied by 0.

On Mon, Oct 26, 2020 at 4:34 PM Thomas Debesse notifications@github.com wrote:

Hi, thank you for your answer. Does that mean we can skip some computation anytime we see a weight being zero?


illwieckz commented 4 years ago

OK, thanks for the confirmation. I tested it without noticing any visual defect. Unfortunately I only gained 2 fps when I expected to double the fps, but that also means we may look for some other algorithm that would precompute things to avoid branching within loops. Thanks a lot.

lsalzman commented 4 years ago

Preprocessing the model at load time rather than checking the number of influences at render time is preferred and a rather convenient time to do it. If you can scan at load that no joint has more than, say, 2 influences, then you can simply use a function that only always processes 2 influences for every joint.

On Mon, Oct 26, 2020 at 4:45 PM Thomas Debesse notifications@github.com wrote:

OK thanks for the confirmation. I tested it without noticing any visual defect. Unfortunately I only saved 2 fps when I expect to multiply fps by two, but that also means we may look for some other algorithm that would precompute things to prevent branching within loops. Thanks a lot.


illwieckz commented 4 years ago

@zturtleman did an awesome job on the ioquake3 side.

He said about rendering his own scene:

About 30% of ioquake3 CPU time is in RB_IQMSurfaceAnim()

That sounds very similar to what we get in our scene (I even get 50%)!

He also said after having done some improvements:

Vertex skinning for my turtle IQM use to take 3.6 times as long as MD3. Now it takes 1.6 times as long as MD3.

So, basically, it looks like he more than doubled the speed.

That's a huuuge amount of work. I don't know which parts are useful, but these are the commits he linked:

https://github.com/ioquake/ioq3/compare/39e2113c73b8...11337c9fa2fa
https://github.com/ioquake/ioq3/commit/1994801e1c17a2a7c50b833e9eab487af1637738
https://github.com/ioquake/ioq3/commit/c7ebe82131db2c94d01c87803df588b367cd29d3
https://github.com/ioquake/ioq3/commit/d404519cce565402aa98c3f9943221ed6ddb2790

It would be very good to see his improvements ported to Dæmon.

That said, I'm not sure I can do the port myself, so I'm looking for help.

Also, this is too much work so unless a wizard does it (hello @zturtleman :grin:), this will not be for 0.52.0.

If someone wants to pick up this task, note that there is a work-in-progress pull request that must be merged first: #389. Any work must be done on top of it, or merging would be hell.

DolceTriade commented 3 years ago

This is a good find, and we should try to port ztm's work if possible, or just precompute the weights during load as suggested earlier.

illwieckz commented 3 years ago

@lsalzman can we assume that if a weight is null, the following weights are null as well? I've added some debug prints to our code, and with all the models I tested I have never seen a non-null weight after a null weight.

lsalzman commented 3 years ago

The official IQM exporters do sort the weights as such

On Wed, Dec 9, 2020, 01:11 Thomas Debesse notifications@github.com wrote:

@lsalzman can we assume that if a weight is null, following weights are null as well? I've added some debug print to our code and with all the models I tested I never seen a weight being non-null after a null weight.


illwieckz commented 3 years ago

Thanks for your response. That's nice! So maybe we can double the performance just by precomputing the number of weights, iterating over the weights up to the first null one. =)

We would then consider IQM exporters that do not sort the weights broken and unsupported.

We use FTE's fork of your IQM exporter (the fork having working translation/rotation/scale on both model and skeleton), so we're safe.
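As a rough sketch of that precomputation, assuming four byte weights per vertex sorted with zeros last: the helpers `CountWeights` and `TotalIterations` below are hypothetical names, not existing Dæmon functions. The per-vertex count plays the same role as `numWeights` in the MD5 code.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Load-time pass: for each vertex, count the weights before the first zero.
// Assumes the exporter sorted the weights so that zeros come last.
inline std::vector<std::uint8_t> CountWeights(const std::vector<std::uint8_t>& blendWeights)
{
    std::vector<std::uint8_t> counts(blendWeights.size() / 4);

    for (std::size_t v = 0; v < counts.size(); v++) {
        std::uint8_t n = 0;

        while (n < 4 && blendWeights[4 * v + n] != 0) {
            n++;
        }

        counts[v] = n;
    }

    return counts;
}

// Number of inner-loop iterations the skinning code would run when it
// loops counts[v] times per vertex instead of always 4 times.
inline std::size_t TotalIterations(const std::vector<std::uint8_t>& counts)
{
    std::size_t total = 0;

    for (std::uint8_t n : counts) {
        total += n;
    }

    return total;
}
```

Comparing `TotalIterations` against `4 * counts.size()` for a given model gives the same kind of figure as the 6294 versus 15452 measurement earlier in the thread.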

illwieckz commented 1 year ago

One cause of this performance hit may be that the MD5 model code loads floats as floats and converts them to half floats when uploading to the GPU, while the IQM model code loads floats as half floats from the start.

It means the software implementation for MD5 models processes floats all the way down, while the software implementation for IQM models has to convert from half float to float, then from float back to half float, every time it does a float computation.

A solution would be to make the IQM code convert half floats to floats at model loading time, so that the software implementation processes floats all the way down, then convert them back to half floats when uploading to the GPU.
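A minimal sketch of such a one-time conversion, assuming the half floats are stored as IEEE 754 binary16 values in `uint16_t`: `HalfToFloat` and `ConvertHalfArray` are hypothetical helpers written for this example, not the engine's own conversion routines.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Converts one IEEE 754 binary16 value to a 32-bit float.
inline float HalfToFloat(std::uint16_t h)
{
    const std::uint32_t sign = static_cast<std::uint32_t>(h & 0x8000u) << 16;
    const std::uint32_t exponent = (h >> 10) & 0x1Fu;
    std::uint32_t mantissa = h & 0x3FFu;
    std::uint32_t bits;

    if (exponent == 0) {
        if (mantissa == 0) {
            bits = sign; // signed zero
        } else {
            // Subnormal half: shift the mantissa until its implicit bit appears.
            int shift = 0;
            while ((mantissa & 0x400u) == 0) {
                mantissa <<= 1;
                shift++;
            }
            mantissa &= 0x3FFu;
            bits = sign | (static_cast<std::uint32_t>(113 - shift) << 23) | (mantissa << 13);
        }
    } else if (exponent == 31) {
        bits = sign | 0x7F800000u | (mantissa << 13); // infinity or NaN
    } else {
        // Normal number: rebias the exponent from 15 to 127.
        bits = sign | ((exponent + 112u) << 23) | (mantissa << 13);
    }

    float result;
    std::memcpy(&result, &bits, sizeof(result));
    return result;
}

// Load-time pass: convert a whole attribute array once, so that the
// software skinning loop only ever touches 32-bit floats.
inline std::vector<float> ConvertHalfArray(const std::vector<std::uint16_t>& halves)
{
    std::vector<float> floats(halves.size());

    for (std::size_t i = 0; i < halves.size(); i++) {
        floats[i] = HalfToFloat(halves[i]);
    }

    return floats;
}
```

The cost is paid once per model at load time and in extra memory for the float copy, instead of on every vertex of every frame.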

slipher commented 2 weeks ago

Well, we got rid of half floats. Did that help?

DaemonEngine / Daemon

Non-VBO IQM model code is twice slower than MD5 model code #390