Performance issue vs Cocos2d-x v3.17

Yehsam23 commented 1 year ago

Download Test Project: TestClip.zip

The following tests were conducted using an Android HTC U11 Plus with the build release APK.

When using 200 ClippingNode and Sprite, there is a significant difference in performance between Axmol and Cocos2d-x v3.17. Axmol's FPS is approximately 9-10, while Cocos2d-x v3.17's FPS is approximately 37-40.

Axmol: ax_clip

Cocos2d-x v3.17: v3_clip

When using 4000 Sprites with SpriteA->SpriteB->SpriteA->SpriteB repeating pattern, Axmol's FPS is approximately 28-30, while Cocos2d-x v3.17's FPS is approximately 38-39.

Axmol: ax_sprite

Cocos2d-x v3.17: v3_sprite

As can be seen from the above, there is a significant performance difference when using ClippingNode. According to experiment #1094, it is indeed a problem with Cocos2d-x v4.

Cocos2d-x v4 also had a similar issue initially. issue

I know that Axmol is a fork from Cocos2d-x-4.0. While it may be unfair to compare its performance to Cocos2d-x v3.17, I also hope to find ways to improve its performance.

aismann commented 1 year ago

@Yehsam23 Please also add an "official cpp-test" to the clipping section. It would help to detect issues caused by future changes, too.

Yehsam23 commented 1 year ago

@aismann Do you mean submitting a Pull Request directly to "cpp-tests" to add a clipping node performance test?

aismann commented 1 year ago

@aismann Do you mean submitting a Pull Request directly to "cpp-tests" to add a clipping node performance test?

Yes, that's what I mean. (Your test scenario will then be part of the regular tests.)

aismann commented 1 year ago

This is on my machine (win11, 64b, VS22):

rh101 commented 1 year ago

This is on my machine (win11, 64b, VS22):

The result is only meaningful when compared to the performance of Cocos2d-x v3.17, as it is the difference in performance of the OpenGL renderer in Cocos2d-x v4/Axmol that is the issue. Do you have the result of this test on Cocos2d-x v3.17 using the exact same setup to compare?

On a mobile device, using the renderer in Cocos2d-x/Axmol, there is a limit on how many draw calls would be reasonable, and that limit is largely dependent on what the app is rendering. Anything over a few hundred draw calls cannot be expected to perform well using the current rendering implementation, but the important question here is why there is such a big difference between Cocos2d-x v3.17 and Cocos2d-x v4/Axmol when using OpenGL.

Yehsam23 commented 1 year ago

Here is the performance test result for 1 frame, using 200 ClippingNode and Sprite each. It seems that the biggest performance difference occurs in GlBufferData(). However, I am not sure what is happening and currently unable to locate the issue.

PS: I forced Cocos2d-x v3.17 to use VBO instead of VAO because Axmol did not use VAO.

Axmol: capture 2023-03-16 4:22:07 PM

Cocos2d-x v3.17: capture 2023-03-16 4:27:44 PM

rh101 commented 1 year ago

It seems that the biggest performance difference occurs in GlBufferData().

Do you notice that the renderer in Axmol is also doing that twice, taking twice as much time, because of the two calls to flush()? I wonder if that's required, or it's a mistake. If one of those flush calls is eliminated, then that alone would be a significant performance boost.

Yehsam23 commented 1 year ago

Do you notice that the renderer in Axmol is also doing that twice, taking twice as much time, because of the two calls to flush()? I wonder if that's required, or it's a mistake. If one of those flush calls is eliminated, then that alone would be a significant performance boost.

Because Cocos2d-x v3 does not have a standalone processGroupCommand function, and instead directly uses visitRenderQueue, they are counted together, and in fact, the number of operations is the same: one for Stencil and one for Sprite.

rh101 commented 1 year ago

Because Cocos2d-x v3 does not have a standalone processGroupCommand function, and instead directly uses visitRenderQueue, they are counted together, and in fact, the number of operations is the same: one for Stencil and one for Sprite.

Fair enough, so in this case Axmol is taking ~118ms versus ~25ms for Cocos2dx v3.17 for the GlBufferData() in this test, which is a massive difference.

rh101 commented 1 year ago

@Yehsam23 If you have time, I'm curious to know what performance you get out of using round-robin buffering on your Android device. The changes are to the BufferGL.cpp and BufferGL.h files in this zip (which go into the path core/renderer/backend/opengl/). You can simply overwrite your files with these, with no other changes required.

BufferGL.zip

On my Android device it went from ~8 FPS to ~27 FPS for the ClippingNode test when using 5 buffers, and any more than that didn't seem to make a difference.

The SpriteA/SpriteB test FPS didn't change, but then again that specific test, with 4000 draw calls, is just not realistic, and the difference in FPS between Cocos2d-x v3.17 and V4/Axmol is minor.
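For readers unfamiliar with the idea, the round-robin buffering described above can be sketched roughly as follows. This is only an illustration, not the contents of BufferGL.zip: `FakeGpuBuffer` and `RoundRobinBuffer` are hypothetical names, and plain CPU memory stands in for the GL buffer objects so that the sketch runs without an OpenGL context.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a GPU buffer object: plain CPU memory, so the
// sketch does not need an OpenGL context.
struct FakeGpuBuffer
{
    std::vector<std::uint8_t> storage;
};

// Round-robin wrapper (hypothetical, not Axmol's BufferGL): every update
// moves on to the next slot, so the slot used by the previous update is not
// overwritten while the GPU may still be reading from it.
class RoundRobinBuffer
{
public:
    RoundRobinBuffer(std::size_t slotCount, std::size_t bytesPerSlot) : _slots(slotCount)
    {
        assert(slotCount > 0);
        for (auto& slot : _slots)
            slot.storage.resize(bytesPerSlot);
    }

    // Advance to the next slot, then copy the new data into it.
    void updateData(const void* data, std::size_t size)
    {
        _current  = (_current + 1) % _slots.size();
        auto& dst = _slots[_current].storage;
        assert(size <= dst.size());
        const auto* src = static_cast<const std::uint8_t*>(data);
        std::copy(src, src + size, dst.begin());
    }

    // Slot that the next draw call should bind.
    std::size_t currentIndex() const { return _current; }
    const FakeGpuBuffer& current() const { return _slots[_current]; }

private:
    std::vector<FakeGpuBuffer> _slots;
    std::size_t _current = 0;
};
```

With `RoundRobinBuffer ring(5, bytes)`, five consecutive updates each land in a different slot before the first one is reused, which matches the 5-buffer configuration reported above.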

Yehsam23 commented 1 year ago

@rh101 I have started testing my project and it seems to be working normally with a significant improvement in performance. Today, I will try various methods to see if there are any issues.

The changes I made are similar to the Cocos2d-x issue mentioned earlier. I tried modifying "TriangleCommandBufferManager::createBuffer()" to not update the data at the beginning, but it resulted in a crash.

BTW, thank you for your help.

rh101 commented 1 year ago

I have started testing my project and it seems to be working normally with a significant improvement in performance. Today, I will try various methods to see if there are any issues.

I'm not actually sure if it's the correct way to improve the performance, and there may be other things we can do to also improve performance of the renderer. If something like this is used as part of the improvements, then it may be best to limit any such modification to Android, and also make it configurable. Using extra buffers does take up more memory, so leaving it up to the developer to configure whether they need the extra buffers may be the preferred option.
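One way such a setting could be made configurable and Android-only is a compile-time default that the developer can override. The sketch below is an assumption for illustration: `EXAMPLE_GL_BUFFER_COUNT` and `kGlBufferCount` are hypothetical names and are not existing Axmol options; the value 5 is simply the count that worked in the test above.

```cpp
// Hypothetical override hook; this macro is NOT an existing Axmol setting.
// A project could define it in its build flags to choose its own count.
#ifndef EXAMPLE_GL_BUFFER_COUNT
#    if defined(__ANDROID__)
#        define EXAMPLE_GL_BUFFER_COUNT 5  // extra buffers only where they helped
#    else
#        define EXAMPLE_GL_BUFFER_COUNT 1  // single buffer elsewhere
#    endif
#endif

// Number of buffers a round-robin implementation would allocate.
constexpr int kGlBufferCount = EXAMPLE_GL_BUFFER_COUNT;
static_assert(kGlBufferCount >= 1, "at least one buffer is required");
```

Keeping the default at 1 outside Android means other platforms pay no extra memory unless the developer opts in.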

BTW, thank you for your help.

It's no problem at all. Any improvements we can make to this game engine will benefit everyone using it. For me personally, I don't mind investigating issues if and when time permits, and more often than not, it's a learning experience.

Yehsam23 commented 1 year ago

I found something interesting. I set the BufferCount generated by vertexBuffer in CCRenderer to 1 and set the BufferCount generated by indexBuffer to 10. The performance was great. Perhaps the real reason is that something happened with indexBuffer, and generating multiple Buffers happened to solve it.

So far, the @rh101 method that I have been testing has not revealed any issues.

rh101 commented 1 year ago

That is interesting. I assume that you've just added an extra parameter to the Device::newBuffer(), DeviceGL::newBuffer and related methods for the buffer count. I'll do more testing with such a change as well.

The only thing I was concerned about is the BufferGL::getHandler() method, which returns the currently used buffer. As long as it's called within the same cycle as the currently used buffer, but before the buffer index changes (which I assume is the case, but I'll double check), then all is well.

rh101 commented 1 year ago

Just noticed something else. With a single buffer for index and a single for vertex (default Axmol implementation), the graphics memory usage is all over the place, as can be seen here, varying from ~70MB to 135MB or so (at 8-9FPS):

When using an index buffer count of 10, and a vertex buffer count of 1, the graphics memory usage is very stable (at 31-33FPS):

These tests were carried out on a Google Pixel 4a, using the code supplied in the first post (200 clipping nodes).

aismann commented 1 year ago

@rh101 Will these changes also have an effect on desktop?

rh101 commented 1 year ago

Will these changes also have an effect on desktop?

I have noticed absolutely no difference in performance with or without the change on desktop, so if we do use these modifications, it should be limited to Android (or OpenGL ES).

Also, on desktop, 200 clipping nodes don't cause any performance issues on my specific Windows PC, and I only managed to get the FPS to drop when I hit around 450+ clipping zones (still over 50 FPS @ 1350 draw calls). I cannot imagine any good reason for someone to use that many clipping zones at all.

aismann commented 1 year ago

The comparison should be cocos2dx 3.17 vs axmol on desktop.

rh101 commented 1 year ago

The comparison should be cocos2dx 3.17 vs axmol on desktop.

On a Windows 10 PC:

- At 400 clipping nodes (1200 draw calls): Axmol ~59 FPS, Cocos2dx v3.17.2 ~59 FPS.
- At 450 clipping nodes (1350 draw calls): Axmol ~50 FPS, Cocos2dx v3.17.2 ~58 FPS.
- At 600 clipping nodes (1800 draw calls): Axmol ~40 FPS, Cocos2dx v3.17.2 ~48 FPS.

The Axmol tests, with or without the modifications listed earlier, give the same results when run on Windows 10. The difference in FPS may not be solely related to the renderer.

EDIT: Just in case anyone gets the wrong idea by looking at those figures, tests like this are not realistic, especially with the number of draw calls caused by these tests.

aismann commented 1 year ago

EDIT: Just in case anyone gets the wrong idea by looking at those figures, tests like this are not realistic, especially with the number of draw calls caused by these tests.

Thanks for the verification. Right! But an abstract stress test is also a nice test scenario ;)

aismann commented 1 year ago

Another idea to find some improvements:

Comparing the call stack between Axmol on Android and on desktop. Maybe there is something to check?

rh101 commented 1 year ago

Comparing the call stack between Axmol on Android and on desktop. Maybe there is something to check?

It's OpenGL vs OpenGL ES, so there are different code paths due to different functionality, along with completely different GPUs (which are way more powerful on desktop), and that alone would be a reason the desktop rendering has better performance. Also, AX_ENABLE_CACHE_TEXTURE_DATA is set to true for Android, but false for other platforms, which is extra code that is running for Android, but if I recall correctly, being enabled didn't impact the performance too much.

rh101 commented 1 year ago

After modifying the code to only change the round-robin buffer index on each frame change (on Renderer::beginFrame() etc.), the frame rate improvement was the same as changing the index every time BufferGL::updateData() is called (as in my original test code).

Since 10 index buffers seem to improve this specific test case (200 clipping nodes), then the GPU is lagging behind by up to 10 frames, and this introduces other issues, such as input lag, and also the fact that 10x the memory is being allocated (10 x 192KB for each index buffer). Depending on the type of app/game being created, you would have to factor in the impact this will have.
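The memory figure above can be checked with a little arithmetic. The sketch below assumes that the index buffer holds `VBO_SIZE * 6 / 4` 16-bit indices with `VBO_SIZE = 65536`; the 65536 value appears in the Cocos2d-x comment quoted in this thread, while the 6/4 ratio is an assumption that happens to reproduce the 192KB figure. All `k...` names are illustrative only.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed sizes (see the lead-in): 65536 vertices, 6 indices per 4-vertex quad.
constexpr std::size_t kVboSize      = 65536;
constexpr std::size_t kIndexVboSize = kVboSize * 6 / 4;

// Bytes needed by one index buffer of 16-bit indices.
constexpr std::size_t kIndexBytesPerBuffer = kIndexVboSize * sizeof(std::uint16_t);

// Total bytes when `bufferCount` index buffers are allocated.
constexpr std::size_t totalIndexBytes(std::size_t bufferCount)
{
    return bufferCount * kIndexBytesPerBuffer;
}

static_assert(kIndexBytesPerBuffer == 192 * 1024, "one index buffer is 192KB");
static_assert(totalIndexBytes(10) == 1920 * 1024, "ten index buffers are 1920KB");
```

So ten index buffers cost 1920KB (just under 1.9MB) instead of 192KB, which is the 10x increase mentioned above.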

The Apple Metal implementation uses 3 buffers (is it correct to call that triple-buffering?), and somehow manages to get better FPS, but I don't know if the Metal method of rendering is the same as what is used in OpenGL ES, and if the Apple GPU or Metal drivers are just better.

Yehsam23 commented 1 year ago

While testing Cocos2d-x v3.17, I found that after the following changes, the FPS drops suddenly, just like the current Axmol.

Force the use of VBO only, because Axmol currently only uses VBO on OpenGL ES:
```
bool Configuration::supportsShareableVAO() const
{
    // Force the VBO-only path.
    return false;
    // ... rest of the original function body (now unreachable) ...
}
```

Change Renderer::setupVBO() in CCRenderer.cpp from:

```
void Renderer::setupVBO()
{
    glGenBuffers(2, &_buffersVBO[0]);
    // Issue #15652
    // Should not initialize VBO with a large size (VBO_SIZE=65536),
    // it may cause low FPS on some Android devices like LG G4 & Nexus 5X.
    // It's probably because some implementations of OpenGLES driver will
    // copy the whole memory of VBO which initialized at the first time
    // once glBufferData/glBufferSubData is invoked.
    // For more discussion, please refer to https://github.com/cocos2d/cocos2d-x/issues/15652
    // mapBuffers();
}
```

To

```
void Renderer::setupVBO()
{
    glGenBuffers(2, &_buffersVBO[0]);
    // Issue #15652
    // Should not initialize VBO with a large size (VBO_SIZE=65536),
    // it may cause low FPS on some Android devices like LG G4 & Nexus 5X.
    // It's probably because some implementations of OpenGLES driver will
    // copy the whole memory of VBO which initialized at the first time
    // once glBufferData/glBufferSubData is invoked.
    // For more discussion, please refer to https://github.com/cocos2d/cocos2d-x/issues/15652
    mapBuffers();
}
```

As mentioned earlier, Cocos2d-x also encountered performance issues on Android. Perhaps the real problem lies here, and it may not be necessary to solve it through 10 index buffers.

Update: I modified the CCRenderer's Renderer::TriangleCommandBufferManager::createBuffer() in Axmol and found that the FPS increased significantly.

```
    // Initialise the vertex buffer with a single element instead of its full size.
    auto allocSize = sizeof(V3F_C4B_T2F);
    auto tmpData   = malloc(allocSize);
    if (!tmpData)
        return;
    memset(tmpData, 0, allocSize);

    auto vertexBuffer = device->newBuffer(Renderer::VBO_SIZE * sizeof(V3F_C4B_T2F), backend::BufferType::VERTEX,
                                          backend::BufferUsage::DYNAMIC);
    if (!vertexBuffer)
    {
        free(tmpData);
        return;
    }
    vertexBuffer->updateData(tmpData, allocSize);

    free(tmpData);

    // Same for the index buffer: upload a single index only.
    allocSize = sizeof(unsigned short);
    tmpData   = malloc(allocSize);
    if (!tmpData)
    {
        vertexBuffer->release();
        return;
    }
    memset(tmpData, 0, allocSize);

    auto indexBuffer = device->newBuffer(Renderer::INDEX_VBO_SIZE * sizeof(unsigned short), backend::BufferType::INDEX,
                                         backend::BufferUsage::DYNAMIC);
    if (!indexBuffer)
    {
        free(tmpData);
        vertexBuffer->release();
        return;
    }
    indexBuffer->updateData(tmpData, allocSize);

    free(tmpData);
```

rh101 commented 1 year ago

I modified the CCRenderer's Renderer::TriangleCommandBufferManager::createBuffer() in Axmol and found that the FPS increased significantly.

Can verify that this modification gives over 3.5x performance increase on Android. Initialising with only a small chunk of memory seems to do the trick.

If there are no concerns put forward regarding this change, and if there are no issues with the current cpp-tests, then please consider creating a PR with just this modification. There is no need for the other changes related to increasing the buffer count.
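Why a small initial upload might help can be illustrated with a toy model of the hypothesis stated in the Cocos2d-x comment quoted earlier (that some OpenGL ES drivers copy the whole initially uploaded range on every later update). This is not a measurement and not how any particular driver is known to work; `ToyDriverBuffer` and `simulate` are hypothetical names used only for this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Toy model only: assumes the driver re-copies at least the range that was
// initialised by the first upload whenever the buffer is updated.
class ToyDriverBuffer
{
public:
    // First upload: remember how many bytes were initialised.
    void initialise(std::size_t bytes) { _initialisedBytes = bytes; }

    // Later upload: the modelled driver copies the larger of the new data
    // size and the initially uploaded size.
    void update(std::size_t bytes) { _bytesCopied += std::max(bytes, _initialisedBytes); }

    std::size_t bytesCopied() const { return _bytesCopied; }

private:
    std::size_t _initialisedBytes = 0;
    std::size_t _bytesCopied      = 0;
};

// Total bytes the toy driver copies for `updates` uploads of `updateBytes`
// each, after an initial upload of `initBytes`.
inline std::size_t simulate(std::size_t initBytes, std::size_t updateBytes, int updates)
{
    assert(updates >= 0);
    ToyDriverBuffer buffer;
    buffer.initialise(initBytes);
    for (int i = 0; i < updates; ++i)
        buffer.update(updateBytes);
    return buffer.bytesCopied();
}
```

Under this model, `simulate(65536 * 24, 96, 400)` copies the full 1.5MB range 400 times, while `simulate(24, 96, 400)` copies only 96 bytes per update. If the hypothesis holds on a given device, that difference would be consistent with the speed-up observed here.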

Yehsam23 commented 1 year ago

I ran the Android "cpp-tests" for a while and haven't found any issues so far. However, I couldn't test it on Windows because I don't have a PC.

aismann commented 1 year ago

However, I couldn't test it on Windows because I don't have a PC.

@rh101 said that it has no effect on desktops (Windows 10/11).

rh101 commented 1 year ago

@rh101 said that it has no effect on desktops (Windows 10/11).

Oh Yehsam is referring to the new changes, not the ones I initially suggested. So far the new change to Renderer::TriangleCommandBufferManager::createBuffer() has been working in my main project on Windows 10, but I'll go through cpp-tests as well to check for any issues.

In the meantime, @Yehsam23 would you please create the PR for the TriangleCommandBufferManager::createBuffer() change?

EDIT: cpp-tests on Windows 10 runs fine with the changes to TriangleCommandBufferManager::createBuffer().

aismann commented 1 year ago

Great work. Thanks!

axmolengine / axmol

Performance issue vs Cocos2d-x v3.17 #1121