[Improvement] Stress test report on gravitino fileset API

bknbkn commented 8 months ago

What would you like to be improved?

Tests show that the list API on the main branch performs poorly.

The test environment contains 2 metalakes, 1 catalog, 10 schemas, and 100 filesets.

The server has 1 CPU and 4 GB of memory.

The Jetty configuration is:

gravitino.server.webserver.maxThreads = 200 (The max thread size of the built-in web server)
gravitino.server.webserver.stopTimeout = 30000 (The stop timeout of the built-in web server)
gravitino.server.webserver.threadPoolWorkQueueSize = 2000

We tested with 500 clients. Here are some of the results.

The response time of list requests is very long.

The TPS is not high (about 100+).

At the same time, the CPU on the server side is not fully used.
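As a rough sanity check on these figures, Little's law relates concurrency, throughput, and response time in a closed-loop test. The sketch below assumes each of the 500 clients sends its next request as soon as the previous one returns (zero think time, which the report does not state). The function name is illustrative only.

```python
def mean_response_time_s(concurrent_clients: int, throughput_tps: float,
                         think_time_s: float = 0.0) -> float:
    """Little's law for a closed-loop load test: N = X * (R + Z), so R = N / X - Z."""
    if throughput_tps <= 0:
        raise ValueError("throughput must be positive")
    return concurrent_clients / throughput_tps - think_time_s


# Figures from the report: 500 clients, roughly 100 TPS.
# The zero think time is an assumption.
estimate = mean_response_time_s(500, 100.0)
print(f"estimated mean response time: {estimate:.1f} s")  # 5.0 s
```

Under that assumption, response times of several seconds follow directly from the low throughput. Since the CPU is not saturated, the requests are probably waiting on something rather than computing.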

How should we improve?

I wonder whether list requests can be optimized further.

Another observation is that read requests of the same type seem to block each other. For example, while listCatalog is under test, other listCatalog requests respond significantly slower, but listMetalake responses are not affected. I think this mutual exclusion between read requests is unreasonable. Maybe there is a lock shared between read requests.

YxAc commented 8 months ago

@bknbkn Could you share detailed test documentation in a Google Doc? Thanks.

bknbkn commented 8 months ago

OK, I will provide more test cases and test details in a Google Doc later.

bknbkn commented 8 months ago

@YxAc @yuqi1129 More test details are being updated. In the meantime, I found that the list API takes a write lock, which is likely the cause of the slowness. However, the list API is used very frequently in practice (especially through the Web UI, where a list call is often the step before a get). This write lock seems to be a deliberate design choice. I would like to know the reason for the original design and whether anything can be improved.
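To illustrate why an exclusive lock on list would serialize concurrent list requests, here is a minimal sketch. It is not Gravitino's actual lock implementation: the `ReadWriteLock` class, `peak_concurrency`, and the sleep that stands in for the metadata lookup are all hypothetical. It runs several simulated list requests and reports how many were inside the lock at the same time.

```python
import threading
import time


class ReadWriteLock:
    """Minimal illustrative read-write lock (no fairness, not reentrant)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_read(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            while self._writer or self._readers > 0:
                self._cond.wait()
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()


def peak_concurrency(use_write_lock, n_requests=4, hold_s=0.05):
    """Run simulated list requests; return the max number inside the lock at once."""
    lock = ReadWriteLock()
    guard = threading.Lock()  # protects the counters below
    state = {"active": 0, "peak": 0}

    def list_request():
        if use_write_lock:
            lock.acquire_write()
        else:
            lock.acquire_read()
        try:
            with guard:
                state["active"] += 1
                state["peak"] = max(state["peak"], state["active"])
            time.sleep(hold_s)  # stands in for the metadata lookup (hypothetical)
            with guard:
                state["active"] -= 1
        finally:
            if use_write_lock:
                lock.release_write()
            else:
                lock.release_read()

    threads = [threading.Thread(target=list_request) for _ in range(n_requests)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["peak"]


print("write lock peak:", peak_concurrency(use_write_lock=True))  # always 1
print("read lock peak:", peak_concurrency(use_write_lock=False))
```

With the write lock, only one request is ever inside the critical section, so extra server threads cannot raise throughput. With the read lock, the requests overlap.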


yuqi1129 commented 8 months ago

@bknbkn Thanks for your hard work on this. The write lock is indeed deliberate; please see https://github.com/datastrato/gravitino/pull/2260#discussion_r1495156151. In our initial design we assumed that list operations are far less frequent than get and load operations in real scenarios, so I chose the second solution in that discussion. I'm also working on improving the overall performance of the Gravitino server, so we should find a compromise here, which needs more test cases and real scenarios. Can you share your real-world environment, such as the percentage of read/write/list operations?

bknbkn commented 8 months ago

@yuqi1129 Thanks for the reply. Here is a reference scenario. When I use the Web UI, the whole flow is like this:

First, it lists all metalakes (using the list metalakes API).

Then I choose one metalake, and it calls the get metalake API.

Then list catalogs is called.

After the above operations, I use get catalog to obtain a specific catalog.

So in this scenario, the list-to-get ratio is 1:1.
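The ratio can be tallied directly from the flow above. This is a small sketch: the operation names are taken from the description, while the `operation_mix` helper is illustrative.

```python
from collections import Counter

# Operation sequence of one Web UI navigation, as described above.
ui_flow = ["list metalakes", "get metalake", "list catalogs", "get catalog"]


def operation_mix(operations):
    """Count operations by their verb (the first word: 'list' or 'get')."""
    return Counter(op.split()[0] for op in operations)


mix = operation_mix(ui_flow)
print(dict(mix))  # {'list': 2, 'get': 2}
```

Half of the requests in this flow are list calls, so any exclusive locking on list would affect half of the UI traffic.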

bknbkn commented 8 months ago

@YxAc @yuqi1129 I have added preliminary test results to a Google Doc; please take a look: https://docs.google.com/document/d/1GPynb8SbxcIU2s6h3W-mPqeRVqmv0ac9_rBJm63awqs/edit

apache / gravitino

[Improvement] Stress test report on gravitino fileset API #2647
