apache / gravitino

World's most powerful open data catalog for building a high-performance, geo-distributed and federated metadata lake.
https://gravitino.apache.org
Apache License 2.0

[Improvement] Stress test report on gravitino fileset API #2647

Closed bknbkn closed 7 months ago

bknbkn commented 8 months ago

What would you like to be improved?

Tests show that the list API on the main branch performs poorly.

The test environment has 2 metalakes, 1 catalog, 10 schemas, and 100 filesets.

The server has 1 CPU and 4 GB of memory.

The Jetty config is:

We used 500 clients for the test. Here is some of the result data:

image
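The issue does not say which load tool produced these numbers. As a rough illustration only, a closed-loop test of this shape (a fixed number of clients, each sending requests back to back) could be sketched as follows. `run_load` and `fake_list_filesets` are hypothetical names, and the fake request sleeps instead of calling a real Gravitino endpoint.

```python
import math
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def run_load(request_fn, clients, requests_per_client):
    """Closed-loop load test: each client sends its requests back to back."""

    def client_loop(client_id):
        latencies = []
        for _ in range(requests_per_client):
            start = time.perf_counter()
            request_fn()
            latencies.append(time.perf_counter() - start)
        return latencies

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=clients) as pool:
        per_client = list(pool.map(client_loop, range(clients)))
    wall = time.perf_counter() - wall_start

    latencies = sorted(t for client in per_client for t in client)
    # Nearest-rank 99th percentile.
    p99_index = math.ceil(0.99 * len(latencies)) - 1
    return {
        "requests": len(latencies),
        "tps": len(latencies) / wall,
        "mean_s": statistics.mean(latencies),
        "p99_s": latencies[p99_index],
    }


def fake_list_filesets():
    # Hypothetical stand-in for an HTTP call to the list endpoint.
    time.sleep(0.01)


if __name__ == "__main__":
    print(run_load(fake_list_filesets, clients=20, requests_per_client=5))
```

To test a real server, `request_fn` would be replaced by a function that issues the actual list request.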

The response time of list requests is very long:

image

And the TPS is not high (about 100+):

image

At the same time, the CPU on the server side is not fully used:

image
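The reported numbers can be cross-checked with Little's law for a closed system, N = X * (R + Z), where N is the number of clients, X the throughput, R the mean response time, and Z the client think time. The sketch below assumes the 500 clients send requests back to back (Z = 0), which the issue does not state, and uses the "about 100" TPS figure.

```python
def mean_response_time_s(clients, throughput_per_s, think_time_s=0.0):
    """Little's law for a closed system: N = X * (R + Z), so R = N / X - Z."""
    return clients / throughput_per_s - think_time_s


# 500 clients at roughly 100 requests per second, assuming no think time.
print(mean_response_time_s(clients=500, throughput_per_s=100))  # 5.0
```

Under those assumptions each request waits about 5 seconds on average. Combined with a CPU that is not saturated, that is consistent with requests queueing on something other than CPU.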

How should we improve?

I wonder whether list requests can be optimized further.

Another observation is that read requests of the same type seem to block each other. For example, while listCatalog is under test, other listCatalog requests respond significantly more slowly, but listMetalake responses are not affected. I think this mutual exclusion between read requests is unreasonable; maybe there is a lock shared between read requests.

YxAc commented 8 months ago

@bknbkn Could you share detailed test documentation in a Google Doc? Thanks.

bknbkn commented 8 months ago

OK, I will share more test cases and test details in a Google Doc later.

bknbkn commented 8 months ago

@YxAc @yuqi1129 More test details are being updated. In the meantime, I found that the list API takes a write lock, which is probably the cause of the slowness. However, the list API is used very frequently in practice (especially in the Web UI, where a list is often the step before a get). This write lock seems to be deliberate. I would like to know the reason for the original design and whether anything can be improved.

image
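To illustrate why an exclusive (write) lock on a read path serializes requests, here is a minimal sketch that runs simulated list calls under a write lock and then under a shared read lock, and records how many calls overlap. This is not Gravitino's locking code: `ReadWriteLock` and `peak_concurrency` are hypothetical helpers, and the sleep stands in for the real work of a list call.

```python
import threading
import time


class ReadWriteLock:
    """Minimal readers-writer lock (no fairness or writer preference)."""

    def __init__(self):
        self._readers = 0
        self._readers_guard = threading.Lock()  # protects the reader count
        self._resource = threading.Lock()       # held by a writer or by the reader group

    def acquire_read(self):
        with self._readers_guard:
            self._readers += 1
            if self._readers == 1:
                self._resource.acquire()  # first reader blocks writers

    def release_read(self):
        with self._readers_guard:
            self._readers -= 1
            if self._readers == 0:
                self._resource.release()  # last reader lets writers in

    def acquire_write(self):
        self._resource.acquire()

    def release_write(self):
        self._resource.release()


def peak_concurrency(acquire, release, requests=8, work_s=0.05):
    """Run simulated list calls in threads; return the most that overlapped."""
    in_flight = 0
    peak = 0
    counter_guard = threading.Lock()

    def list_call():
        nonlocal in_flight, peak
        acquire()
        try:
            with counter_guard:
                in_flight += 1
                peak = max(peak, in_flight)
            time.sleep(work_s)  # stands in for reading from the store
            with counter_guard:
                in_flight -= 1
        finally:
            release()

    threads = [threading.Thread(target=list_call) for _ in range(requests)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return peak


lock = ReadWriteLock()
print("list under write lock:", peak_concurrency(lock.acquire_write, lock.release_write))
print("list under read lock: ", peak_concurrency(lock.acquire_read, lock.release_read))
```

Under the write lock the peak is always 1, so total time grows linearly with the number of requests. Under the read lock the calls can overlap, which is the behavior one would normally expect from read-only operations.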

yuqi1129 commented 8 months ago

@bknbkn Thanks for your hard work on this. The write lock is indeed deliberate; please see https://github.com/datastrato/gravitino/pull/2260#discussion_r1495156151. In our initial design, list operations were assumed to be far less frequent than get and load operations in real scenarios, so I chose the second solution in https://github.com/datastrato/gravitino/pull/2260#discussion_r1495156151. I'm also working on improving the overall performance of the Gravitino server, so we should find a compromise here, which needs more test cases and real scenarios. Can you share details of your real-world environment, such as the percentage of Read/Write/List operations?

bknbkn commented 8 months ago

@yuqi1129 Thanks for the reply. Here is a reference scenario. When I use the Web UI, the flow looks like this:

First, it lists all metalakes (using the list metalakes API):

image

Then I choose one metalake, which calls the get metalake API:

image

Then the list catalogs API is called:

image

After the above operations, I use get catalog to obtain a specific catalog.

So in this scenario, the ratio of list to get operations is 1:1.
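The operation mix for this flow can be computed mechanically, which may help when reporting the Read/Write/List percentages for larger traces. The operation labels below are hypothetical names for the four UI steps described above, not Gravitino API identifiers.

```python
from collections import Counter

# Operations issued by the Web UI flow described above, in order.
ui_flow = ["list_metalakes", "get_metalake", "list_catalogs", "get_catalog"]


def operation_mix(operations):
    """Return the share of each operation kind (the prefix before '_')."""
    kinds = Counter(op.split("_", 1)[0] for op in operations)
    total = sum(kinds.values())
    return {kind: count / total for kind, count in kinds.items()}


print(operation_mix(ui_flow))  # {'list': 0.5, 'get': 0.5}
```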

bknbkn commented 8 months ago

@YxAc @yuqi1129 I have added preliminary test results to the Google Doc; you can take a look: https://docs.google.com/document/d/1GPynb8SbxcIU2s6h3W-mPqeRVqmv0ac9_rBJm63awqs/edit