Closed bknbkn closed 7 months ago
@bknbkn Could you share detailed test documentation in a Google Doc? Thanks.
OK, I will provide more test cases and test details in a Google Doc later.
@YxAc @yuqi1129
More test details are being added. In the meantime, I found that the list API takes a write lock, which is probably the cause of the slow responses. However, the list API is called very frequently in practice (especially from the web UI, where a list is often the step that precedes a get). This write lock seems to be a deliberate design choice. I would like to understand the reasoning behind the original design and whether anything can be improved.
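To make the concern concrete, here is a minimal sketch of the difference between a read lock and a write lock, using the JDK's `ReentrantReadWriteLock`. It is not Gravitino's code: the class `ListLockSketch` and the helper `canOverlap` are hypothetical names, and the callers stand in for concurrent list requests. The sketch only checks whether several callers can be inside the locked section at the same time.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical demo class; it does not mirror any Gravitino class.
public class ListLockSketch {

    // Returns true if `threads` callers managed to hold `lock` at the same time.
    public static boolean canOverlap(Lock lock, int threads) throws InterruptedException {
        // The barrier only trips if all callers are inside the locked section together.
        CyclicBarrier meetInside = new CyclicBarrier(threads);
        AtomicBoolean overlapped = new AtomicBoolean(false);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.execute(() -> {
                lock.lock();
                try {
                    // Wait for the other callers to enter the locked section too.
                    meetInside.await(500, TimeUnit.MILLISECONDS);
                    overlapped.set(true);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } catch (BrokenBarrierException | TimeoutException e) {
                    // The other callers never got inside while this one held the lock.
                } finally {
                    lock.unlock();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return overlapped.get();
    }

    public static void main(String[] args) throws InterruptedException {
        ReentrantReadWriteLock rw = new ReentrantReadWriteLock();
        System.out.println("read lock, callers overlap:  " + canOverlap(rw.readLock(), 4));
        System.out.println("write lock, callers overlap: " + canOverlap(rw.writeLock(), 4));
    }
}
```

Under these assumptions the read-lock case should report that the callers overlap and the write-lock case that they do not, which is the kind of serialization that would keep throughput low even when the CPU is idle.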
@bknbkn
Thanks for your hard work on this. The write lock is indeed deliberate; please see https://github.com/datastrato/gravitino/pull/2260#discussion_r1495156151. In our initial design, we assumed that list operations would be far less frequent than get and load operations in real scenarios, so I chose the second solution in https://github.com/datastrato/gravitino/pull/2260#discussion_r1495156151. I'm also working on improving the overall performance of the Gravitino server, so we should find a compromise here, which needs more test cases and real scenarios. Can you share details of your real-world environment, such as the percentage of Read/Write/List operations?
@yuqi1129 Thanks for the reply. Here is a reference scenario. When I use the Web UI, the whole flow looks like this:
First, it lists all metalakes (using the list metalake API):
Then I choose one metalake, and it calls the get metalake API:
Then the list catalogs API is called:
After the above operations, I use get catalog to obtain a specific catalog.
So in this scenario, the list-to-get ratio is 1:1.
@YxAc @yuqi1129 I have added preliminary test results to the Google Doc; you can take a look: https://docs.google.com/document/d/1GPynb8SbxcIU2s6h3W-mPqeRVqmv0ac9_rBJm63awqs/edit
What would you like to be improved?
Tests show that the list API on the main branch performs poorly. The test environment has 2 metalakes, 1 catalog, 10 schemas, and 100 filesets.
The server has 1 CPU and 4 GB of memory.
The Jetty config is:
We tested with 500 clients. Here is some of the result data:
The response time of list requests is very long:
The TPS is not high (about 100+):
Meanwhile, on the server side, the CPU is not fully used:
How should we improve?
I wonder whether list requests can be optimized further.
Another observation is that read requests of the same type seem to block each other. For example, when testing listCatalog, the responses of other listCatalog requests become significantly slower, but the responses of listMetalake are not affected. I think this mutual exclusion between read requests is unreasonable. Maybe there is a lock shared between read requests.
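One pattern that would produce this behavior is an exclusive lock kept per resource. The sketch below is only an illustration of that hypothesis, not a description of how Gravitino's locking actually works: the class `PerResourceLockSketch`, its helpers, and the resource keys `"catalogs"` and `"metalakes"` are all made-up names. It shows that while one request holds the exclusive lock for a resource, a second request on the same resource cannot enter, whereas a request on a different resource can.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical demo class; the resource keys are illustrative only.
public class PerResourceLockSketch {

    // One read/write lock per resource key, created on first use.
    private static final ConcurrentHashMap<String, ReentrantReadWriteLock> LOCKS =
        new ConcurrentHashMap<>();

    // Assumption under test: every request on a resource takes its exclusive (write) lock.
    public static Lock exclusiveLockFor(String resource) {
        return LOCKS.computeIfAbsent(resource, k -> new ReentrantReadWriteLock()).writeLock();
    }

    // Returns true if a request running on another thread could enter `resource` right now.
    public static boolean otherThreadCanEnter(String resource) {
        return CompletableFuture.supplyAsync(() -> {
            Lock lock = exclusiveLockFor(resource);
            boolean acquired = lock.tryLock();
            if (acquired) {
                lock.unlock();
            }
            return acquired;
        }).join();
    }

    public static void main(String[] args) {
        // Simulate one in-flight list request that holds the "catalogs" lock.
        Lock inFlight = exclusiveLockFor("catalogs");
        inFlight.lock();
        try {
            System.out.println("second catalogs request can enter: "
                + otherThreadCanEnter("catalogs"));
            System.out.println("metalakes request can enter:       "
                + otherThreadCanEnter("metalakes"));
        } finally {
            inFlight.unlock();
        }
    }
}
```

If list requests took the shared read lock instead (`readLock()` in place of `writeLock()`), concurrent requests on the same resource would no longer exclude each other in this sketch. Whether that is safe for the real code depends on what the list path modifies.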