apache / gravitino

World's most powerful open data catalog for building a high-performance, geo-distributed and federated metadata lake.
https://gravitino.apache.org
Apache License 2.0
958 stars 302 forks source link

[Improvement] create and alter failed since the operations in catalog are async #3729

Closed mchades closed 1 month ago

mchades commented 4 months ago

What would you like to be improved?

there is a reload action in object (table and topic) alteration and creation, which used to obtain some values generated by the underlying catalog

The create and alter may be failed since the operations in the catalog are async:

How should we improve?

Here are three solutions I can think of right now:

  1. remove the reload action in creation and alteration If we choose to remove the reload operation, then we may need to refactor the API so that the return value is viod, otherwise it will result in an inconsistency with the manual load

  2. add await timeout to the reload action If the corresponding operation is not completed after the timeout, whether to return success or failure to the client is a point that needs to be considered.

mchades commented 4 months ago

@shaofengshi @jerryshao cc

noidname01 commented 3 months ago

Hi @mchades, I want to try to solve this issue.

I have done some code tracing from createTopic in TopicOperationDispatcher. I think how about just make createTopic synchronous by using like createTopicsResult.all().whenComplete() in KafkaCatalogOperations and forming new properties just like loadTopic implementation. But it may have some performance issues, so just rewriting reload action to try and timeout is also a common solution. WDYT?

mchades commented 3 months ago

Hi @mchades, I want to try to solve this issue.

I have done some code tracing from createTopic in TopicOperationDispatcher. I think how about just make createTopic synchronous by using like createTopicsResult.all().whenComplete() in KafkaCatalogOperations and forming new properties just like loadTopic implementation. But it may have some performance issues, so just rewriting reload action to try and timeout is also a common solution. WDYT?

@noidname01 Sorry for the late reply.

What's the different between createTopicsResult.all().whenComplete() and current usage of createTopicsResult.topicId().get(): https://github.com/datastrato/gravitino/blob/282fc0d488177b7944c99fa2806dee16143f37fe/catalogs/catalog-kafka/src/main/java/com/datastrato/gravitino/catalog/kafka/KafkaCatalogOperations.java#L229

Does createTopicsResult.all().get() really ensure the synchronization of creation? If this method can ensure synchronous creation, I think we can use it first, and performance issues can be considered in a separate PR.

noidname01 commented 3 months ago

@mchades Sorry for the late reply, too.

I've done some research in how Kafka creates the topics. It seems there's no difference between createTopicsResult.all().whenComplete() and createTopicsResult.topicId().get(), because the Kafka server will return metadata all at once. And even though the create method returns, the topic also needs to be propagated in Kafka system. So there is an unmeasurable time diff between the time that the create method returns and the time the created topic can be described.

To really solve this, I think add retry mechanism on reload action might be a feasible solution. WDYT?

mchades commented 3 months ago

@mchades Sorry for the late reply, too.

I've done some research in how Kafka creates the topics. It seems there's no difference between createTopicsResult.all().whenComplete() and createTopicsResult.topicId().get(), because the Kafka server will return metadata all at once. And even though the create method returns, the topic also needs to be propagated in Kafka system. So there is an unmeasurable time diff between the time that the create method returns and the time the created topic can be described.

To really solve this, I think add retry mechanism on reload action might be a feasible solution. WDYT?

Thanks for your feedback! So the conclusions and solutions obtained at present are as I described in the description and you tend to prefer option 2, right?

How should we improve?

Here are three solutions I can think of right now:

  1. remove the reload action in creation and alteration If we choose to remove the reload operation, then we may need to refactor the API so that the return value is viod, otherwise it will result in an inconsistency with the manual load

  2. add await timeout to the reload action If the corresponding operation is not completed after the timeout, whether to return success or failure to the client is a point that needs to be considered.

noidname01 commented 3 months ago

Yes, you're right.

mchades commented 3 months ago

@noidname01 After discussing with @jerryshao offline, it is better to only return the properties passed by the user during creation. This change allows us to eliminate the reload action from the creation process.

noidname01 commented 3 months ago

@noidname01 After discussing with @jerryshao offline, it is better to only return the properties passed by the user during creation. This change allows us to eliminate the reload action from the creation process.

So it is basically option 1 above but the return value is changed from void to Topic created using properties passed by the user during creation, instead of getting properties from reload action.

mchades commented 3 months ago

@noidname01 After discussing with @jerryshao offline, it is better to only return the properties passed by the user during creation. This change allows us to eliminate the reload action from the creation process.

So it is basically option 1 above but the return value is changed from void to Topic created using properties passed by the user during creation, instead of getting properties from reload action.

yeah. BTW, not only for creating topic, but the table and fileset are also need to bechanged

mchades commented 2 months ago

Hi @noidname01 , are you still working on this? Do you need any updates or assistance?

noidname01 commented 2 months ago

@mchades Yeah, but I'm busy solving other issues like: #4012 and #3755

I've tried the solution we've been discussed, but the integration test would fail because there are some properties need to be filled by the server (like PARTITION_COUNT and REPLICATION_FACTOR in KafkaTopic), so I will work on modifying the tests if this solution is acceptable.

mchades commented 2 months ago

@mchades Yeah, but I'm busy solving other issues like: #4012 and #3755

I've tried the solution we've been discussed, but the integration test would fail because there are some properties need to be filled by the server (like PARTITION_COUNT and REPLICATION_FACTOR in KafkaTopic), so I will work on modifying the tests if this solution is acceptable.

I think it's acceptable, but we should clarify this in the API comment.

noidname01 commented 2 months ago

@mchades Sure, will work on this and finish ASAP.

mchades commented 1 month ago

Hi @noidname01 , Is there any progress on this issue recently?

I would like to have it completed by the end of this week. If you have a draft locally, you can submit a PR, and then I can build on your work.

noidname01 commented 1 month ago

@mchades Sorry for the inactiveness in this PR. I've created a draft PR, I have done the main logic modification, the remaining to-do works is modifying the logic of ITs, which still use reload logic.