datastrato / gravitino

World's most powerful data catalog service with providing a high-performance, geo-distributed and federated metadata lake.
https://datastrato.ai/docs/
Apache License 2.0
401 stars 166 forks source link

[#2527] fix: Fix catalog isolated classloaders can't be GC #2548

Closed yuqi1129 closed 1 month ago

yuqi1129 commented 2 months ago

What changes were proposed in this pull request?

Call AbandonedConnectionCleanupThread.uncheckedShutdown() to close the checking thread for connection.

Why are the changes needed?

It's a bug.

Fix: #2527

Does this PR introduce any user-facing change?

N/A.

How was this patch tested?

Test locally.

FANNG1 commented 2 months ago

It seems a little hacky, there may be other similar problems caused by multi classloaders. Could we reuse the classloader created by the expired catalog?

yuqi1129 commented 2 months ago

It seems a little hacky, there may be other similar problems caused by multi classloaders. Could we reuse the classloader created by the expired catalog?

I have thought about it, and in some cases, we can't reuse it, for example, if the users set the package path or change configurations that need to be reloaded, we can't use the old one.

yuqi1129 commented 2 months ago

@jerryshao What are your thoughts on bug solutions? The two methods mentioned above are not very appealing on their own.

diqiu50 commented 2 months ago

@yuqi1129 I don't think this change is good. Can we make sure that we have cleaned up all resources when closing the catalog? If we want to consider reuse, we first need to ensure that we can clean up thoroughly.

yuqi1129 commented 2 months ago

@diqiu50

Can we make sure that we have cleaned up all resources when closing the catalog?

In response to this query, I was unable to give a 100% YES or NO, given that it takes time and effort to examine all the potential resources in the catalog class loader, and some may require unique situations to be triggered.

Can you offer some practical advice?

FANNG1 commented 2 months ago

I'm thinking could we remove the expiration feature of catalogs? because there wouldn't be many catalogs and may produce potential classloader problems which are not easy to solve. @jerryshao what do you think?

jerryshao commented 2 months ago

a) Is this just an mysql issue, or it also has issue in PG?

if the users set the package path or change configurations that need to be reloaded, we can't use the old one.

package is an immutable catalog property, it cannot be altered after creation, this is the current design.

I'm thinking could we remove the expiration feature of catalogs? because there wouldn't be many catalogs and may produce potential classloader problems which are not easy to solve. @jerryshao what do you think?

This could be one solution, I'm wondering if there are other solutions.

Besides, this may not be only the MySQL issue, it seems like a classpath leakage problem, fixing MySQL here is not a thorough fix, we'd better figure out a better way.

yuqi1129 commented 2 months ago

Is this just an mysql issue, or it also has issue in PG?

Nope, I've verified that PG does not share the same issues and that no leaked thread has been located.

yuqi1129 commented 2 months ago

Then what if we only reuse the class loader when doing alter operations, I'm afraid there will be some unknown problem introduced by this change if we reuse them globally. Although I can't guarantee it will happen 100%.

@FANNG1 @jerryshao @diqiu50

jerryshao commented 2 months ago

It should have a thorough fix at the CatalogManager, right? I would suggest that we should have a better investigation and a thorough solution. Here is just one issue, there might be others. We should have a doc to list all the potential issues related to classpath and figure out a thorough solution.

yuqi1129 commented 2 months ago

@jerryshao @FANNG1 @diqiu50 I have drafted a document regarding the classloader issue. Due to capacity constraints, please correct me if I have not thought things and any suggestion is welcomed.

https://docs.google.com/document/d/16dQbOsscPBgzxwSsfH6yfP0A7o_CqvTfp4QCG9GpVTM/edit#heading=h.7cqv6s53t97o

yuqi1129 commented 2 months ago

@jerryshao The reason why IsolatedClassLaoder instances can't be unloaded is that

  1. Thread AbandonedConnectionCleanupThread is running and uses classes that are loaded by them.
  2. DriverManager has a reference to JDBCMySQLDriver that is loaded by IsolatedClassloader instance.

After making the necessary changes, the GC will be able to clean them effectively.

image image
jerryshao commented 2 months ago

Can you please check if other catalogs like Hive, Iceberg, and PG can correctly unload the classloader?

yuqi1129 commented 2 months ago

Can you please check if other catalogs like Hive, Iceberg, and PG can correctly unload the classloader?

Okay, I will check them all.

jerryshao commented 1 month ago

Is it ready for review?

yuqi1129 commented 1 month ago

Is it ready for review?

Yeah, It's ready now

jerryshao commented 1 month ago

@FANNG1 please help to review this PR.

jerryshao commented 1 month ago

Generally LGTM, @FANNG1 can you please take another look?