Closed Zhangshunyu closed 2 years ago
@Zhangshunyu thanks for the analysis. I see you've already done some profiling and code analysis.
When a session is released, the corresponding cache of the session is not released, causing the cache to accumulate until OOM
Would you take it a step further and share the related code paths for this issue? Or maybe even file a PR directly, and we can discuss further from there?
@xushiyan Hi Shiyan, thanks for your reply. I found that this is a Spark problem: Spark didn't release the cache for a session when that session was closed. We have fixed this in Spark.
@Zhangshunyu do you mean you fixed it in the Spark source code? It would be interesting to see the upstream patch for Spark.
Hudi's current driver-side cache management has several problems:
1) The cache is only shared within a session. Because it is not shared across sessions, the cache information for the same table is loaded repeatedly.
2) When a session is released, its cache is not released, so the cache accumulates until the driver runs out of memory (OOM).
3) When a session is first created, querying a table builds the relation for that table, and all file status information is loaded while the relation is being built.
Combining the three points above leads to the following results:
1) When multiple sessions connect to the same driver and execute concurrently, cache memory is multiplied by the number of sessions (memory * N), which eventually leads to driver OOM.
2) Each query has to build a relation, which is equivalent to executing the first query every time.
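To make the "memory * N" effect concrete, here is a minimal toy model in Python. It is not Hudi or Spark code: `LeakySessionCache`, `query`, and `close` are hypothetical names, and the file listing is fake data. It only assumes what the points above describe, namely a cache keyed per session whose entries are never dropped when the session closes.

```python
class LeakySessionCache:
    """Toy model (hypothetical): cache keyed by session, never cleaned on close."""

    def __init__(self):
        # (session id, table name) -> cached file listing
        self.entries = {}

    def query(self, session_id, table):
        key = (session_id, table)
        if key not in self.entries:
            # Each new session reloads the listing for the same table.
            self.entries[key] = [f"{table}/file-{i}" for i in range(1000)]
        return self.entries[key]

    def close(self, session_id):
        # The behaviour being modelled: closing a session evicts nothing.
        pass


cache = LeakySessionCache()
sizes = []
for n in range(1, 6):
    sid = f"session{n}"
    cache.query(sid, "t1")   # same table every time
    cache.close(sid)
    sizes.append(len(cache.entries))
```

In this model the number of retained listings grows by one per session even though every session queries the same table and has already been closed.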
After session1 runs a query and is closed, its cache is not released:
After session2 runs a query and is closed, its cache is not released either, and the heap keeps growing over time:
num  #instances  #bytes  class name
As more sessions connect, run queries, and close, the driver will eventually OOM.
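One possible way to avoid this accumulation, sketched as a toy model in Python rather than as the actual Spark or Hudi fix: keep a single cache entry per table, track which sessions use it, and drop the entry once the last of those sessions closes. `SharedTableCache`, `release_session`, and `fake_loader` are hypothetical names introduced only for illustration.

```python
from collections import defaultdict


class SharedTableCache:
    """Toy driver-side cache (hypothetical): one entry per table, shared by sessions."""

    def __init__(self, loader):
        self._loader = loader            # hypothetical: table name -> file listing
        self._entries = {}               # table name -> cached file listing
        self._users = defaultdict(set)   # table name -> ids of sessions using it

    def get(self, session_id, table):
        # Load the listing once, no matter how many sessions ask for it.
        if table not in self._entries:
            self._entries[table] = self._loader(table)
        self._users[table].add(session_id)
        return self._entries[table]

    def release_session(self, session_id):
        # Called when a session closes: drop entries no open session still uses.
        for table in list(self._users):
            self._users[table].discard(session_id)
            if not self._users[table]:
                del self._users[table]
                del self._entries[table]

    def size(self):
        return len(self._entries)


loads = []


def fake_loader(table):
    # Stand-in for the expensive file status listing.
    loads.append(table)
    return [f"{table}/file-{i}.parquet" for i in range(3)]


cache = SharedTableCache(fake_loader)
cache.get("session1", "t1")
cache.get("session2", "t1")          # reuses the listing loaded for session1
cache.release_session("session1")
size_after_first_close = cache.size()
cache.release_session("session2")
size_after_second_close = cache.size()
```

Here the table is loaded once for both sessions, stays cached while session2 is still open, and is freed when session2 closes. A real implementation would also need thread safety and probably a size or time bound, which this sketch leaves out.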