apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

Hudi's current driver cache management has some problems #6075

Closed Zhangshunyu closed 2 years ago

Zhangshunyu commented 2 years ago

Hudi's current driver cache management has some problems:
1) The cache is shared only within a session. Because it is not shared across sessions, the cache information for the same table is loaded repeatedly.
2) When a session is released, the cache belonging to that session is not released, so the cache accumulates until the driver runs out of memory (OOM).
3) When a session first queries a table, it creates the relation for that table, and all file status information is loaded while the relation is being built.

Combined, these three points lead to the following results:
1) When multiple sessions connect to the same driver and execute concurrently, memory usage is multiplied by N, which eventually leads to a driver OOM.
2) Each query needs to build a relation, so every query costs as much as a first query.
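The accumulation described above can be illustrated with a minimal sketch. This is not Hudi's or Spark's actual code: `SessionScopedCache`, `getOrLoad`, `releaseSession`, and the table path are hypothetical names, and the file listing is a stand-in. It only shows why a cache keyed by session grows with every session unless something removes the entries when the session closes.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical driver-side cache; not the real Hudi or Spark classes. */
public class SessionScopedCache {
    // sessionId -> (tablePath -> cached file listing)
    private final Map<String, Map<String, List<String>>> bySession = new ConcurrentHashMap<>();

    /** Returns this session's cached listing, loading it on first access. */
    public List<String> getOrLoad(String sessionId, String tablePath) {
        return bySession
            .computeIfAbsent(sessionId, id -> new ConcurrentHashMap<>())
            .computeIfAbsent(tablePath, SessionScopedCache::listFiles);
    }

    /** Drops everything cached for a session; without this call the entries stay reachable. */
    public void releaseSession(String sessionId) {
        bySession.remove(sessionId);
    }

    /** Total number of cached table listings across all sessions. */
    public int cachedTableCount() {
        return bySession.values().stream().mapToInt(Map::size).sum();
    }

    // Stand-in for an expensive file listing (assumption for illustration).
    private static List<String> listFiles(String tablePath) {
        return List.of(tablePath + "/part-0.parquet", tablePath + "/part-1.parquet");
    }

    public static void main(String[] args) {
        SessionScopedCache cache = new SessionScopedCache();
        cache.getOrLoad("session1", "/warehouse/t1");
        cache.getOrLoad("session2", "/warehouse/t1");
        // The same table is now cached once per session.
        System.out.println(cache.cachedTableCount()); // prints 2

        cache.releaseSession("session1");
        cache.releaseSession("session2");
        System.out.println(cache.cachedTableCount()); // prints 0
    }
}
```

If `releaseSession` is never called, the two copies in this sketch stay in memory after both sessions are gone, which is the pattern the heap histograms in this issue point to.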

After session1 runs a query and is closed, the cache is not released:

 num     #instances         #bytes  class name
----------------------------------------------
   1:      34027608     5576638600  [C
   2:      34026743     1088855776  java.lang.String
   3:       5648616      451889280  java.util.TreeMap
   4:       2824225      406688400  java.net.URI
   5:       2824144      316304128  org.apache.hadoop.fs.FileStatus
   6:       2825322      180820608  java.util.TreeMap$Entry
   7:       2824144      158152064  org.apache.hudi.common.model.HoodieBaseFile
   8:       2824148      135559104  org.apache.hadoop.fs.permission.FsPermission
   9:       2824144      135558912  org.apache.hudi.common.model.FileSlice
  10:       2824144      135558912  org.apache.hudi.common.model.HoodieFileGroup
  11:       2824144       90372608  org.apache.hudi.common.model.HoodieFileGroupId
  12:       2824268       67782432  java.util.TreeSet
  13:       2824233       67781592  org.apache.hudi.common.util.Option
  14:       2824200       67780800  org.apache.hadoop.fs.Path
  15:       2824144       67779456  java.util.Collections$ReverseComparator2
  16:       2824144       67779456  java.util.TreeMap$Values
  17:       2824144       67779456  org.apache.hudi.common.model.HoodieLogFile$LogFileComparator

After session2 runs a query and is closed, the cache is still not released, and the heap keeps growing over time:

 num     #instances         #bytes  class name
----------------------------------------------
   1:     107472293    16876640256  [C
   2:     107470913     3439069216  java.lang.String
   3:      16945193     1355615440  java.util.TreeMap
   4:       8472599     1220054256  java.net.URI
   5:       5648288      632608256  org.apache.hadoop.fs.FileStatus
   6:       8473607      542310848  java.util.TreeMap$Entry
   7:       8472432      474456192  org.apache.hudi.common.model.HoodieBaseFile
   8:       8472435      406676880  org.apache.hadoop.fs.permission.FsPermission
   9:       8472432      406676736  org.apache.hudi.common.model.FileSlice
  10:       8472432      406676736  org.apache.hudi.common.model.HoodieFileGroup
  11:       2824144      316304128  org.apache.hadoop.fs.obs.OBSFileStatus
  12:       8472432      271117824  org.apache.hudi.common.model.HoodieFileGroupId
  13:       8472568      203341632  org.apache.hadoop.fs.Path
  14:       8472557      203341368  java.util.TreeSet
  15:       8472521      203340504  org.apache.hudi.common.util.Option
  16:       8472432      203338368  java.util.Collections$ReverseComparator2
  17:       8472432      203338368  java.util.TreeMap$Values
  18:       8472432      203338368  org.apache.hudi.common.model.HoodieLogFile$LogFileComparator
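A quick arithmetic check on the two histograms above makes the growth easier to see. The sketch below copies instance counts from both snapshots and tests whether the later count is an exact multiple of the earlier one. The class and method names (`HistogramGrowth`, `exactMultiple`) are hypothetical and used only for this illustration.

```java
public class HistogramGrowth {
    /** Returns after / before when it divides evenly, or -1 otherwise. */
    static long exactMultiple(long before, long after) {
        return (after % before == 0) ? after / before : -1;
    }

    public static void main(String[] args) {
        // Instance counts copied from the two histograms in this issue.
        long baseFileBefore = 2_824_144L;   // HoodieBaseFile, first snapshot
        long baseFileAfter = 8_472_432L;    // HoodieBaseFile, second snapshot
        long fileStatusAfter = 5_648_288L;  // FileStatus, second snapshot
        long obsFileStatusAfter = 2_824_144L; // OBSFileStatus, second snapshot

        // HoodieBaseFile instances grew by an exact factor.
        System.out.println(exactMultiple(baseFileBefore, baseFileAfter)); // prints 3

        // FileStatus plus OBSFileStatus adds up to the same total.
        System.out.println(fileStatusAfter + obsFileStatusAfter == baseFileAfter); // prints true

        // Bytes per HoodieBaseFile instance are unchanged between snapshots.
        System.out.println(158_152_064L / baseFileBefore); // prints 56
        System.out.println(474_456_192L / baseFileAfter);  // prints 56
    }
}
```

The `HoodieBaseFile`, `FileSlice`, and `HoodieFileGroup` counts all rise from 2824144 to 8472432, exactly three times the first snapshot. That is consistent with whole copies of the same table's file metadata being retained rather than new data being added, though the histograms alone do not show which references keep those copies alive.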

As more sessions connect, run queries, and close, the driver will eventually OOM.

xushiyan commented 2 years ago

@Zhangshunyu thanks for the analysis. I see you've already done some profiling and code analysis.

When a session is released, the corresponding cache of the session is not released, causing the cache to accumulate until oom

Would you take it a step further and share the related code paths for this issue? Or maybe even file a PR directly, and we can discuss further from there?

Zhangshunyu commented 2 years ago

@xushiyan Hi shiyan, thanks for your reply. I found that this is a Spark problem: Spark didn't release the cache for a session when the session was closed. We have fixed this in Spark.

xushiyan commented 2 years ago

@Zhangshunyu you mean it was fixed in the Spark source code? It would be interesting to see the upstream patch for Spark.