Memory leak due to hessian2, FutureContext, and RpcContext

apache / dubbo

The java implementation of Apache Dubbo. An RPC and microservice framework.

https://dubbo.apache.org/

Apache License 2.0

Closed icankeep closed 3 years ago

icankeep commented 3 years ago

[x] I have searched the issues of this repository and believe that this is not a duplicate.
[x] I have checked the FAQ of this repository and believe that this is not a duplicate.

Related issue: #7271

Demo to reproduce: memory leak demo

OOM error log: [screenshot]

Heap memory objects: [screenshot]

ThreadLocal reference: [screenshot]

Everything works fine when I use kryo instead of hessian2.

icankeep commented 3 years ago

org.apache.dubbo.common.serialize.hessian2.Hessian2ObjectInput#cleanup

  public void cleanup() {
      if (mH2i != null) {
          mH2i.reset();
      }
  }

Changing it to

  public void cleanup() {
      if (mH2i != null) {
          mH2i.reset();
      }
      // Also drop this thread's cached entry so the thread no longer references it.
      INPUT_TL.remove();
  }

makes the problem go away.
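To make the difference between the two cleanup variants concrete, here is a minimal sketch. It does not use Dubbo's classes: `ReusableInput`, its growing buffer, and the two `cleanup...` helpers are hypothetical stand-ins, and the assumption that `reset()` rewinds without releasing the buffer is mine, made only for illustration. Only `ThreadLocal.withInitial`, `get()`, and `remove()` are real JDK calls.

```java
/** Hypothetical stand-in for a per-thread cached deserializer; not Dubbo's real class. */
final class CachedInputSketch {

    /** Hypothetical reusable reader whose internal buffer grows to fit the largest payload seen. */
    static final class ReusableInput {
        byte[] buffer = new byte[0];
        int length;

        void read(byte[] payload) {
            if (buffer.length < payload.length) {
                buffer = new byte[payload.length];   // grow once, then keep for reuse
            }
            System.arraycopy(payload, 0, buffer, 0, payload.length);
            length = payload.length;
        }

        /** Assumption for this sketch: reset() rewinds but does not release the buffer. */
        void reset() {
            length = 0;
        }
    }

    // One cached instance per thread, created lazily on the first get().
    static final ThreadLocal<ReusableInput> INPUT_TL = ThreadLocal.withInitial(ReusableInput::new);

    /** Variant 1: reset only; the per-thread instance and its buffer stay cached. */
    static void cleanupKeepingCache() {
        INPUT_TL.get().reset();
    }

    /** Variant 2: reset, then drop the per-thread instance so the thread stops referencing it. */
    static void cleanupDroppingCache() {
        INPUT_TL.get().reset();
        INPUT_TL.remove();
    }

    public static void main(String[] args) {
        ReusableInput first = INPUT_TL.get();
        first.read(new byte[1024]);

        cleanupKeepingCache();
        System.out.println("same instance after reset only: " + (INPUT_TL.get() == first));
        System.out.println("buffer still cached (bytes):    " + INPUT_TL.get().buffer.length);

        cleanupDroppingCache();
        // get() after remove() runs the initializer again and yields a fresh instance.
        System.out.println("same instance after remove():   " + (INPUT_TL.get() == first));
        System.out.println("buffer after remove() (bytes):  " + INPUT_TL.get().buffer.length);
    }
}
```

The trade-off in this sketch is the one discussed in the thread: keeping the instance avoids re-creating it on every call, while removing it means an idle thread no longer holds whatever the instance retained.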

icankeep commented 3 years ago

After adding remove() to hessian2 and debugging again, the OOM still occurs. A heap dump shows that the ThreadLocal still holds a reference to FutureContext.

AlbumenJ commented 3 years ago

In my opinion, this may not be a memory leak. It looks more like Dubbo has received a large number of requests. In https://github.com/icankeep/dubbo-memory-leak-issue/blob/0c062f8fdff85cf4934b126d19e6f60352c8c1a5/consumer/src/main/java/demo/consumer/HelloController.java#L71, you create a lot of RPC requests with large payloads, and you have specified quite a small heap (Xmx300M). With a small heap, GC will undoubtedly run frequently, and the ThreadLocal, as a weak reference, will always be removed as well.

icankeep commented 3 years ago

But in this demo, GC did not reclaim the hessian2 data held in the ThreadLocal, nor did it clean up the result references in RpcContext and FutureContext.

In ThreadLocalMap.Entry, only the key is a weak reference; the value is a strong reference.
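The point about strong values can be checked with plain JDK classes. The following is a minimal sketch, not taken from the demo repository: `PAYLOAD_TL`, `storePayload`, and `requestGc` are names invented here. It keeps only a `WeakReference` to the payload, so the only strong path to it is through the current thread's ThreadLocalMap entry.

```java
import java.lang.ref.WeakReference;

final class ThreadLocalValueSketch {

    static final ThreadLocal<byte[]> PAYLOAD_TL = new ThreadLocal<>();

    /** Stores a payload in the current thread's ThreadLocalMap and returns only a weak handle to it. */
    static WeakReference<byte[]> storePayload(int size) {
        byte[] payload = new byte[size];
        PAYLOAD_TL.set(payload);
        return new WeakReference<>(payload);
    }

    /** Requests GC a few times; System.gc() is only a hint, so this is best effort. */
    static void requestGc() throws InterruptedException {
        for (int i = 0; i < 5; i++) {
            System.gc();
            Thread.sleep(50);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        WeakReference<byte[]> handle = storePayload(8 * 1024 * 1024);

        requestGc();
        // Still reachable: live thread -> ThreadLocalMap -> Entry.value (strong reference).
        System.out.println("reachable before remove(): " + (handle.get() != null));

        PAYLOAD_TL.remove();
        requestGc();
        // Usually collected by now, but the JVM does not guarantee when a collection happens.
        System.out.println("reachable after remove():  " + (handle.get() != null));
    }
}
```

The first line always prints `true`, because a strongly reachable object is never collected no matter how often GC runs. The second line typically prints `false`, although that depends on the collector.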

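Even after changing hessian2 as in the PR, if a large number of threads on the consumer side make calls that transfer large amounts of data, RpcContext and FutureContext must also be removed in a finally block; otherwise this still causes fairly serious memory problems.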
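The finally-based cleanup can be sketched without any Dubbo dependency. In the example below, `CALL_RESULT` is a hypothetical thread-local standing in for RpcContext / FutureContext, and the two `call...` methods only simulate an RPC that leaves a large result in the context; none of this is Dubbo's actual API. A single-thread executor is used so that consecutive tasks are guaranteed to run on the same worker thread.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class ContextCleanupSketch {

    /** Hypothetical per-thread call context standing in for RpcContext / FutureContext. */
    static final ThreadLocal<byte[]> CALL_RESULT = new ThreadLocal<>();

    /** Simulates a call that leaves its large result behind in the thread-local context. */
    static int callWithoutCleanup(int size) {
        CALL_RESULT.set(new byte[size]);
        return CALL_RESULT.get().length;
    }

    /** Same simulated call, but the context is removed in finally, even if the call throws. */
    static int callWithCleanup(int size) {
        try {
            CALL_RESULT.set(new byte[size]);
            return CALL_RESULT.get().length;
        } finally {
            CALL_RESULT.remove();
        }
    }

    /** Runs on the worker thread and reports whether an earlier task left data behind. */
    static boolean leftoverOnWorker(ExecutorService pool) throws Exception {
        return pool.submit(() -> CALL_RESULT.get() != null).get();
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            pool.submit(() -> callWithoutCleanup(1 << 20)).get();
            System.out.println("leftover without finally: " + leftoverOnWorker(pool)); // true

            pool.submit(() -> callWithCleanup(1 << 20)).get();
            System.out.println("leftover with finally:    " + leftoverOnWorker(pool)); // false
        } finally {
            pool.shutdown();
        }
    }
}
```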

icankeep commented 3 years ago

I used many threads and a fairly large amount of data only to amplify this memory problem.

icankeep commented 3 years ago

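This is especially noticeable when the thread pool has a large number of core threads, for example 100 core threads running 100 such tasks.

The ThreadLocal in every thread references a large amount of data, which cannot be reclaimed even after GC.

A heap dump taken after the run completes shows the references clearly; see the screenshots above for details.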
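The scaling with the number of core threads can be illustrated with a small sketch. It is not the demo from the linked repository: `RESULT_TL`, `runRound`, and the sizes are invented for illustration. A latch keeps every pool thread busy until all tasks have started, so each thread runs exactly one task per round; the second round stores nothing and only probes what each idle thread still references.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class PoolRetentionSketch {

    /** Hypothetical per-thread slot standing in for a context that keeps the last result. */
    static final ThreadLocal<byte[]> RESULT_TL = new ThreadLocal<>();

    /**
     * Runs one task per pool thread. Each task reports how many bytes its thread held on entry,
     * then stores a new payload of payloadBytes (0 means store nothing).
     */
    static long runRound(ExecutorService pool, int threads, int payloadBytes) throws Exception {
        CountDownLatch allStarted = new CountDownLatch(threads);
        List<Future<Integer>> futures = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            futures.add(pool.submit(() -> {
                byte[] previous = RESULT_TL.get();
                if (payloadBytes > 0) {
                    RESULT_TL.set(new byte[payloadBytes]);
                }
                allStarted.countDown();
                allStarted.await();          // hold this thread until every thread has a task
                return previous == null ? 0 : previous.length;
            }));
        }
        long total = 0;
        for (Future<Integer> f : futures) {
            total += f.get();
        }
        return total;
    }

    public static void main(String[] args) throws Exception {
        int threads = 8;
        int payloadBytes = 1 << 20;          // 1 MiB per thread
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            runRound(pool, threads, payloadBytes);           // tasks finish, nothing is removed
            long retained = runRound(pool, threads, 0);      // probe the now-idle threads
            System.out.println("bytes still referenced by idle pool threads: " + retained);
        } finally {
            pool.shutdown();
        }
    }
}
```

With 8 threads and 1 MiB payloads the probe reports 8388608 bytes, that is, threads × payload size. The same arithmetic with 100 core threads and larger results explains why the effect is so visible in a 300 MB heap.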

AlbumenJ commented 3 years ago

@icankeep

"But in this demo, GC did not reclaim the hessian2 data held in the ThreadLocal, nor did it clean up the result references in RpcContext and FutureContext."

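Here you need to take into account the serialization pressure when the data volume is large. What is cached in the ThreadLocal is there to reduce the overhead of creating the serialization tools. Because the threads that perform serialization and deserialization are relatively fixed (the program always uses the same few threads for serialization while it runs), ThreadLocal is used to improve performance.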