apache / dubbo

The java implementation of Apache Dubbo. An RPC and microservice framework.
https://dubbo.apache.org/
Apache License 2.0

Memory leak due to hessian2, FutureContext, RpcContext #7770

Closed icankeep closed 3 years ago

icankeep commented 3 years ago

Environment

Steps to reproduce this issue

Related issue: #7271

Demo to reproduce: memory leak demo

OOM error log: image

heap memory objects: image

ThreadLocal reference: image

image

Everything works fine when I use kryo instead of hessian2.

icankeep commented 3 years ago

In org.apache.dubbo.common.serialize.hessian2.Hessian2ObjectInput#cleanup, the current code is

  public void cleanup() {
      if (mH2i != null) {
          mH2i.reset();
      }
  }

Changing it to

  public void cleanup() {
      if (mH2i != null) {
          mH2i.reset();
      }
      // Also remove the thread-local entry.
      INPUT_TL.remove();
  }

fixes it.

icankeep commented 3 years ago

I added remove() to hessian2 and debugged again, but it still runs into OOM. A heap dump shows that the ThreadLocal still holds a reference to FutureContext.

image

AlbumenJ commented 3 years ago

In my opinion, this may not be a memory leak. It looks more like Dubbo has received a large number of requests. In https://github.com/icankeep/dubbo-memory-leak-issue/blob/0c062f8fdff85cf4934b126d19e6f60352c8c1a5/consumer/src/main/java/demo/consumer/HelloController.java#L71, you create a lot of RPC requests with large content, and you have specified a quite small heap (Xmx300M). With a small heap, GC will undoubtedly be frequent, and ThreadLocal, as a weak reference, will also always be removed.

icankeep commented 3 years ago

But in this demo, GC did not reclaim the hessian2-related data held in ThreadLocal, nor did it clean up the result references held in RpcContext and FutureContext.

In ThreadLocalMap.Entry, only the key is a weak reference; the value is a strong reference.
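To illustrate the point about strong value references, here is a minimal sketch in plain Java (no Dubbo classes). The class name, the `CONTEXT` holder, and the 1 MB payload are hypothetical stand-ins for something like FutureContext. It avoids relying on GC timing and simply shows that a value set by one task is still reachable from the same pooled thread when the next task runs.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class ThreadLocalRetentionDemo {
    // Hypothetical per-thread holder, standing in for a context object.
    private static final ThreadLocal<byte[]> CONTEXT = new ThreadLocal<>();

    /** Runs two tasks on one pooled thread and reports whether the
     *  second task still sees the payload stored by the first. */
    static boolean secondTaskSeesFirstPayload() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            // Task 1: store a large payload and finish without remove().
            pool.submit(() -> CONTEXT.set(new byte[1024 * 1024])).get();
            // Task 2: same worker thread, so the same ThreadLocalMap.
            return pool.submit(() -> CONTEXT.get() != null).get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("payload retained: " + secondTaskSeesFirstPayload());
    }
}
```

As long as the worker thread stays alive and the ThreadLocal itself is reachable (here it is a static field), the payload stays strongly reachable through the thread's map.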

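Even after changing hessian2 as in the PR, if many threads on the consumer side make calls that transfer large amounts of data, RpcContext and FutureContext must also be removed in a finally block; otherwise this still causes fairly serious memory problems.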
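The cleanup-in-finally idea could look roughly like the sketch below. It deliberately uses plain `ThreadLocal` fields rather than Dubbo's own classes: `RPC_CONTEXT`, `FUTURE_CONTEXT`, and `callAndClean` are hypothetical names for illustration only, and in a real consumer the removal calls would be whatever the Dubbo version in use provides.

```java
import java.util.function.Supplier;

class ContextCleanup {
    // Hypothetical per-thread holders standing in for RpcContext / FutureContext.
    static final ThreadLocal<Object> RPC_CONTEXT = new ThreadLocal<>();
    static final ThreadLocal<Object> FUTURE_CONTEXT = new ThreadLocal<>();

    /** Runs the call and clears both per-thread holders afterwards,
     *  whether the call returns normally or throws. */
    static <T> T callAndClean(Supplier<T> rpcCall) {
        try {
            return rpcCall.get();
        } finally {
            RPC_CONTEXT.remove();
            FUTURE_CONTEXT.remove();
        }
    }

    public static void main(String[] args) {
        String result = callAndClean(() -> {
            // Simulate a call that leaves a large result in the context.
            FUTURE_CONTEXT.set(new byte[1024 * 1024]);
            return "done";
        });
        System.out.println(result + ", context cleared: " + (FUTURE_CONTEXT.get() == null));
    }
}
```

Putting the removal in `finally` means a failed call does not leave its data attached to a long-lived pool thread.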

icankeep commented 3 years ago

I used many threads and a fairly large amount of data only to amplify this memory problem.

icankeep commented 3 years ago

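The effect is especially obvious when the thread pool has a large number of core threads, for example 100 core threads running 100 such tasks.

The ThreadLocal in each thread references a large amount of data, which cannot be reclaimed even after GC.

A heap dump taken after the run completes shows the references clearly; see the screenshots above for details.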

AlbumenJ commented 3 years ago

@icankeep

But in this demo, GC did not reclaim the hessian2-related data held in ThreadLocal, nor did it clean up the result references held in RpcContext and FutureContext.

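This needs to take into account the serialization pressure when the data volume is large. What is cached in ThreadLocal is there to reduce the overhead of creating serialization tools. Because the threads that perform serialization and deserialization are relatively fixed (while the program runs, the same few threads are always used for serialization), ThreadLocal is used to improve performance.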
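The trade-off described here can be sketched with a per-thread reusable buffer. This is only an analogy using the JDK's `ByteArrayOutputStream`, not the hessian2 code; `PerThreadBufferCache` and `encode` are hypothetical names. The cached object is reused across calls to avoid re-creating it, and `reset()` clears its logical content while keeping the already allocated array.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class PerThreadBufferCache {
    // One reusable buffer per thread, created lazily on first use.
    private static final ThreadLocal<ByteArrayOutputStream> BUFFER =
            ThreadLocal.withInitial(() -> new ByteArrayOutputStream(256));

    /** Encodes the message using the calling thread's cached buffer. */
    static byte[] encode(String message) {
        ByteArrayOutputStream out = BUFFER.get();
        try {
            byte[] bytes = message.getBytes(StandardCharsets.UTF_8);
            out.write(bytes, 0, bytes.length);
            return out.toByteArray();
        } finally {
            // reset() sets the count back to zero so the next call starts
            // empty, but the internal array keeps its grown capacity.
            out.reset();
        }
    }

    public static void main(String[] args) {
        System.out.println(encode("hello").length);
        System.out.println(encode("hi").length);
    }
}
```

Reuse saves allocation on hot threads, but whatever the cached object still references after `reset()` lives as long as the thread does, which is why the choice between `reset()` and `remove()` matters for large payloads.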