apache / incubator-hugegraph

A graph database that supports more than 100 billion data points, with high performance and scalability (includes OLTP engine, REST API, and backends)
https://hugegraph.apache.org
Apache License 2.0

Is there a memory leak? #1203

Closed qingchengliu closed 2 years ago

qingchengliu commented 4 years ago

Expected behavior

When an old-generation GC runs, it should be able to reclaim some objects so that old-generation usage does not keep climbing.

Actual behavior

Under uninterrupted 7*24h write and query traffic, old-generation usage climbs steadily. When it reaches the configured threshold, mixed GCs kick in, but each one reclaims less and less old-generation space, until reclamation appears to stop working entirely. Output of jmap -histo:live is as follows:

 num     #instances         #bytes  class name
----------------------------------------------
   1:        214418     4415380528  [B
   2:      86038730     4129859040  java.util.HashMap$Node
   3:      24274162     3393035136  [Ljava.util.HashMap$Node;
   4:      56960255     2278410200  com.baidu.hugegraph.structure.HugeVertexProperty
   5:      24915195     2101834880  [C
   6:      19771980     1265406720  java.util.HashMap
   7:      14656049     1172483920  com.baidu.hugegraph.structure.HugeVertex
   8:      29968600      958995200  java.lang.String
   9:      10052074      884582512  com.baidu.hugegraph.structure.HugeEdge
  10:      10052074      723749328  com.baidu.hugegraph.backend.id.EdgeId
  11:      10041895      642681280  java.util.LinkedHashMap$Entry
  12:      17391100      556515200  java.util.Date
  13:       4528391      398498408  java.util.LinkedHashMap
  14:       9583355      230000520  java.util.HashMap$EntrySet
  15:       9531636      228759264  java.util.HashMap$Values
  16:       5071322      202852880  com.baidu.hugegraph.structure.HugeEdgeProperty
  17:       7204870      172916880  java.lang.Long
  18:       4516839      108404136  com.baidu.hugegraph.backend.id.IdGenerator$StringId
  19:       4516154      108387696  java.util.LinkedHashSet

After diverting all traffic away and waiting a while before manually triggering a full GC (if the full GC is triggered immediately, the heap objects are identical to before the traffic was diverted), the top heap objects are:

 num     #instances         #bytes  class name
----------------------------------------------
   1:        212981     4507197600  [B
   2:        859842      177283576  [C
   3:        737446       76694384  org.codehaus.groovy.runtime.metaclass.MetaMethodIndex$Entry
   4:        646393       72553624  [Ljava.lang.Object;
   5:       1024688       49185024  java.util.HashMap$Node
   6:        842536       26961152  java.lang.String
   7:        498103       23908944  com.huaban.analysis.jieba.DictSegment
   8:         56891       19705320  [Ljava.util.HashMap$Node;
   9:        168377       17511208  org.glassfish.grizzly.memory.PooledMemoryManager$PoolHeapBuffer
  10:        537234       17191488  org.codehaus.groovy.util.FastArray
  11:        356264       17100672  java.util.concurrent.ConcurrentHashMap$Node
  12:         57125       15996792  [Lorg.codehaus.groovy.util.ComplexKeyHashMap$Entry;
  13:        186960       14956800  org.glassfish.grizzly.http.util.ByteChunk
  14:         82539       13206240  java.lang.reflect.Method
  15:        196561       12579904  org.glassfish.grizzly.http.util.CharChunk
  16:        199760       11186560  org.glassfish.grizzly.http.util.BufferChunk
  17:        181404       10158624  org.glassfish.grizzly.http.util.DataChunk
  18:        395788        9498912  java.lang.Double
  19:        186660        8959680  [Lcom.huaban.analysis.jieba.DictSegment;

Steps to reproduce the problem

Client-server connection configuration:

HugeClient client = new HugeClient(new HugeClientBuilder(url, hugeGraphDatabase)
                    .configTimeout(60)
                    .configIdleTime(100)
                    .configPool(maxConns, maxConnsPerRoute));

Write traffic for vertices and edges:

HGClient.writeClient().graph().appendVertexProperty(vertex);
HGClient.writeClient().graph().addVertex(vertex);
HGClient.writeClient().graph().addEdge(e);

Querying the graph with Gremlin bindings:

g.V(VID).outE().hasLabel(LABEL).has('modify_time', gt(START_TIME_30)).inV().dedup().local(inE().hasLabel(LABEL).has('modify_time', gte(START_TIME_180)).order().by('modify_time', desc).limit(200).outV().dedup()).values('score').union(max(), mean())

Server configuration:

max heap: 62064M
JAVA_OPTIONS="${JAVA_OPTIONS} -XX:+UseG1GC -XX:+ParallelRefProcEnabled \
                      -XX:InitiatingHeapOccupancyPercent=40 -XX:G1RSetUpdatingPauseTimePercent=5"
schema.cache_capacity=100000
vertex.cache_capacity=2000000
vertex.cache_expire=60
edge.cache_capacity=200000
edge.cache_expire=60
hbase.threads_max=128

The GC log is as follows:

gc_log.zip
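For context (not from the thread), a back-of-envelope estimate of how much heap the caches configured above could retain. The per-entry retained size is an assumption, not a measured number:

```java
// Rough cache footprint for the capacities configured above.
// The ~1 KiB retained per cached element (id, property map, and
// LinkedHashMap bookkeeping) is an assumed figure.
public class CacheFootprint {
    public static void main(String[] args) {
        long vertexCapacity = 2_000_000L;    // vertex.cache_capacity
        long edgeCapacity = 200_000L;        // edge.cache_capacity
        long assumedBytesPerEntry = 1_024L;  // assumption, not measured
        long totalBytes = (vertexCapacity + edgeCapacity) * assumedBytesPerEntry;
        System.out.printf("approx cache footprint: %.1f GiB%n",
                          totalBytes / (double) (1L << 30));
    }
}
```

Even with a generous per-entry size, the caches account for only a couple of GiB, so they cannot by themselves explain an old generation of tens of GiB.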

javeme commented 4 years ago

@qingchengliu From the information provided, there is no memory leak. The large cache capacity of 2 million is probably causing heavy GC pressure, though with your 1-minute cache expiry the entries should still be reclaimable. If memory is really tight, you can lower the cache capacities to reduce GC pressure, e.g.:

vertex.cache_capacity=10000
edge.cache_capacity=1000
qingchengliu commented 4 years ago

@qingchengliu From the information provided, there is no memory leak. The large cache capacity of 2 million is probably causing heavy GC pressure, though with your 1-minute cache expiry the entries should still be reclaimable. If memory is really tight, you can lower the cache capacities to reduce GC pressure, e.g.:

vertex.cache_capacity=10000
edge.cache_capacity=1000

@javeme As I understand it, a cache of 2 million vertices should correspond to:

30:        198377        9522096  java.util.concurrent.ConcurrentHashMap$Node

Could you tell me where, in my usage scenario, the java.util.HashMap$Node instances are mainly produced?

   2:      86038730     4129859040  java.util.HashMap$Node
   3:      24274162     3393035136  [Ljava.util.HashMap$Node;
qingchengliu commented 4 years ago

@qingchengliu From the information provided, there is no memory leak. The large cache capacity of 2 million is probably causing heavy GC pressure, though with your 1-minute cache expiry the entries should still be reclaimable. If memory is really tight, you can lower the cache capacities to reduce GC pressure, e.g.:

vertex.cache_capacity=10000
edge.cache_capacity=1000

@javeme After adjusting the cache capacities to the suggested sizes, the problem persists. Old-generation utilization keeps rising, and it changes little before and after collection. (screenshot attached)

GC log attached: gc_log.zip

javeme commented 4 years ago

@qingchengliu Judging from one of the later mixed GC logs, the old generation still occupies 34 GB after GC. With the caches already reduced, that is not expected. Please print the top heap objects again with the new parameters, and please also share the client code that reproduces this.

Heap after GC invocations=2735 (full 0):
 garbage-first heap   total 75972608K, used 35714390K [0x00007f85ac000000, 0x00007f85ad0090e8, 0x00007f9c2c000000)
  region size 16384K, 19 young (311296K), 19 survivors (311296K)
 Metaspace       used 117153K, capacity 137030K, committed 147200K, reserved 147456K
}
 [Times: user=1.18 sys=0.00, real=0.05 secs] 
{Heap before GC invocations=2735 (full 0):
 garbage-first heap   total 75972608K, used 39187798K [0x00007f85ac000000, 0x00007f85ad0090e8, 0x00007f9c2c000000)
  region size 16384K, 231 young (3784704K), 19 survivors (311296K)
 Metaspace       used 117163K, capacity 137046K, committed 147200K, reserved 147456K
2020-10-17T21:36:32.655+0800: 372540.709: [GC pause (G1 Evacuation Pause) (mixed), 0.0394267 secs]
   [Parallel Time: 30.7 ms, GC Workers: 28]
      [GC Worker Start (ms): Min: 372540709.3, Avg: 372540709.6, Max: 372540709.8, Diff: 0.6]
      [Ext Root Scanning (ms): Min: 1.3, Avg: 3.0, Max: 24.0, Diff: 22.7, Sum: 84.4]
      [Update RS (ms): Min: 0.0, Avg: 5.3, Max: 6.2, Diff: 6.2, Sum: 148.2]
         [Processed Buffers: Min: 0, Avg: 17.9, Max: 40, Diff: 40, Sum: 500]
      [Scan RS (ms): Min: 0.0, Avg: 1.0, Max: 1.2, Diff: 1.2, Sum: 28.0]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
      [Object Copy (ms): Min: 5.6, Avg: 20.7, Max: 21.4, Diff: 15.8, Sum: 580.0]
      [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.2]
         [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 28]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.5, Diff: 0.4, Sum: 2.7]
      [GC Worker Total (ms): Min: 29.8, Avg: 30.1, Max: 30.5, Diff: 0.7, Sum: 843.6]
      [GC Worker End (ms): Min: 372540739.6, Avg: 372540739.7, Max: 372540739.8, Diff: 0.1]
   [Code Root Fixup: 0.1 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 0.7 ms]
   [Other: 7.9 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 5.0 ms]
      [Ref Enq: 0.5 ms]
      [Redirty Cards: 0.5 ms]
      [Humongous Register: 0.2 ms]
      [Humongous Reclaim: 0.0 ms]
      [Free CSet: 0.7 ms]
   [Eden: 3392.0M(3392.0M)->0.0B(30.7G) Survivors: 304.0M->400.0M Heap: 37.4G(72.5G)->34.1G(72.5G)]
qingchengliu commented 4 years ago

@javeme Here is the top 10 from jmap -histo:live pid after adjusting the parameters:

   1:        214045     4409019960  [B
   2:      91304589     4382620272  java.util.HashMap$Node
   3:      25822197     3609635176  [Ljava.util.HashMap$Node;
   4:      61184880     2447395200  com.baidu.hugegraph.structure.HugeVertexProperty
   5:      26522364     2238671488  [C
   6:      21020536     1345314304  java.util.HashMap
   7:      15583758     1246700640  com.baidu.hugegraph.structure.HugeVertex
   8:      31889683     1020469856  java.lang.String
   9:      10678073      939670424  com.baidu.hugegraph.structure.HugeEdge
  10:      10678073      768821256  com.baidu.hugegraph.backend.id.EdgeId

Server version: 10.4; client version: 1.8.10. Client connection configuration:

HugeClient client = new HugeClient(new HugeClientBuilder(url, "hugegraph")
                    .configTimeout(60)
                    .configIdleTime(100)
                    .configPool(20, 10));

Production write traffic, vertex writes:

Map<String, Object> properties = GraphUtil.toUnderline(node);
LOGGER.debug("VertexServiceImpl updateOrAdd properties={}", properties);
VertexLogic.checkUpdateProperties(properties);
long start = System.currentTimeMillis();
// Populate properties
Vertex vertex = new Vertex((String) properties.get(T.label));
vertex.id(properties.get(T.id));
VertexLogic.attachProperties(vertex, properties);

// Set to true when the update fails, then fall back to insert
boolean updateFail = false;
try {
    HGClient.writeClient().graph().appendVertexProperty(vertex);
    LOGGER.debug("VertexServiceImpl updateOrAdd update node costTime {}", System.currentTimeMillis() - start);
} catch (com.baidu.hugegraph.exception.ServerException ex) {
    // A 404 with java.util.NoSuchElementException means the vertex does not exist; add it instead
    if (ex.status() == HttpStatus.SC_NOT_FOUND && ex.exception().contains("java.util.NoSuchElementException")) {
        updateFail = true;
    } else {
        // Report a system error
        throw new HugeGraphException(ErrorCode.SYSTEM_ERROR.setFormatMsg(ex.getMessage()));
    }
}

if (updateFail) {
    VertexLogic.checkAddProperties(properties);
    try {
        HGClient.writeClient().graph().addVertex(vertex);
    } catch (com.baidu.hugegraph.exception.ServerException ex) {
        // A 400 means invalid parameters
        if (ex.status() == HttpStatus.SC_BAD_REQUEST && ex.exception().contains("java.lang.IllegalArgumentException")) {
            // Report a parameter error
        } else { // System error
            // Report a system error
        }
    }
}
// Report average latency

Production write traffic, edge writes:

Map<String, Object> properties = GraphUtil.toUnderline(edge);
Edge e = new Edge(edge.getLabel());
e.sourceLabel(source.getLabel());
e.sourceId(source.getId());
e.targetLabel(target.getLabel());
e.targetId(target.getId());
EdgeLogic.attachProperties(e, properties);
try {
    e = HGClient.writeClient().graph().addEdge(e);
} catch (ServerException ex) {
    if (ex.status() == HttpStatus.SC_BAD_REQUEST) {
        // Report a parameter error
        throw new HugeGraphException(ErrorCode.PARAM_REQUIRED.setFormatMsg(ex.getMessage()));
    } else {
        throw new HugeGraphException(ErrorCode.SYSTEM_ERROR.setFormatMsg(ex.getMessage()));
    }
}
// Report average latency

Gremlin binding query:

HugeClient client = HGClient.readClient(invoker);
GremlinRequest.Builder builder = client.gremlin().gremlin(sql);
for (Map.Entry<String, Object> bind : binding.entrySet()) {
    builder.binding(bind.getKey(), bind.getValue());
}
ResultSet resultSet = builder.execute();

g.V(VID).outE().hasLabel(LABEL).has('modify_time', gt(START_TIME_30)).inV().dedup().local(inE().hasLabel(LABEL).has('modify_time', gte(START_TIME_180)).order().by('modify_time', desc).limit(200).outV().dedup()).values('score').union(max(), mean())

Write call rate: about 10,000/min, roughly 35 million calls per day (uninterrupted 7*24h).

Query call rate: about 1,000/min, uninterrupted 7*24h.

javeme commented 4 years ago

@qingchengliu How many ConcurrentHashMap$Node instances are there? Or could you paste more, up to the top 40? The vertex cache is 10,000, yet there are ten million vertex objects in the heap, which is abnormal.

The old-generation occupancy is most likely caused by the Gremlin queries. To isolate it, try sending only query requests and see whether the problem reproduces. A minimal reproduction (including a dataset-generation script, the query statements, and the schema statements) would make it much easier to find the cause.

Could you tell me where, in my usage scenario, the java.util.HashMap$Node instances are mainly produced?

They should come from the property maps of vertices or edges.
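A quick cross-check of that explanation against the first histogram in this issue (a sketch, not from the thread; the instance counts are copied from the jmap output, and attributing HashMap$Node to per-element property maps is the hypothesis being tested):

```java
// Cross-check: if every vertex/edge property is one entry in its element's
// property HashMap, property entries should account for the bulk of the
// HashMap$Node instances in the histogram.
public class HistogramCheck {
    public static void main(String[] args) {
        long hashMapNodes = 86_038_730L;  // java.util.HashMap$Node
        long vertexProps = 56_960_255L;   // HugeVertexProperty
        long edgeProps = 5_071_322L;      // HugeEdgeProperty
        long vertices = 14_656_049L;      // HugeVertex

        long propEntries = vertexProps + edgeProps;
        System.out.printf("property entries cover %.0f%% of HashMap$Node%n",
                          100.0 * propEntries / hashMapNodes);
        System.out.printf("average properties per vertex: %.1f%n",
                          (double) vertexProps / vertices);
    }
}
```

Property entries alone cover roughly 72% of the HashMap$Node instances, which is consistent with the property-map explanation.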

qingchengliu commented 4 years ago

@qingchengliu How many ConcurrentHashMap$Node instances are there? Or could you paste more, up to the top 40? The vertex cache is 10,000, yet there are ten million vertex objects in the heap, which is abnormal.

The object top below was printed when the cache capacity was set to around 200,000.

 num     #instances         #bytes  class name
----------------------------------------------
   1:        214045     4409019960  [B
   2:      91304589     4382620272  java.util.HashMap$Node
   3:      25822197     3609635176  [Ljava.util.HashMap$Node;
   4:      61184880     2447395200  com.baidu.hugegraph.structure.HugeVertexProperty
   5:      26522364     2238671488  [C
   6:      21020536     1345314304  java.util.HashMap
   7:      15583758     1246700640  com.baidu.hugegraph.structure.HugeVertex
   8:      31889683     1020469856  java.lang.String
   9:      10678073      939670424  com.baidu.hugegraph.structure.HugeEdge
  10:      10678073      768821256  com.baidu.hugegraph.backend.id.EdgeId
  11:      10670359      682902976  java.util.LinkedHashMap$Entry
  12:      18605941      595390112  java.util.Date
  13:       4829632      425007616  java.util.LinkedHashMap
  14:      10198356      244760544  java.util.HashMap$EntrySet
  15:      10147675      243544200  java.util.HashMap$Values
  16:       5385256      215410240  com.baidu.hugegraph.structure.HugeEdgeProperty
  17:       7777191      186652584  java.lang.Long
  18:       4815997      115583928  java.util.LinkedHashSet
  19:       4815429      115570296  com.baidu.hugegraph.backend.id.IdGenerator$StringId
  20:       4050824       97219776  java.lang.Integer
  21:       3573799       85771176  java.lang.Double
  22:        392641       40834664  org.codehaus.groovy.runtime.metaclass.MetaMethodIndex$Entry
  23:        352380       35054488  [Ljava.lang.Object;
  24:        498103       23908944  com.huaban.analysis.jieba.DictSegment
  25:        251085       20086800  org.glassfish.grizzly.http.util.ByteChunk
  26:        168939       17569656  org.glassfish.grizzly.memory.PooledMemoryManager$PoolHeapBuffer
  27:        260686       16683904  org.glassfish.grizzly.http.util.CharChunk
  28:        263885       14777560  org.glassfish.grizzly.http.util.BufferChunk
  29:        244318       13681808  org.glassfish.grizzly.http.util.DataChunk
  30:        202869        9737712  java.util.concurrent.ConcurrentHashMap$Node
  31:        285774        9144768  org.codehaus.groovy.util.FastArray
  32:        186660        8959680  [Lcom.huaban.analysis.jieba.DictSegment;
  33:         30455        8529192  [Lorg.codehaus.groovy.util.ComplexKeyHashMap$Entry;
  34:         43653        6984480  java.lang.reflect.Method
  35:          2182        4506256  [Lorg.codehaus.groovy.runtime.metaclass.MetaMethodIndex$Entry;
  36:        173187        4156488  java.util.concurrent.atomic.AtomicInteger
  37:         23040        3954408  java.lang.Class
  38:         78304        3758592  org.codehaus.groovy.util.SingleKeyHashMap$Entry
  39:           266        3356848  [Ljava.util.concurrent.ConcurrentHashMap$Node;
  40:         69623        2784920  org.glassfish.grizzly.http.util.MimeHeaderField

Try sending only query requests and see whether the problem reproduces. A minimal reproduction (including a dataset-generation script, the query statements, and the schema statements) would make it much easier to find the cause.

I am trying to reproduce it in a project environment; I don't dare make arbitrary changes in production.

javeme commented 4 years ago

@qingchengliu Is the minimal reproduction code ready? @zhoney @lxb1111 please take a look.

HerovNL commented 4 years ago

Our backend here is MySQL. During a code scan we found that in MysqlSessions.Session, the temporary Statement created in select(sql) and the resulting ResultSet are never closed anywhere. We ended up designing a dynamic proxy class that wraps the Statement and ResultSet, exposes the ResultSet interface, records each instance, and releases them all when the task or thread ends.
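The workaround described above can be sketched roughly like this (class and method names are illustrative, not the actual hugegraph MysqlSessions code):

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// A dynamic proxy that presents the ResultSet interface but also closes the
// Statement that produced it, plus a registry so a task or thread can
// release anything left open when it finishes.
public final class TrackedResults {

    private static final List<ResultSet> OPEN = new CopyOnWriteArrayList<>();

    public static ResultSet wrap(Statement stmt, ResultSet rs) {
        InvocationHandler h = (proxy, method, args) -> {
            Object ret = method.invoke(rs, args);
            if ("close".equals(method.getName())) {
                OPEN.removeIf(r -> r == proxy); // identity, not equals()
                stmt.close();                   // free the Statement too
            }
            return ret;
        };
        ResultSet proxied = (ResultSet) Proxy.newProxyInstance(
                ResultSet.class.getClassLoader(),
                new Class<?>[]{ResultSet.class}, h);
        OPEN.add(proxied);
        return proxied;
    }

    // Call when the task or worker thread ends, as the safety net.
    public static void closeAll() {
        for (ResultSet rs : OPEN) {
            try { rs.close(); } catch (Exception ignored) { }
        }
        OPEN.clear();
    }

    // Self-contained demo with stub JDBC objects (no real database).
    public static void main(String[] args) throws Exception {
        boolean[] stmtClosed = {false};
        Statement stmt = (Statement) Proxy.newProxyInstance(
                Statement.class.getClassLoader(),
                new Class<?>[]{Statement.class},
                (p, m, a) -> {
                    if ("close".equals(m.getName())) stmtClosed[0] = true;
                    return null;
                });
        ResultSet raw = (ResultSet) Proxy.newProxyInstance(
                ResultSet.class.getClassLoader(),
                new Class<?>[]{ResultSet.class},
                (p, m, a) -> null);
        ResultSet tracked = wrap(stmt, raw);
        tracked.close();
        System.out.println("statement closed: " + stmtClosed[0]);
    }
}
```

The proxy lets callers keep the ResultSet interface they expect while guaranteeing the backing Statement is closed with it; the registry catches anything a task forgets to close.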

qingchengliu commented 3 years ago

@qingchengliu Is the minimal reproduction code ready? @zhoney @lxb1111 please take a look.

@javeme The minimal reproduction code is as follows:

Map<String, Object> properties = GraphUtil.toUnderline(node);
LOGGER.debug("VertexServiceImpl updateOrAdd properties={}", properties);
VertexLogic.checkUpdateProperties(properties);
long start = System.currentTimeMillis();
// Populate properties
Vertex vertex = new Vertex((String) properties.get(T.label));
vertex.id(properties.get(T.id));
VertexLogic.attachProperties(vertex, properties);

// Set to true when the update fails, then fall back to insert
boolean updateFail = false;
try {
    HGClient.writeClient().graph().appendVertexProperty(vertex);
    LOGGER.debug("VertexServiceImpl updateOrAdd update node costTime {}", System.currentTimeMillis() - start);
} catch (com.baidu.hugegraph.exception.ServerException ex) {
    // A 404 with java.util.NoSuchElementException means the vertex does not exist; add it instead
    if (ex.status() == HttpStatus.SC_NOT_FOUND && ex.exception().contains("java.util.NoSuchElementException")) {
        updateFail = true;
    } else {
        // Report a system error
        throw new HugeGraphException(ErrorCode.SYSTEM_ERROR.setFormatMsg(ex.getMessage()));
    }
}

if (updateFail) {
    VertexLogic.checkAddProperties(properties);
    try {
        HGClient.writeClient().graph().addVertex(vertex);
    } catch (com.baidu.hugegraph.exception.ServerException ex) {
        // A 400 means invalid parameters
        if (ex.status() == HttpStatus.SC_BAD_REQUEST && ex.exception().contains("java.lang.IllegalArgumentException")) {
            // Report a parameter error
        } else { // System error
            // Report a system error
        }
    }
}
// Report average latency

Update/write call rate: about 20,000/min, roughly 35 million calls per day (uninterrupted 7*24h).

javeme commented 3 years ago

@qingchengliu The top list doesn't reveal the root cause either. It looks like a jmap heap dump is needed to analyze the object reference chains. We will also try to reproduce the problem.

javeme commented 2 years ago

Close due to the lack of activity, feel free to reopen if not fixed in new version.