brettwooldridge / HikariCP

光 HikariCP・A solid, high-performance, JDBC connection pool at last.
Apache License 2.0
19.98k stars 2.93k forks source link

HikariCP Housekeeper Thread appears to be holding onto a lot of memory (103MB?) #1688

Open frankjkelly opened 3 years ago

frankjkelly commented 3 years ago

We have 750MB of RAM allocated to our Spring App with a pool of 200 Connections against a PostgreSQL DB (10.7)

Java Runtime: OpenJDK Runtime Environment 11.0.6+10-post-Debian-1bpo91 Deployment: CentOS Linux

2020-11-10 19:15:44.556 DEBUG 1 --- [           main] com.zaxxer.hikari.HikariConfig           : HikariPool-1 - configuration:
2020-11-10 19:15:44.557 DEBUG 1 --- [           main] com.zaxxer.hikari.HikariConfig           : allowPoolSuspension.............false
2020-11-10 19:15:44.557 DEBUG 1 --- [           main] com.zaxxer.hikari.HikariConfig           : autoCommit......................true
2020-11-10 19:15:44.557 DEBUG 1 --- [           main] com.zaxxer.hikari.HikariConfig           : catalog.........................none
2020-11-10 19:15:44.557 DEBUG 1 --- [           main] com.zaxxer.hikari.HikariConfig           : connectionInitSql...............none
2020-11-10 19:15:44.637 DEBUG 1 --- [           main] com.zaxxer.hikari.HikariConfig           : connectionTestQuery.............none
2020-11-10 19:15:44.637 DEBUG 1 --- [           main] com.zaxxer.hikari.HikariConfig           : connectionTimeout...............10000
2020-11-10 19:15:44.637 DEBUG 1 --- [           main] com.zaxxer.hikari.HikariConfig           : dataSource......................none
2020-11-10 19:15:44.637 DEBUG 1 --- [           main] com.zaxxer.hikari.HikariConfig           : dataSourceClassName.............none
2020-11-10 19:15:44.637 DEBUG 1 --- [           main] com.zaxxer.hikari.HikariConfig           : dataSourceJNDI..................none
2020-11-10 19:15:44.638 DEBUG 1 --- [           main] com.zaxxer.hikari.HikariConfig           : dataSourceProperties............{password=<masked>}
2020-11-10 19:15:44.638 DEBUG 1 --- [           main] com.zaxxer.hikari.HikariConfig           : driverClassName................."org.postgresql.Driver"
2020-11-10 19:15:44.638 DEBUG 1 --- [           main] com.zaxxer.hikari.HikariConfig           : exceptionOverrideClassName......none
2020-11-10 19:15:44.638 DEBUG 1 --- [           main] com.zaxxer.hikari.HikariConfig           : healthCheckProperties...........{}
2020-11-10 19:15:44.638 DEBUG 1 --- [           main] com.zaxxer.hikari.HikariConfig           : healthCheckRegistry.............none
2020-11-10 19:15:44.638 DEBUG 1 --- [           main] com.zaxxer.hikari.HikariConfig           : idleTimeout.....................600000
2020-11-10 19:15:44.638 DEBUG 1 --- [           main] com.zaxxer.hikari.HikariConfig           : initializationFailTimeout.......1
2020-11-10 19:15:44.638 DEBUG 1 --- [           main] com.zaxxer.hikari.HikariConfig           : isolateInternalQueries..........false
2020-11-10 19:15:44.638 DEBUG 1 --- [           main] com.zaxxer.hikari.HikariConfig           : jdbcUrl.........................jdbc:postgresql://vpc-023f389b92783a7b5-eks-saas-cogito-saas-signal.cluster-c7ikyiogwzaf.us-east-1.rds.amazonaws.com:5432/signal
2020-11-10 19:15:44.638 DEBUG 1 --- [           main] com.zaxxer.hikari.HikariConfig           : leakDetectionThreshold..........15000
2020-11-10 19:15:44.638 DEBUG 1 --- [           main] com.zaxxer.hikari.HikariConfig           : maxLifetime.....................1800000
2020-11-10 19:15:44.638 DEBUG 1 --- [           main] com.zaxxer.hikari.HikariConfig           : maximumPoolSize.................200
2020-11-10 19:15:44.638 DEBUG 1 --- [           main] com.zaxxer.hikari.HikariConfig           : metricRegistry..................none
2020-11-10 19:15:44.638 DEBUG 1 --- [           main] com.zaxxer.hikari.HikariConfig           : metricsTrackerFactory...........none
2020-11-10 19:15:44.639 DEBUG 1 --- [           main] com.zaxxer.hikari.HikariConfig           : minimumIdle.....................200
2020-11-10 19:15:44.639 DEBUG 1 --- [           main] com.zaxxer.hikari.HikariConfig           : password........................<masked>
2020-11-10 19:15:44.639 DEBUG 1 --- [           main] com.zaxxer.hikari.HikariConfig           : poolName........................"HikariPool-1"
2020-11-10 19:15:44.639 DEBUG 1 --- [           main] com.zaxxer.hikari.HikariConfig           : readOnly........................false
2020-11-10 19:15:44.639 DEBUG 1 --- [           main] com.zaxxer.hikari.HikariConfig           : registerMbeans..................false
2020-11-10 19:15:44.639 DEBUG 1 --- [           main] com.zaxxer.hikari.HikariConfig           : scheduledExecutor...............none
2020-11-10 19:15:44.639 DEBUG 1 --- [           main] com.zaxxer.hikari.HikariConfig           : schema..........................none
2020-11-10 19:15:44.639 DEBUG 1 --- [           main] com.zaxxer.hikari.HikariConfig           : threadFactory...................internal
2020-11-10 19:15:44.639 DEBUG 1 --- [           main] com.zaxxer.hikari.HikariConfig           : transactionIsolation............default
2020-11-10 19:15:44.639 DEBUG 1 --- [           main] com.zaxxer.hikari.HikariConfig           : username........................"platform"
2020-11-10 19:15:44.639 DEBUG 1 --- [           main] com.zaxxer.hikari.HikariConfig           : validationTimeout...............5000

Even without any traffic the heap memory increases and increases and eventually dies running out of memory.

After taking a Heap Dump and running it Eclipse Memory Analyzer it implicates Hikari (maybe directly or indirectly?) Saying there is 1 problem suspect consuming > 100MB

400 instances of "sun.security.ssl.SSLSocketImpl", loaded by "<system class loader>" 
occupy 108,131,200 (64.78%) bytes. 

Keywords
sun.security.ssl.SSLSocketImpl

When I click through to the Threads it shows

Class Name                                                                                                     | Shallow Heap | Retained Heap
----------------------------------------------------------------------------------------------------------------------------------------------
java.lang.Thread @ 0xe1371558  HikariPool-1 housekeeper Thread                                                 |          120 |           592
|- <class> class java.lang.Thread @ 0xe0135bc0 System Class                                                    |           40 |           472
|- contextClassLoader jdk.internal.loader.ClassLoaders$AppClassLoader @ 0xe011c2f0                             |          104 |     2,607,728
|- group java.lang.ThreadGroup @ 0xe0138750  main JNI Global                                                   |           48 |         1,440
|- <Java Local> java.lang.Thread @ 0xe1371558  HikariPool-1 housekeeper Thread                                 |          120 |           592
|- <Java Local>, parkBlocker java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject @ 0xe13716d0|           24 |            24
|- <Java Local> java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue @ 0xe13716e8                 |           32 |         1,136
|- <Java Local> java.util.concurrent.locks.ReentrantLock @ 0xe1371708                                          |           16 |            16
|- <Java Local> java.util.concurrent.ScheduledThreadPoolExecutor @ 0xe1371718                                  |           80 |           384
|- <Java Local>, target java.util.concurrent.ThreadPoolExecutor$Worker @ 0xe1371768                            |           48 |            48
|- inheritedAccessControlContext java.security.AccessControlContext @ 0xe13a3868                               |           40 |            88
|- name java.lang.String @ 0xe13bd118  HikariPool-1 housekeeper                                                |           24 |            64
|- blockerLock java.lang.Object @ 0xe13bd130                                                                   |           16 |            16
|- threadLocals java.lang.ThreadLocal$ThreadLocalMap @ 0xe14d9980                                              |           24 |           368
|- <Java Local> java.util.concurrent.locks.AbstractQueuedSynchronizer$Node @ 0xea238e50                        |           32 |            32
'- Total: 14 entries                                                                                           |              |              
----------------------------------------------------------------------------------------------------------------------------------------------

plus also

Class Name                                                        | Shallow Heap | Retained Heap
-------------------------------------------------------------------------------------------------
java.lang.ref.Finalizer @ 0xe61a08c8 JNI Global                   |           40 |        64,664
|- <class> class java.lang.ref.Finalizer @ 0xe0135ad0 System Class|           16 |            32
|- queue java.lang.ref.ReferenceQueue @ 0xe0137928                |           32 |            32
|- next java.lang.ref.Finalizer @ 0xe2e91038                      |           40 |         1,800
|- next java.lang.ref.Finalizer @ 0xe61a08c8 JNI Global           |           40 |        64,664
|- prev, discovered java.lang.ref.Finalizer @ 0xe61a3450          |           40 |            40
|- referent org.postgresql.jdbc.PgConnection @ 0xe61a4d80         |          128 |        62,824
-------------------------------------------------------------------------------------------------

Hope that helps. . . it could possibly be the "leak" is downstream e.g. postgresql driver or Java JRE but I'd appreciate any help. Thanks!

frankjkelly commented 3 years ago

Update with a little more data. Let the app run for a few hours then looked at the Histogram

The object with the largest heap was this

Class Name              | Objects | Shallow Heap
-------------------------------------------------
java.util.HashMap$Node[]|  10,446 |  107,051,112
-------------------------------------------------

And then when using MAT I do path2gc I get

Class Name                                                                                                                    | Shallow Heap | Retained Heap
-------------------------------------------------------------------------------------------------------------------------------------------------------------
java.util.HashMap$Node[32768] @ 0xeaf6ed48                                                                                    |      131,088 |       131,088
'- table java.util.LinkedHashMap @ 0xea4b8288                                                                                 |           56 |       131,232
   '- cacheMap sun.security.util.MemoryCache @ 0xea1f6820                                                                     |           32 |       131,312
      '- sessionCache sun.security.ssl.SSLSessionContextImpl @ 0xea022228                                                     |           32 |       262,800
         '- clientCache sun.security.ssl.SSLContextImpl$TLSContext @ 0xe9cf49d0                                               |           48 |           704
            '- sslContext sun.security.ssl.SSLSocketImpl @ 0xe9b5d168                                                         |           64 |       270,328
               '- connection org.postgresql.core.PGStream @ 0xe9afacf0                                                        |           88 |        35,456
                  '- pgStream org.postgresql.core.v3.QueryExecutorImpl @ 0xe9ad2720                                           |          216 |        40,984
                     '- queryExecutor org.postgresql.jdbc.PgConnection @ 0xe9ac6850                                           |          128 |        62,304
                        '- connection com.zaxxer.hikari.pool.PoolEntry @ 0xe9ac1300                                           |           56 |           160
                           |- arg$2 com.zaxxer.hikari.pool.HikariPool$$Lambda$744 @ 0xe9ac45a0                                |           24 |            24
                           |  '- task java.util.concurrent.Executors$RunnableAdapter @ 0xe9ac0010                             |           24 |            48
                           |     '- callable java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask @ 0xe9ab4200|           72 |           120
                           |        '- [94] java.util.concurrent.RunnableScheduledFuture[271] @ 0xe1f62120                    |        1,104 |         1,104
                           |           '- queue java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue @ 0xe1318f78|           32 |         1,136
                           |              |- <Java Local> java.lang.Thread @ 0xe1318de8  HikariPool-1 housekeeper Thread      |          120 |           592
                           |              |- workQueue java.util.concurrent.ScheduledThreadPoolExecutor @ 0xe1318fa8          |           80 |           384
-------------------------------------------------------------------------------------------------------------------------------------------------------------

And there are over 10k of these at ~128kB each

frankjkelly commented 3 years ago

Turns out that we were running it in Kubernetes with 1 vcpu and 1 GB of RAM assigned to the container and Java (in all it's wisdom) decided to use the SerialGC collector. Once we forced it to use G1GC collector this problem went away. I would have presumed that SerialGC would have operated mostly the same and certainly that it would be result in an OOMKilled error even if was a sub-optimal collector. Anyway feel free to leave this open for investigation if of interest to debug further or close. Thanks!

nunnrlc commented 3 years ago

Appreciate the follow-up on how you fixed it! I'm quite impressed that changing the GC actually made a difference, maybe Java 11 improvements - if only it picked the best option itself, as you said...

Nevertheless, it should be noted 1 VCPU and 1GB RAM is likely not enough for 200 DB connections and the kind of work level that might need 200 connections! The docs recommend 3 connections for a typical single CPU server. You might significantly gain performance by reducing the number of connections.

https://github.com/brettwooldridge/HikariCP/wiki/About-Pool-Sizing#the-formula