Netflix / atlas

In-memory dimensional time series database.

Metrics delay #364

Closed. drax68 closed this issue 8 years ago.

drax68 commented 8 years ago

Hi, we noticed some interesting Atlas behaviour: after some period of time we observe an increasing delay for new metrics; see the attached screenshot (atlas).

Heap consumption also never stops growing. Current configuration (atlas 1.5.0-rc5):

atlas {

  core {
    model {
      step = 10s
    }

    db {
      class = "com.netflix.atlas.core.db.MemoryDatabase"
      //class = "com.netflix.atlas.core.db.StaticDatabase"

      // How often to rebuild the index for the memory database
      rebuild-frequency = 30s

      // 1h with 10s step
      block-size = 360

      // 25h of data overall
      num-blocks = 25
    }
  }

  webapi {
    publish {
      // Validation rules to apply before accepting input data
      rules = [
        {
          class = "com.netflix.atlas.core.validation.ValueLengthRule"
          min-length = 1
          max-length = 180
        }
      ]
    }
    main {
      port = 7101
    }

    graph {
      // Change default start time on graph to smaller range more typical for testing
      start-time = e-30m
    }
  }

  akka {
    api-endpoints = ${?atlas.akka.api-endpoints} [
      "com.netflix.atlas.webapi.PublishApi"
    ]
  }
}
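
(For reference, the retention arithmetic behind the comments above: with step = 10s, a 360-slot block covers 360 × 10s = 1h of data, and 25 such blocks give the 25h of data overall.)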

JVM options:

    -Dfile.encoding=UTF-8 \
    -Duser.timezone=UTC \
    -server \
    -Djava.net.preferIPv4Stack=true \
    -Dspray.can.parsing.max-uri-length=16k \
    -XX:+UseParNewGC \
    -XX:+UseConcMarkSweepGC \
    -XX:CMSInitiatingOccupancyFraction=75 \
    -XX:+UseCMSInitiatingOccupancyOnly \
    -XX:+DisableExplicitGC \
    -XX:+PrintGC \
    -XX:+PrintGCDetails \
    -XX:+PrintGCTimeStamps \
    -XX:GCLogFileSize=20M \
    -XX:NumberOfGCLogFiles=15 \
    -XX:+UseGCLogFileRotation \
    -XX:+PrintGCDateStamps \
    -XX:+PrintPromotionFailure \
    -XX:-ReduceInitialCardMarks

What can we check to debug this issue?

brharrington commented 8 years ago

What do the index logs show? There should be logs that look like:

2016-05-31T08:16:00.259 INFO  [MemoryDatabaseRebuildIndex] com.netflix.atlas.core.db.MemoryDatabase: rebuilding metadata index
2016-05-31T08:16:00.260 INFO  [MemoryDatabaseRebuildIndex] com.netflix.atlas.core.db.MemoryDatabase: done rebuilding metadata index, 2 metrics

These will tell you how long it is taking to index the content and the number of distinct time series in the index.
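
For example, something along these lines will surface the recent rebuild entries (assuming the log goes to a file; atlas.log here is just a placeholder name):

    # list recent index rebuild entries; the timestamp gap between a
    # "rebuilding" line and its matching "done rebuilding" line is the
    # indexing time, and the trailing count is the number of series
    grep 'metadata index' atlas.log | tail -n 20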

A couple of other things to note:

  1. We typically don't use CMS because it is prone to fragmentation problems. Right now we run with G1 (see the flag sketch after this list).
  2. Tuning is sensitive to the data characteristics.
    • How many distinct strings are there? We intern strings to reduce the memory cost of duplication, but you need to be mindful of the string table size, e.g., -XX:StringTableSize=1000003.
    • If you do a heap dump, what objects are taking up the most memory? For example, something like jmap -histo:live <pid> | head -n50 would help.
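
As a rough sketch, the CMS-related flags above would be swapped for something like the following (illustrative starting points, not tuned recommendations):

    -XX:+UseG1GC                       # G1 in place of the CMS flag set
    -XX:StringTableSize=1000003        # a prime, sized well above the expected number of distinct strings
    -XX:+PrintStringTableStatistics    # if your JDK supports it, prints intern table stats on exit

The histogram mentioned above can then be captured with:

    jmap -histo:live <pid> | head -n50
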
drax68 commented 8 years ago

Checked the logs: a rebuild takes up to 60 seconds for 1340674 metrics. Tried G1: it generally looks better, but after 25 hours it ends up with hours of delay and most of the heap utilized.

Heap Configuration:
   MinHeapFreeRatio         = 40
   MaxHeapFreeRatio         = 70
   MaxHeapSize              = 115813122048 (110448.0MB)
   NewSize                  = 1363144 (1.2999954223632812MB)
   MaxNewSize               = 69474451456 (66256.0MB)
   OldSize                  = 5452592 (5.1999969482421875MB)
   NewRatio                 = 2
   SurvivorRatio            = 8
   MetaspaceSize            = 21807104 (20.796875MB)
   CompressedClassSpaceSize = 1073741824 (1024.0MB)
   MaxMetaspaceSize         = 17592186044415 MB
   G1HeapRegionSize         = 16777216 (16.0MB)

Heap Usage:
G1 Heap:
   regions  = 6903
   capacity = 115813122048 (110448.0MB)
   used     = 81499614872 (77724.08950042725MB)
   free     = 34313507176 (32723.910499572754MB)
   70.37165860896282% used
G1 Young Generation:
Eden Space:
   regions  = 82
   capacity = 4177526784 (3984.0MB)
   used     = 1375731712 (1312.0MB)
   free     = 2801795072 (2672.0MB)
   32.93172690763052% used
Survivor Space:
   regions  = 21
   capacity = 352321536 (336.0MB)
   used     = 352321536 (336.0MB)
   free     = 0 (0.0MB)
   100.0% used
G1 Old Generation:
   regions  = 4762
   capacity = 82023809024 (78224.0MB)
   used     = 79771561624 (76076.08950042725MB)
   free     = 2252247400 (2147.910499572754MB)
   97.25415409647582% used

154753 interned Strings occupying 35269128 bytes.

[screenshot: atlas2]

jmap -histo:live is not able to connect to the service, so trying jmap -histo -F instead.

drax68 commented 8 years ago

Object Histogram:

num     #instances         #bytes   Class description
--------------------------------------------------------------------------
  1:      13735175    18898908672   double[]
  2:      21180760      677784320   java.util.concurrent.LinkedBlockingQueue$Node
  3:       1735448      611425608   java.lang.Object[]
  4:      14783706      591348240   com.netflix.atlas.core.model.ConstantBlock
  5:       5146349      591195064   char[]
  6:       7649875      367194000   com.netflix.atlas.core.model.SparseBlock
  7:       1340683      300312992   com.netflix.atlas.core.model.Block[]
  8:       6085267      243410680   com.netflix.atlas.core.model.ArrayBlock
  9:        141281      218798360   short[]
 10:           454      189877912   long[]
 11:       3950949      189645552   java.util.HashMap$Node
 12:       4728340      151306880   java.lang.String
 13:       3100611      148829328   com.netflix.spectator.api.TagList
 14:       2999382      143970336   java.lang.ref.WeakReference
 15:       2999184      143960832   com.netflix.spectator.api.ObjectGauge
 16:       2783140      133590720   java.math.BigInteger
 17:       2072099      132614336   java.nio.StringCharBuffer
 18:       3220653      103060896   java.util.concurrent.ConcurrentLinkedQueue$Node
 19:       3099824       99194368   com.netflix.spectator.api.DefaultId
 20:       2478879       99155160   spray.routing.RequestContext
 21:       1365951       87420864   java.util.LinkedHashMap$Entry
 22:       1340683       85803712   com.netflix.atlas.core.db.MemoryBlockStore
 23:            69       80330528   com.netflix.atlas.core.db.BlockStoreItem[]
 24:       1577770       75732960   java.util.concurrent.ConcurrentHashMap$Node
 25:       1272530       71261680   com.netflix.atlas.core.norm.NormalizeValueFunction
 26:       1441759       69204432   com.netflix.atlas.core.db.BlockStoreItem
 27:            71       67998608   scala.collection.immutable.Map[]
 28:        752050       63781168   java.lang.String[]
 29:       1976546       63249472   scala.Tuple2
 30:       1381050       55242000   com.netflix.atlas.core.model.BasicTaggedItem
 31:       1115383       53538384   spray.routing.RequestContext$$anon$3
 32:       1114800       53510400   spray.routing.RequestContext$$anon$1
 33:       1329860       53194400   com.netflix.atlas.core.norm.NormalizationCache$CacheValue
 34:       2048757       49170168   scala.runtime.ObjectRef
 35:       1382814       44250048   com.netflix.atlas.core.util.SmallHashMap
 36:       1329860       42555520   com.netflix.atlas.core.norm.UpdateValueFunction
 37:            99       41781072   java.math.BigInteger[]
 38:        416883       30015576   gnu.trove.set.hash.TIntHashSet
 39:          8711       28137480   java.util.HashMap$Node[]
 40:        123921       27758304   java.text.DecimalFormat
 41:        691954       27678160   com.netflix.atlas.core.index.SortedArraySet
 42:        688248       27529920   com.netflix.atlas.core.model.Tag

drax68 commented 8 years ago

@brharrington, anything else to check?

brharrington commented 8 years ago

Somehow I missed the previous update. I'll take a look this weekend.

drax68 commented 8 years ago

@brharrington, any update on this issue? We still suffer from it on a daily basis.

drax68 commented 8 years ago

Anything new regarding this issue?

drax68 commented 8 years ago

The issue was fixed after multiple memory-related commits. Thank you, @brharrington.