That histogram is odd; there are entries in the middle that are larger than the total at the end:
$ cat ~/Downloads/atlas_hist.txt | awk '$3 > 1e9 { print $0 }'
Attaching to process ID 125769, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.111-b14
1: 41582635 1995966480 com.netflix.atlas.core.model.SparseBlock
2: 54943829 1758202528 java.util.concurrent.LinkedBlockingQueue$Node
3: 7209379 1614900896 com.netflix.atlas.core.model.Block[]
4: 180316237 5770119584 scala.collection.immutable.$colon$colon
5: 26011471 1040458840 com.netflix.atlas.core.model.ArrayBlock
3966: 97716612 16607411792 char[]
3967: 91978702 3679148080 com.netflix.atlas.core.model.ConstantBlock
3968: 97398362 3116747584 java.lang.String
3969: 15836174 3090900488 int[]
3970: 41864518 28828263048 byte[]
3971: 7584856 3005964040 java.lang.Object[]
3972: 70755286 2830211440 com.netflix.atlas.core.model.Datapoint
3973: 67594139 15707960408 double[]
3974: 86086260 2754760320 com.netflix.atlas.core.validation.ValidationResult$Fail
Total : 1057826155 1677395096
Based on what I'm seeing at the end, it looks like data is backing up on the publishing side until it falls over. The ValidationResult$Fail entries imply that a lot of results are failing validation. Maybe that is causing it to go slower. Can you look at the responses and summarize what the validation failures are?
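If it helps, here is a rough way to do that summary once the failure messages have been pulled out of the responses. Everything below is hypothetical: it assumes the messages have already been extracted (from the publish responses or debug logs) into a text file, one message per line.

    import scala.io.Source

    // Hypothetical helper: reads one validation failure message per line and
    // prints the most common messages with their counts.
    object SummarizeFailures {
      def main(args: Array[String]): Unit = {
        val src = Source.fromFile(args(0))
        try {
          val counts = src.getLines()
            .map(_.trim)
            .filter(_.nonEmpty)
            .toList
            .groupBy(identity)
            .map { case (msg, occurrences) => msg -> occurrences.size }
            .toList
            .sortBy(-_._2)
          counts.take(20).foreach { case (msg, n) => println(f"$n%8d  $msg") }
        } finally src.close()
      }
    }

Run against a day's worth of responses, the top few messages should show whether the failures are dominated by a single rule or spread across many.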
We hit an issue with memory leaks after introducing an nginx balancer with the mirror module
Did anything change other than that? Was it working fine with the same data prior to introducing nginx? Off the top of my head I don't know why the nginx layer would make a difference.
Let me create a new histogram. Regarding changes: nothing else was changed, except for the nginx layer in front of the Atlas instances. We even moved all automated queries to a dedicated Atlas instance with short retention, but the main Atlas cluster is still timing out on publishing and leaking memory over time.
Here is what we see in nginx during every metadata rebuild; these are all 502 errors.
Regarding validation failures, will
<Logger name="com.netflix.spectator.sandbox.HttpLogEntry" level="debug"/>
be enough to catch them?
The ValidationResult$Fail entries imply that a lot of results are failing validation. Maybe that is causing it to go slower. Can you look at the responses and summarize what the validation failures are?
We could have datapoints in the payload which exceed the age limit. Here, for each failed datapoint it creates a ValidationResult and appends it to the response payload. Could that be the case?
If so, then maybe Atlas could have an option to fail fast with a simple response.
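As a rough illustration of the two behaviours being discussed (the Datapoint and ValidationResult types and the age limit below are illustrative stand-ins, not Atlas's real classes):

    // Simplified sketch of the two validation strategies; not Atlas's actual code.
    final case class Datapoint(name: String, ageSeconds: Long)

    sealed trait ValidationResult
    case object Pass extends ValidationResult
    final case class Fail(reason: String) extends ValidationResult

    object ValidationModes {
      private val maxAgeSeconds = 3600L  // hypothetical age limit

      private def validate(d: Datapoint): ValidationResult =
        if (d.ageSeconds > maxAgeSeconds) Fail(s"datapoint ${d.name} exceeds age limit") else Pass

      // Behaviour described above: one Fail per bad datapoint, all accumulated
      // into the response payload, so a large stale batch means a large response.
      def validateAll(batch: Seq[Datapoint]): Seq[ValidationResult] =
        batch.map(validate)

      // Proposed fail-fast option: stop at the first failure, return a short summary.
      def failFast(batch: Seq[Datapoint]): Either[String, Unit] =
        batch.iterator.map(validate).collectFirst { case Fail(reason) => reason } match {
          case Some(reason) => Left(s"rejected: $reason")
          case None         => Right(())
        }
    }

The fail-fast variant keeps the response small, at the cost of not telling the sender which other datapoints in the batch were bad.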
Adding a fresh histogram for the 100 GB heap: atlas_hist.txt
If so, then maybe Atlas could have an option to fail fast with a simple response.
Maybe, though it isn't clear that is the cause of the issue. If the failure is the data being too old, then it could be partly a symptom, assuming the backlog is delaying processing.
The total for the new histogram is negative. Looks like that is a known bug that they do not intend to fix:
https://bugs.openjdk.java.net/browse/JDK-7012905 https://bugs.openjdk.java.net/browse/JDK-6539434
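For what it's worth, a minimal illustration of how a total can come out negative, assuming the total is accumulated in 32-bit arithmetic somewhere (which is what the linked bugs suggest), reusing a few of the byte counts from the histogram above:

    // Minimal illustration (not jmap's actual code): summing multi-gigabyte byte
    // counts in 32-bit arithmetic wraps around, which is how a "total" can end up
    // negative while the individual entries look sane.
    object HistoTotalOverflow {
      def main(args: Array[String]): Unit = {
        // A few of the larger per-class byte counts from the histogram above.
        val bytes = Seq(28828263048L, 16607411792L, 15707960408L, 5770119584L)

        val totalLong = bytes.sum                                      // 66913754832
        val totalInt  = bytes.foldLeft(0)((acc, b) => acc + b.toInt)   // truncates and wraps

        println(s"64-bit total: $totalLong")
        println(s"32-bit total: $totalInt")   // comes out negative
      }
    }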
Did you run -histo:live or just -histo?
From the dumps my best guess is that it is getting overwhelmed by the amount of data coming in and handling it badly. I'll try to set up a test and reproduce.
-histo:live wasn't able to connect to the JVM process, so I used -F -histo.
@brharrington Anything new about this? Our production Atlas is crippled due to this.
I have some tests set up and running; I should have some results in a few days.
@brharrington Thank you very much for your effort! Let us know if you need more data to help you with this.
@brharrington It seems any non-default value for Akka's max-connections causes a memory leak after a short uptime. Can you confirm that 100 requests/second is not problematic for the Atlas /publish endpoint?
any non-default value for Akka's max-connections causes a memory leak after a short uptime.
Interesting. Did you change that setting when you started using nginx?
Can you confirm that 100 requests/second is not problematic for the Atlas /publish endpoint?
As always it depends a bit on data characteristics, validation settings, and overall load on the system. I can say we run some clusters with a higher rate.
We don't typically run with heaps larger than around 60g because the GC pauses tend to become problematic. That said, it has been a while since we have stress tested that, but we are starting to look at it again. So in short your configuration is a bit different than what we run, but doesn't seem unreasonable.
@drax68 ^^^
Our publishing traffic is actually 10-15 requests/sec, not 100 requests/sec.
Interesting. Did you change that setting when you started using nginx?
The property was changed more than a year ago; Atlas was working fine all that time.
The latest 1.6.0 rc looks good in terms of memory consumption, but we observe various timeouts on publishing/querying. Is this something known to you? Tried it with akka.http.host-connection-pool.pool-implementation=new and akka.http.server.linger-timeout=infinite.
Almost 50% of requests are timing out during every metadata index rebuild. Some kind of thread starvation?
akka.http.host-connection-pool.pool-implementation=new
That is a client setting, so it shouldn't impact the behavior of the server.
Almost 50% of requests are timing out during every metadata index rebuild. Some kind of thread starvation?
There is an isolated thread for building the index. If I had to guess, you are probably seeing high GC pauses when the index is rebuilt. Are you tracking that either with GC logs or something like spectator-ext-gc?
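For reference, a crude in-process way to track GC time, built on the standard GarbageCollectorMXBean API; this is only a stand-in for GC logs or spectator-ext-gc, and the 10-second window is arbitrary:

    import java.lang.management.ManagementFactory
    import scala.collection.JavaConverters._

    // Tiny in-process GC monitor: prints the GC time accumulated by each
    // collector over every 10-second window.
    object GcWatcher {
      def main(args: Array[String]): Unit = {
        val beans = ManagementFactory.getGarbageCollectorMXBeans.asScala
        var last = beans.map(b => b.getName -> b.getCollectionTime).toMap
        while (true) {
          Thread.sleep(10000L)
          beans.foreach { b =>
            val delta = b.getCollectionTime - last(b.getName)
            if (delta > 0) println(s"${b.getName}: $delta ms of GC time in the last 10s")
          }
          last = beans.map(b => b.getName -> b.getCollectionTime).toMap
        }
      }
    }

Cumulative counters like these only show totals per window; per-pause detail, which is what verbose GC logs or spectator-ext-gc provide, is more useful for correlating with the timeouts.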
I have verbose GC logs enabled. GC pauses during the rebuild process are up to 7.5s.
@brharrington Anything new about this? We still have these issues.
Just to be clear, the issues are the timeouts mentioned in https://github.com/Netflix/atlas/issues/737#issuecomment-351705580? Do the long GC pauses correspond with the timeouts?
I would guess it is a bit overloaded and we'll need to look at making the index rebuilds cheaper. This is something we are interested in as well and I hope to get to it soon. Right now, though, I'm looking at some client-side performance regressions that came up after meltdown remediation was applied. So it will probably be a week or so before I can get back to looking at the indexing.
We hit an issue with memory leaks after introducing an nginx balancer with the mirror module in front of 3 Atlas instances. Reproducible both for 1.5.3 and for 1.6.0.rc.8. Tested on JVM 8u111 and 8u151. Atlas is running on r3.4xlarge instances.
The total number of metrics reported during a rebuild is ~8M. A rebuild takes around 7 minutes, and during that period nginx receives timeouts from the Atlas publish backend. Switching to thread-pool-executor improved performance and slowed the memory leak, but Atlas still dies of OOM after 1-2 days of uptime, even with 12h metrics retention (usually it's 25h). Our publishing rate is ~100 rps, query rate ~10-50 rps.
Atlas config:
JVM options:
Nginx config (nginx 1.13.6 with ngx_devel_kit-0.3.0, lua-nginx-module-0.10.10, lua-upstream-nginx-module-0.07):
Atlas jmap histogram: