Netflix / atlas

In-memory dimensional time series database.
Apache License 2.0

Memory leak under heavy load #737

Closed drax68 closed 8 months ago

drax68 commented 7 years ago

We hit an issue with memory leaks after introducing an nginx balancer with the mirror module in front of 3 Atlas instances. Reproducible both with 1.5.3 and with 1.6.0.rc.8. Tested on JVM 8u111 and 8u151. Atlas is running on r3.4xlarge instances.

Total number of metrics reported during rebuild: ~8M. The rebuild takes around 7 minutes, and during that period nginx receives timeouts from the Atlas publish backend. Switching to thread-pool-executor improved performance and slowed the memory leak, but Atlas still dies with an OOM after 1-2 days of uptime even with 12h metrics retention (usually it's 25h). Our publishing rate is ~100 rps, query rate ~10-50 rps.

Atlas config

Jvm options:

    -XX:+PreserveFramePointer \
    -Xss1024k \
    -XX:MetaspaceSize=256m \
    -XX:MaxMetaspaceSize=256m \
    -XX:StringTableSize=20000003 \
    -Xms98362m \
    -Xmx98362m \
    -Dfile.encoding=UTF-8 \
    -Duser.timezone=UTC \
    -server \
    -Djava.net.preferIPv4Stack=true \
    -Dakka.http.parsing.max-uri-length=16k \
    -Dakka.http.host-connector.max-connections=32000 \
    -Dakka.http.server.request-timeout=40s \
    -Dakka.http.server.registration-timeout=30s \
    -Dakka.http.server.idle-timeout=60s \
    -Dakka.actor.default-dispatcher.executor=thread-pool-executor \
    -Dakka.actor.default-dispatcher.thread-pool-executor.core-pool-size-max=1024 \
    -Dakka.actor.default-dispatcher.thread-pool-executor.max-pool-size-max=1024 \
    -Dakka.actor.default-dispatcher.thread-pool-executor.core-pool-size-min=256 \
    -Dakka.actor.default-dispatcher.thread-pool-executor.max-pool-size-min=256 \
    -Dakka.actor.default-dispatcher.fork-join-executor.parallelism-factor=32 \
    -Dakka.actor.default-dispatcher.fork-join-executor.parallelism-max=128 \
    -Dakka.actor.default-dispatcher.throughput=100 \
    -XX:+UseG1GC \
    -XX:G1HeapRegionSize=32m \
    -XX:+DisableExplicitGC \
    -Dcom.sun.management.jmxremote.port=8101 \
    -Dcom.sun.management.jmxremote.local.only=false \
    -Dcom.sun.management.jmxremote.authenticate=false \
    -Dcom.sun.management.jmxremote.ssl=false \
    -XX:+PrintGC \
    -XX:+PrintGCDetails \
    -XX:+PrintGCTimeStamps \
    -XX:GCLogFileSize=5M \
    -XX:NumberOfGCLogFiles=5 \
    -XX:+UseGCLogFileRotation \
    -XX:+PrintGCDateStamps \
    -XX:+PrintPromotionFailure

Nginx config (nginx 1.13.6 with ngx_devel_kit-0.3.0, lua-nginx-module-0.10.10, lua-upstream-nginx-module-0.07):

    lua_package_path "/etc/nginx/?.lua;;";

    lua_shared_dict healthcheck 1m;

    lua_socket_log_errors off;

    init_worker_by_lua_block {
        local hc = require "healthcheck"
        local ok, err = hc.spawn_checker{
            shm = "healthcheck",  -- defined by "lua_shared_dict"
            upstream = "atlas",  -- defined by "upstream"
            type = "http",

            http_req = "GET /api/v1/graph?q=name,http.server.requestRate,:eq,:max&s=e-10m&e=now-3h&tz=UTC&format=json HTTP/1.0\r\nHost: atlas\r\n\r\n",
                    -- raw HTTP request for checking

            interval = 30000,  -- run the check cycle every 30 sec
            timeout = 25000,   -- 25 sec timeout for network operations
            fall = 3,  -- # of successive failures before turning a peer down
            rise = 1,  -- # of successive successes before turning a peer up
            valid_statuses = {200, 302},  -- a list of valid HTTP status codes
            match_string = "http.server.requestRate",
            concurrency = 1,  -- concurrency level for test requests
        }

        if not ok then
            ngx.log(ngx.ERR, "failed to spawn health checker: ", err)
            return
        end

        -- Just call hc.spawn_checker() for more times here if you have
        -- more upstream groups to monitor. One call for one upstream group.
        -- They can all share the same shm zone without conflicts but they
        -- need a bigger shm zone for obvious reasons.
    }

large_client_header_buffers 8 16k;

log_format atlas_combined '$remote_addr - $remote_user [$time_local]  '
    '"$request" $status $body_bytes_sent '
    '"$http_referer" "$http_user_agent" '
    '$server_port $msec $upstream_addr $request_time';

upstream atlas {
    server 1.2.3.1:7101 max_fails=0;
    keepalive 100;
}

upstream atlas-publish {
    server 1.2.3.1:7101 max_fails=0;
    keepalive 100;
}

upstream atlas-publish-mirror1 {
        server 1.2.3.2:7101 max_fails=0;
        keepalive 10;
}
upstream atlas-publish-mirror2 {
        server 1.2.3.3:7101 max_fails=0;
        keepalive 10;
}

server {
    listen 80 default;
    listen 7101;
    ignore_invalid_headers off;
    client_max_body_size 256m;
    real_ip_header     X-Forwarded-For;
    set_real_ip_from   10.0.0.0/8;
    set_real_ip_from   172.0.0.0/8;
    set_real_ip_from   192.168.0.0/16;

    proxy_next_upstream timeout;

    location = /status {
        access_log off;

        default_type text/plain;
        content_by_lua_block {
            local hc = require "healthcheck"
            ngx.say("Nginx Worker PID: ", ngx.worker.pid())
            ngx.print(hc.status_page())
        }
    }
    location /api/v1/publish {
        access_log /var/log/nginx/atlas-publish.access.log atlas_combined;
        error_log /var/log/nginx/atlas-publish.error.log;
        keepalive_timeout 60;
        proxy_read_timeout 600;
        proxy_connect_timeout 600;
        proxy_send_timeout 600;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_pass http://atlas-publish;
        mirror "/mirror1";
        mirror "/mirror2";
    }
    location / {
        access_log /var/log/nginx/atlas.access.log atlas_combined;
        error_log /var/log/nginx/atlas.error.log;
        keepalive_timeout 60;
        proxy_read_timeout 600;
        proxy_connect_timeout 600;
        proxy_send_timeout 600;
        proxy_set_header User-Agent "atlas-balancer";
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_pass http://atlas;
    }

    location /mirror1 {
        internal;
        access_log /var/log/nginx/atlas-mirror.access.log atlas_combined;
        error_log /var/log/nginx/atlas-publish.error.log;
        proxy_read_timeout 10;
        proxy_connect_timeout 10;
        proxy_send_timeout 10;
        keepalive_timeout 60;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_pass "http://atlas-publish-mirror1$request_uri";
    }
    location /mirror2 {
        internal;
        access_log /var/log/nginx/atlas-mirror.access.log atlas_combined;
        error_log /var/log/nginx/atlas-publish.error.log;
        proxy_read_timeout 10;
        proxy_connect_timeout 10;
        proxy_send_timeout 10;
        keepalive_timeout 60;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_pass "http://atlas-publish-mirror2$request_uri";
    }
}

Atlas jmap histogram

brharrington commented 7 years ago

That histogram is odd, there are entries in the middle that are larger than the total at the end:

$ cat ~/Downloads/atlas_hist.txt  | awk '$3 > 1e9 { print $0 }'
Attaching to process ID 125769, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.111-b14
    1:     41582635    1995966480  com.netflix.atlas.core.model.SparseBlock
    2:     54943829    1758202528  java.util.concurrent.LinkedBlockingQueue$Node
    3:      7209379    1614900896  com.netflix.atlas.core.model.Block[]
    4:    180316237    5770119584  scala.collection.immutable.$colon$colon
    5:     26011471    1040458840  com.netflix.atlas.core.model.ArrayBlock
 3966:     97716612   16607411792  char[]
 3967:     91978702    3679148080  com.netflix.atlas.core.model.ConstantBlock
 3968:     97398362    3116747584  java.lang.String
 3969:     15836174    3090900488  int[]
 3970:     41864518   28828263048  byte[]
 3971:      7584856    3005964040  java.lang.Object[]
 3972:     70755286    2830211440  com.netflix.atlas.core.model.Datapoint
 3973:     67594139   15707960408  double[]
 3974:     86086260    2754760320  com.netflix.atlas.core.validation.ValidationResult$Fail
Total:   1057826155    1677395096

Based on what I'm seeing at the end, it looks like data is backing up on the publishing side until it falls over. The ValidationResult$Fail entries imply that there are a lot of results failing validation. Maybe that is causing it to go slower. Can you look at the responses and summarize what the validation failures are?
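
For reference, one way to see the per-datapoint validation messages is to publish a small payload by hand and inspect the response body. A sketch, assuming the standard Atlas publish payload shape; the /api/v1/publish path and port 7101 come from the nginx config above, while the tag values are placeholders:

    # Publish a single datapoint and print the full response, including any
    # validation failure messages in the body. Tag values are placeholders.
    curl -s -i http://localhost:7101/api/v1/publish \
      -H 'Content-Type: application/json' \
      -d "{
            \"metrics\": [
              {
                \"tags\": {\"name\": \"test.metric\", \"node\": \"i-12345\"},
                \"timestamp\": $(date +%s000),
                \"value\": 42.0
              }
            ]
          }"

Substituting an intentionally old timestamp is a quick way to see what an age-related rejection looks like in the response.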

brharrington commented 7 years ago

We hit an issue with memory leaks after introducing an nginx balancer with the mirror module

Did anything change other than that? Was it working fine with the same data prior to introducing nginx? Off the top of my head I don't know why the nginx layer would make a difference.

drax68 commented 7 years ago

Let me create a new histogram. Regarding changes: nothing else was changed except the nginx layer in front of the Atlas instances. We even moved all automated queries to a dedicated Atlas instance with short retention, but the main Atlas cluster is still timing out on publishing and leaking memory over time.

Here is what we see in nginx during every metadata rebuild (these are all 502 errors).

Regarding validation failures: will <Logger name="com.netflix.spectator.sandbox.HttpLogEntry" level="debug"/> be enough to catch them?
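
For reference, a minimal sketch of where that logger element would sit, assuming a log4j2-style configuration (the capitalized <Logger> element suggests that backend); the appender and pattern are placeholders:

    <!-- Only the HttpLogEntry <Logger> line is from the comment above;
         everything else is a placeholder to make the sketch self-contained. -->
    <Configuration status="warn">
      <Appenders>
        <Console name="Console" target="SYSTEM_OUT">
          <PatternLayout pattern="%d{ISO8601} %-5level %logger{36} - %msg%n"/>
        </Console>
      </Appenders>
      <Loggers>
        <Logger name="com.netflix.spectator.sandbox.HttpLogEntry" level="debug"/>
        <Root level="info">
          <AppenderRef ref="Console"/>
        </Root>
      </Loggers>
    </Configuration>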

krutsko commented 6 years ago

The ValidationResult$Fail entries imply that there are a lot of results failing validation. Maybe that is causing it to go slower. Can you look at the responses and summarize what the validation failures are?

We could have datapoints in the payload which exceed the age limit. Here, for each failed datapoint, it creates a ValidationResult and appends it to the response payload. Could that be the case?

If so, then maybe Atlas could have an option to fail fast with a simple response.

drax68 commented 6 years ago

Adding a fresh histogram for the 100 GB heap: atlas_hist.txt

brharrington commented 6 years ago

If so, then maybe Atlas could have an option to fail fast with a simple response.

Maybe; it isn't clear that is the cause of the issue, though. If the failure is the data being too old, then it could be partially a symptom, assuming the backlog is delaying processing.

brharrington commented 6 years ago

The total for the new histogram is negative. Looks like that is a known bug that they do not intend to fix:

https://bugs.openjdk.java.net/browse/JDK-7012905 https://bugs.openjdk.java.net/browse/JDK-6539434

Did you run -histo:live or just -histo?

From the dumps, my best guess is that it is getting overwhelmed by the amount of data coming in and handling it badly. I'll try to set up a test and reproduce.

drax68 commented 6 years ago

-histo:live wasn't able to connect to the JVM process, so I used -F -histo.
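
For reference, a sketch of the two invocations being compared; <pid> is the Atlas process ID:

    # "jmap -histo:live" triggers a full GC and counts only reachable objects,
    # but needs to attach cooperatively. "jmap -F -histo" forces attachment
    # through the serviceability agent when the process is unresponsive and
    # counts all objects on the heap, which is one reason the totals can look
    # odd (see the JDK bugs linked above).
    jmap -histo:live <pid> > atlas_hist_live.txt
    jmap -F -histo <pid> > atlas_hist.txt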

pparth commented 6 years ago

@brharrington Anything new about this? Our production Atlas is crippled due to this.

brharrington commented 6 years ago

I have some tests set up and running; I should have some results in a few days.

pparth commented 6 years ago

@brharrington Thank you very much for your effort! Let us know if you need more data to help you with this.

drax68 commented 6 years ago

@brharrington it seems any non-default value for akka's max-connections causes a memory leak after a short uptime. Can you confirm that 100 requests/second is not problematic for the Atlas /publish endpoint?
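
For context, the override in question is the -Dakka.http.host-connector.max-connections=32000 flag from the JVM options above. A sketch of the same setting in application.conf form, assuming the usual Typesafe Config layering (system properties override application.conf, which overrides the library's reference.conf defaults):

    # Equivalent of -Dakka.http.host-connector.max-connections=32000. Removing
    # both the -D flag and this block lets the value fall back to the
    # reference.conf default, which is the comparison described above.
    akka.http.host-connector {
      max-connections = 32000
    }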

brharrington commented 6 years ago

any non-default value for akka's max-connections causes a memory leak after a short uptime.

Interesting. Did you change that setting when you started using nginx?

Can you confirm that 100 requests/second is not problematic for the Atlas /publish endpoint?

As always it depends a bit on data characteristics, validation settings, and overall load on the system. I can say we run some clusters with a higher rate.

We don't typically run with heaps larger than around 60g because the GC pauses tend to become problematic. That said, it has been a while since we stress tested that, but we are starting to look at it again. So, in short, your configuration is a bit different from what we run, but it doesn't seem unreasonable.

pparth commented 6 years ago

@drax68 ^^^

pparth commented 6 years ago

Our publishing traffic is actually 10-15 requests/sec, not 100 requests/sec.

drax68 commented 6 years ago

Interesting. Did you change that setting when you started using nginx?

The property was changed more than a year ago; Atlas was working fine all this time.

drax68 commented 6 years ago

The latest 1.6.0 RC looks good in terms of memory consumption, but we observe various timeouts on publishing/querying. Is this something known to you? Tried it with akka.http.host-connection-pool.pool-implementation=new and akka.http.server.linger-timeout=infinite.

Almost 50% of requests are timing out during every metadata index rebuild. Some kind of thread starvation?

brharrington commented 6 years ago

akka.http.host-connection-pool.pool-implementation=new

That is a client setting, so it shouldn't impact the behavior of the server.

Almost 50% of requests are timing out during every metadata index rebuild. Some kind of thread starvation?

There is an isolated thread for building the index. If I had to guess, you are probably seeing high GC pauses when the index is rebuilt. Are you tracking that either with GC logs or something like spectator-ext-gc?
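
As a rough way to correlate the two, the pause times can be pulled out of a JDK 8 G1 log written by the PrintGC* flags listed earlier; the exact line layout varies by JDK build, so the field handling below is an assumption:

    # Print GC events with a pause longer than one second. Assumes JDK 8 G1 log
    # lines that end with "..., <seconds> secs]".
    grep -E 'GC pause|Full GC' gc.log \
      | awk '$NF == "secs]" && $(NF-1) + 0 > 1.0'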

drax68 commented 6 years ago

I have verbose GC logs enabled. GC pauses during the rebuild process are up to 7.5s.

pparth commented 6 years ago

@brharrington Anything new about this? We still have these issues.

brharrington commented 6 years ago

Just to be clear, the issues are the timeouts mentioned in https://github.com/Netflix/atlas/issues/737#issuecomment-351705580? Do the long GC pauses correspond with the timeouts?

I would guess it is a bit overloaded and we'll need to look at making the index rebuilds cheaper. This is something we are interested in as well, and I hope to get to it soon. Right now, though, I'm looking at some client-side performance regressions that came up after the Meltdown remediation was applied. So it will probably be a week or so before I can get back to looking at the indexing.