SG gets into a 100% cpu state, where restart is the only recovery

dropwhile commented 7 years ago

Sync Gateway version

{"ADMIN":true,"couchdb":"Welcome","vendor":{"name":"Couchbase Sync Gateway","version":1.4},"version":"Couchbase Sync Gateway/1.4.0(2;9e18d3e)"}

Operating system

CentOS-7.3.1611

Config

SG is backed by couchbase bucket.

Log output

https://gist.github.com/cactus/46a86ea6f2297a03c4bb8ebb48b3675e

Expected behavior

Not use 100% cpu and continue eating ram (6GB or so) until the OOM killer reaped it.

Actual behavior

100% cpu and continued eating ram (6GB or so) until the OOM killer reaped it. This death spiral continued even after the couchbase lite agents were turned off.

Steps to reproduce

Reproduction of this one seems to be quite difficult. We currently have an app that uses couchbase lite (1.4), and it currently has some kind of conflict resolution bug, that creates many revisions and undeleted leaf nodes. We are working on fixing the client bug.

The concerning part here is that the sync gateway seems to get into some wedged state, where the only recourse is to restart it, even after the problematic clients have stopped sending it garbage revisions.

Once the sync gateway is restarted, it behaves normally again (presuming the buggy clients remain offline).

Just filing this ticket to try and help make the SG product better -- we know we have some issues with our client that we need to work out with conflict resolution handling, and we are hopeful that resolving those will help keep the SG from tipping over...

Let me know if there is any other information that may prove useful or desirable, and I can see about providing it.

ajres commented 7 years ago

@cactus

Could you run tools/sgcollect_info and attach the results to this ticket.

For the best data collection go 1.7.1 (same as SG binary build) should be installed on the SG server.

Here is a guide to using sgcollect_info

dropwhile commented 7 years ago

@ajres do you want that collection done when sync gateway is using 100% cpu, when it is idle, or both?

dropwhile commented 7 years ago

I took an 'idle' collection just now (on the off chance you want that too). We will try and reproduce the weird behavior again tomorrow.

dropwhile commented 7 years ago

Alas, we weren't able to reproduce it today. We will hopefully have time to try again next tuesday.

ajres commented 7 years ago

@cactus sorry, yes we'll need the sgcollect_info archive generated when SG is using 100% CPU

dropwhile commented 7 years ago

@ajres I managed to capture the sgcollect_info when cpu was using 100% cpu. We hammered it with some replication traffic for a while, then killed nginx in front of it so it wasn't getting any more traffic. CPU remained pegged, and memory kept being gobbled until the box ran out and the OOM killer reaped the process. A bit before reaping, I managed to get the data collection run.

Here is the file

dropwhile commented 7 years ago

Here is another one where the sync gw was 100% cpu for a while. file

ajres commented 7 years ago

From the pprof profiles taken when the cpu was at 100% the issues appear to be in the in the json.Marshall and json.Unmarshall of docs. In one heap profile it shows a client using WebSocket pull replication.

@cactus, is it possible that the problematic clients are configured to use WebSockets, or is that standard for all of your CBL clients?

dropwhile commented 7 years ago

I believe it is standard for all our CBL clients (ios/swift/couchbase lite). We do have one process using the _change http endpoint to stuff changes into a database, and that uses non ws continuous feed (python w/requests library). We also do proxy through nginx, but for the most part have no issues there.

Here is a snippet of our config for nginx, just in case it is interesting/useful:

upstream sync_gateway {
    server 127.0.0.1:4984 fail_timeout=0;
    # disable backend keepalive for sync gateway, to avoid "protocol upgrade" Go exception
    # with some versions of sync gateway.
    # keepalive 30;
}

server {
    listen 443 ssl http2;
    # <snip>
    location / {
        proxy_set_header    Host $host;
        proxy_set_header    X-Real-IP $remote_addr;
        proxy_set_header    X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header    X-Forwarded-Proto $scheme;
        proxy_pass_header   Accept;
        proxy_pass_header   Server;

        proxy_http_version 1.1;
        proxy_set_header Connection "upgrade";
        proxy_set_header Upgrade $http_upgrade;

        proxy_pass          http://sync_gateway;
        proxy_buffering     off;
        proxy_read_timeout  360s;
        keepalive_requests  1000;
        keepalive_timeout   360s;
        proxy_redirect default;
    }
}

Our clients also do local replication between peers using the sync protocol, as well as to sync gw. We did find that our clients were making excessive leaf revisions, due to a bug in some of our conflict resolution code -- so many leaf revs in fact that for a couple docs we would occasionally hit the "max size" and the replication of that document would start to fail.

ajres commented 7 years ago

@cactus, I'm working on this issue in the current sprint.

Unfortunately the attached profiles are no longer available for download, are you able to upload new copies?

There are a number of fixes on the current repo master branch that may resolve this issue. Can you try building from source and testing against that or test against the 1.5.0 binary release of SG when it is released.

dropwhile commented 7 years ago

@ajres Unfortunately I am no longer working on the project in question (contract gig ended), which is why the attached profiles went away. I do have the files in a backup, and will put them back up (see new links below), so you can fetch them. Just be aware that I no longer have the capacity to test against the same environment where the issue occurred, so the profiles are about all I can provide at this point.

sg_collect_info_dev-01-2017-03-30-23-3624-high-cpu.zip sg_collect_info_dev-01-2017-03-31-22-5015-high-cpu.zip

ajres commented 7 years ago

There is not enough information on this ticket in it's current state, as it's not going to be possible to get further details I'm closing this ticket.

Reopen if the issue is reproducible in the original project context.

teddybee commented 6 years ago

I have the same issue when I install the server and a gateway in docker. Until creating a bucket there is no issue, but as soon as I start the gateway. I got constant 300% cpu usage(assuming 3 core fully working). If I stop the gateway and restart the server's container the high cpu usage is back again. The only way to eliminate this is deleting the bucket from the server and recreate it again... The bucket stucks or falls back to warmup state. Two thread flooding the host this time when i type the top: 5396 xxx 20 0 2523312 157240 20920 S 171.9 8.3 10:24.97 beam.smp 5360 xxx 20 0 1719036 342464 5456 S 105.9 18.0 7:18.56 beam.smp

I am using the 1.5.1 community edition of couchbase and the latest docker. I put the containers on the same docker network. With windows version of sync gateway there is no such high cpu issue, but it can't find the database on the linux server eighter. 404 no such database "db"

couchbase / sync_gateway