Closed: mattiasmak closed this issue 7 years ago
You're running a pre-release beta; you should upgrade to version 1.0.2 or 1.0.3 (which will be released today).
I170628 12:25:49.455867 1 cli/start.go:390 CockroachDB CCL beta-20170413-dirty (linux amd64, built 2017/04/13 21:06:17, go1.8.1)
Looking at one node, we can see that things are stable for a while at 1.4GB RSS and ~80 goroutines, then both numbers rise until the process is unable to allocate more memory and dies:
```
I170705 02:54:07.068211 58 server/status/runtime.go:227 [n1] runtime stats: 1.4 GiB RSS, 79 goroutines, 95 MiB/143 MiB/312 MiB GO alloc/idle/total, 515 MiB/673 MiB CGO alloc/total, 1251.06cgo/sec, 0.41/0.02 %(u/s)time, 0.00 %gc (8x)
I170705 02:54:17.065483 58 server/status/runtime.go:227 [n1] runtime stats: 1.4 GiB RSS, 79 goroutines, 155 MiB/84 MiB/312 MiB GO alloc/idle/total, 517 MiB/673 MiB CGO alloc/total, 1193.73cgo/sec, 0.45/0.01 %(u/s)time, 0.00 %gc (7x)
I170705 02:54:27.080110 58 server/status/runtime.go:227 [n1] runtime stats: 1.6 GiB RSS, 81 goroutines, 366 MiB/93 MiB/509 MiB GO alloc/idle/total, 518 MiB/673 MiB CGO alloc/total, 379210.44cgo/sec, 1.56/0.06 %(u/s)time, 0.00 %gc (16x)
I170705 02:54:37.355539 58 server/status/runtime.go:227 [n1] runtime stats: 1.8 GiB RSS, 130 goroutines, 459 MiB/216 MiB/732 MiB GO alloc/idle/total, 520 MiB/673 MiB CGO alloc/total, 34755.38cgo/sec, 0.82/0.32 %(u/s)time, 0.00 %gc (3x)
I170705 02:54:48.846425 58 server/status/runtime.go:227 [n1] runtime stats: 1.8 GiB RSS, 248 goroutines, 320 MiB/356 MiB/732 MiB GO alloc/idle/total, 520 MiB/673 MiB CGO alloc/total, 37.21cgo/sec, 0.14/0.50 %(u/s)time, 0.00 %gc (0x)
I170705 02:54:59.430478 58 server/status/runtime.go:227 [n1] runtime stats: 1.8 GiB RSS, 1067 goroutines, 317 MiB/332 MiB/732 MiB GO alloc/idle/total, 520 MiB/673 MiB CGO alloc/total, 33.17cgo/sec, 0.27/0.44 %(u/s)time, 0.00 %gc (1x)
I170705 02:55:07.268946 58 server/status/runtime.go:227 [n1] runtime stats: 1.8 GiB RSS, 1242 goroutines, 314 MiB/340 MiB/732 MiB GO alloc/idle/total, 520 MiB/673 MiB CGO alloc/total, 136.74cgo/sec, 0.25/0.50 %(u/s)time, 0.00 %gc (0x)
I170705 02:55:17.490386 58 server/status/runtime.go:227 [n1] runtime stats: 1.8 GiB RSS, 1227 goroutines, 327 MiB/331 MiB/732 MiB GO alloc/idle/total, 520 MiB/673 MiB CGO alloc/total, 107.53cgo/sec, 0.17/0.51 %(u/s)time, 0.00 %gc (0x)
I170705 02:55:28.918161 58 server/status/runtime.go:227 [n1] runtime stats: 1.8 GiB RSS, 1313 goroutines, 385 MiB/273 MiB/732 MiB GO alloc/idle/total, 520 MiB/673 MiB CGO alloc/total, 65.07cgo/sec, 0.19/0.48 %(u/s)time, 0.00 %gc (0x)
I170705 02:55:37.120617 58 server/status/runtime.go:227 [n1] runtime stats: 1.8 GiB RSS, 1523 goroutines, 398 MiB/257 MiB/732 MiB GO alloc/idle/total, 520 MiB/673 MiB CGO alloc/total, 8.35cgo/sec, 0.28/0.48 %(u/s)time, 0.00 %gc (1x)
```
It looks like there are some large queries being run:
```
I170705 02:54:24.942502 10831076 sql/mon/mem_usage.go:509 [client=10.135.42.140:13569,user=root,n1] txn: memory usage increases to 8.0 MiB (+119)
I170705 02:54:25.337281 10830043 sql/mon/mem_usage.go:509 [client=10.135.42.140:13565,user=root,n1] sql: memory usage increases to 128 MiB (+10240)
I170705 02:54:25.395227 10830043 sql/mon/mem_usage.go:509 [client=10.135.42.140:13565,user=root,n1] txn: memory usage increases to 129 MiB (+896112)
I170705 02:54:25.445460 10831076 sql/mon/mem_usage.go:509 [client=10.135.42.140:13569,user=root,n1] txn: memory usage increases to 16 MiB (+119)
I170705 02:54:25.514976 134 vendor/google.golang.org/grpc/clientconn.go:806 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 10.135.1.165:26257: getsockopt: connection refused"; Reconnecting to {10.135.1.165:26257 <nil>}
I170705 02:54:25.893212 10831076 sql/mon/mem_usage.go:509 [client=10.135.42.140:13569,user=root,n1] txn: memory usage increases to 32 MiB (+1080)
I170705 02:54:26.262442 10830043 sql/mon/mem_usage.go:509 [client=10.135.42.140:13565,user=root,n1] sql: memory usage increases to 256 MiB (+10240)
I170705 02:54:26.494414 134 vendor/google.golang.org/grpc/clientconn.go:806 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 10.135.1.165:26257: getsockopt: connection refused"; Reconnecting to {10.135.1.165:26257 <nil>}
I170705 02:54:26.673534 10830043 sql/mon/mem_usage.go:509 [client=10.135.42.140:13565,user=root,n1] txn: memory usage increases to 256 MiB (+165)
I170705 02:54:26.891802 10831076 sql/mon/mem_usage.go:509 [client=10.135.42.140:13569,user=root,n1] txn: memory usage increases to 64 MiB (+119)
```
How much memory is available to these processes? Is it possible they're swapping?
Ok, I see... There's actually not much going on. Mostly simple writes and then one query that runs once every 6 hours...
Perhaps this query (still rather simple) is what's causing it. I'll investigate it, but I'm having a hard time believing this could crash servers.
Are there any tools that I can use in order to pinpoint the problem?
```sql
SELECT r.*, v.name AS venue_name, w.os AS os, CAST(r.duration_signal AS INT) AS duration_signal_int
FROM raw_visits r
INNER JOIN venues v ON r.venue_id=v.id
INNER JOIN clients c ON v.client_id=c.id
INNER JOIN devices d ON r.device_id=d.id
INNER JOIN whitelist w ON d.prefix=w.prefix
WHERE r.first_signal>='${startdate}' AND c.id = 1
```
A 5-table join could definitely run into memory issues, depending on how much data is in those tables. The `EXPLAIN` command can show you the query plan and might help identify opportunities for improvement. Our query planner is not yet very smart, and sometimes you have to help it by reordering things or giving it other hints.
> How much memory is available to these processes? Is it possible they're swapping?

What about this question?
We're implementing a scheme for offloading to external storage for queries that use up a lot of memory: https://github.com/cockroachdb/cockroach/issues/15206
Each node in the cluster has 2 GB of memory. I'll add some monitoring of memory usage and will also check the query plan. Thanks.
I created a new cluster with the newest version of CockroachDB. The same thing happens.
Can you guide me on how to solve this? This time I noticed that node 3 of the cluster had its memory usage build up until it crashed. The two other nodes were stable until node 3 crashed.
I attach the query plan and the chart of the memory usage.
Besides the query with joins, the system runs two more statements: one UPSERT and one INSERT. These two are called perhaps 5-10 times/second.
What I'm asking is how to solve this mystery. We migrated the database from MySQL to CockroachDB, and there were no issues of this kind on MySQL.
@RaduBerinde Can you give some guidance on how to debug the memory issue going on with this query?
I can't be sure it's this query, but the rest is just simple inserts. Table `raw_visits` has 700,000 rows and table `devices` 300,000. The rest are pretty small, 10-1,000 rows.
Do you think this query can do this? It's an export-data query that runs once every hour, and the graph does indicate that memory rises once every hour. But why doesn't it release the memory used when the query is done?
I attach screenshots of the three nodes' CPU and memory usage over the last 24 hours. It looks like node 3 runs all queries, then it crashes and node 1 takes over, and the same pattern repeats.
Node 1
Node 2
Node 3
I made it open: http://67.207.72.227:8080/
A large query would not explain the large and increasing number of goroutines (1,500 from a baseline of 80). I would get a snapshot of the goroutines from a host while its memory usage is significantly elevated (this can be done from the page at `host-ip:8080/debug`; grep for "goroutines").
We cannot do this from our side unless we have public IPs for all the hosts and the `server.remote_debugging.mode` cluster setting is `any` (it can be set from a SQL session using `SET CLUSTER SETTING server.remote_debugging.mode = 'any';`). Note that opening this up has security implications.
No problem. This is a "near-production" database, so I don't mind opening it up for a while.
The three nodes' IPs:
46.101.157.129 46.101.183.149 207.154.201.151
Load balancer: 67.207.72.227
Great, thanks! Can you tell me which is which (the admin UI shows internal IPs)? Are they listed in the order .4.109, .48.94, .53.95?
The goroutine count isn't very high on any host right now. I looked at the goroutines, and there is a distSQL query that seems to be stuck (could be #14948).
I enabled `trace.debug.enable` and will look at it a bit later, when hopefully the memory usage and goroutine count go higher.
@mattiasmak - from the client that is running this workload, can you tell if a query never returns? If a query gets stuck, would we still get a new query 6 hours later? (that would explain the increasing usage).
I tried to run the query above with a few recent startdates and it completes fine (and it doesn't seem to leave any goroutines around).
Looking at the goroutines more closely, I don't think this is #14948. I don't see any outbox stuck trying to flush. I see some goroutines stuck on sending to RowChannels and others stuck on receiving, which is strange.
Right now memory on 46.101.183.149 is crazy high. Are there things I should do? I don't want to touch anything at the moment.
@mattiasmak it's quite early here, but perhaps you can capture ./cockroach debug zip
for the offending node? Note that that zip file may contain semi-sensitive information, so you may want to share it privately.
Actually, I just took a look, and at least the memory stats in the UI don't look horrible; can you point me at what looks out of the ordinary?
What I am seeing is that the goroutine count seems to have been slowly increasing over the last week. That is unexpected. There are definitely things in there that have been running for more than a day. @RaduBerinde, looks like your kind of jam?
@RaduBerinde Note the `sql txn implicit` span: it gets a bunch of leases, then heads into DistSQL and never comes back:
```sql
SELECT raw_visits.*, venues."name" AS venue_name, whitelist.os AS os,
       CAST(raw_visits.duration_signal AS INT) AS duration_signal_int
FROM raw_visits, venues, devices, whitelist
WHERE (((raw_visits.device_id = devices.id) AND
        (whitelist.prefix = devices.prefix)) AND
       (raw_visits.venue_id = venues.id)) AND
      (raw_visits.first_signal >= '2017-07-15')
```
I reproduced that the query isn't coming back:
```shell
$ ./cockroach sql --insecure --host 46.101.183.149 <<EOF
SET DATABASE=RO;SELECT raw_visits.*, venues."name" AS venue_name, whitelist.os AS os, CAST(raw_visits.duration_signal AS INT) AS duration_signal_int FROM raw_visits, venues, devices, whitelist WHERE (((raw_visits.device_id = devices.id) AND (whitelist.prefix = devices.prefix)) AND (raw_visits.venue_id = venues.id)) AND (raw_visits.first_signal >= '2017-07-15');
EOF
```
If you can afford to try it out, set the cluster setting `sql.defaults.distsql = 'off'` and see whether the query works (it may instead run very slowly and perhaps overload the cluster). If it does, and you want to keep DistSQL off until we fix the bug, give the cluster a rolling restart to get rid of the stuck operations. Then see if you still have memory problems (what I found may be one bug, but there may be others!).
I'm confident that with the information above, @RaduBerinde can work out what's going wrong.
Ok, I have this reproduced locally. @RaduBerinde, will send that your way.
Btw, @mattiasmak, I ran the query locally with DistSQL off and it's not pretty: it eats north of 1 GB of RAM. So you probably don't even want to try that.
Ok. I hope you guys find something useful in this. Do you know at this point whether I can adjust the query to avoid crashing the server?
Thanks Tobi!
This is a plan that gets stuck:
I believe that we have a deadlock between the two hashjoiner shards. The hash joiners read from both sides (alternating), so if one side is stalled when receiving, that stalls the entire hash joiner. Now suppose that for `devices` we get a batch of rows that all go to the left, and that for `whitelist` we get a batch of rows that all go to the right. If these batches are big enough to fill the RowChannel buffers, we could be stuck with the left hashjoiner waiting for a result from `whitelist` and the right hashjoiner waiting for a result from `devices` (and the table readers stuck as well). I have confirmed from the goroutine stacks that table/join readers are stuck trying to Push and hash joiners are stuck trying to Recv.
It's harder to see how this happens with the 1.0 hashjoiner implementation, which always reads from a fixed side first, but it's possible (one hashjoiner needs to be in the probe stage and the other in the buffering stage).
The fix for this case is to change the hash joiner to read concurrently from both sides and buffer as necessary. However, this problem is more general and will require fixes for ordered synchronizers and mergejoiners; I will file a separate "master" issue for these.
I have a question about the cluster we run: did you do something that made performance worse? Now I can't run the query at all. Can you instruct me how to switch back from debugging, or was nothing changed? (Perhaps this should be discussed in another place.)
No, I don't think we changed anything. You could try restarting the nodes. We are working on the problem we found here (it's my highest priority) and we hope to have it fixed soon. Sorry, I don't think there is a workaround (other than disabling DistSQL, which makes the queries slower).
Fixed by #17328.
We have a pretty small cluster. Our problem is that it goes down once every few days, and we need to kill (`killall -9 cockroach`) and restart the cluster manually. I have previously had similar problems with CockroachDB, and it continues to be very unstable.
I think the best way to find out what happens is to supply you the logs.
Hopefully this gives you some help in making cockroach a more robust database.
LOGS: ro-db1.tar.gz ro-db2.tar.gz ro-db3.tar.gz
```
I170705 06:47:09.341881 1 util/log/clog.go:1007 [config] file created at: 2017/07/05 06:47:09
I170705 06:47:09.341881 1 util/log/clog.go:1007 [config] running on machine: RO-DB-001
I170705 06:47:09.341881 1 util/log/clog.go:1007 [config] binary: CockroachDB CCL beta-20170413-dirty (linux amd64, built 2017/04/13 21:06:17, go1.8.1)
I170705 06:47:09.341881 1 util/log/clog.go:1007 [config] arguments: [cockroach start --insecure --advertise-host=10.135.17.161 --host=10.135.17.161]
I170705 06:47:12.135272 1 util/log/clog.go:1007 [config] file created at: 2017/07/05 06:47:12
I170705 06:47:12.135272 1 util/log/clog.go:1007 [config] running on machine: RO-DB-002
I170705 06:47:12.135272 1 util/log/clog.go:1007 [config] binary: CockroachDB CCL beta-20170413-dirty (linux amd64, built 2017/04/13 21:06:17, go1.8.1)
I170705 06:47:12.135272 1 util/log/clog.go:1007 [config] arguments: [cockroach start --insecure --advertise-host=10.135.1.165 --join=10.135.17.161:26257 --host=10.135.1.165]
I170705 06:47:14.053046 1 util/log/clog.go:1007 [config] file created at: 2017/07/05 06:47:14
I170705 06:47:14.053046 1 util/log/clog.go:1007 [config] running on machine: RO-DB-003
I170705 06:47:14.053046 1 util/log/clog.go:1007 [config] binary: CockroachDB CCL beta-20170413-dirty (linux amd64, built 2017/04/13 21:06:17, go1.8.1)
I170705 06:47:14.053046 1 util/log/clog.go:1007 [config] arguments: [cockroach start --insecure --advertise-host=10.135.15.39 --join=10.135.17.161:26257 --host=10.135.15.39]
```