cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

distsql: query gets stuck #16875

Closed mattiasmak closed 7 years ago

mattiasmak commented 7 years ago

We have a pretty small cluster. Our problem is that it goes down once every few days and we need to kill (killall -9 cockroach) and restart the cluster manually. I have previously had similar problems with cockroach, and it continues to be very unstable.

I think the best way to find out what is happening is to supply you with the logs.

Hopefully this helps you make CockroachDB a more robust database.

LOGS: ro-db1.tar.gz ro-db2.tar.gz ro-db3.tar.gz


I170705 06:47:09.341881 1 util/log/clog.go:1007 [config] file created at: 2017/07/05 06:47:09
I170705 06:47:09.341881 1 util/log/clog.go:1007 [config] running on machine: RO-DB-001
I170705 06:47:09.341881 1 util/log/clog.go:1007 [config] binary: CockroachDB CCL beta-20170413-dirty (linux amd64, built 2017/04/13 21:06:17, go1.8.1)
I170705 06:47:09.341881 1 util/log/clog.go:1007 [config] arguments: [cockroach start --insecure --advertise-host=10.135.17.161 --host=10.135.17.161]

I170705 06:47:12.135272 1 util/log/clog.go:1007 [config] file created at: 2017/07/05 06:47:12
I170705 06:47:12.135272 1 util/log/clog.go:1007 [config] running on machine: RO-DB-002
I170705 06:47:12.135272 1 util/log/clog.go:1007 [config] binary: CockroachDB CCL beta-20170413-dirty (linux amd64, built 2017/04/13 21:06:17, go1.8.1)
I170705 06:47:12.135272 1 util/log/clog.go:1007 [config] arguments: [cockroach start --insecure --advertise-host=10.135.1.165 --join=10.135.17.161:26257 --host=10.135.1.165]

I170705 06:47:14.053046 1 util/log/clog.go:1007 [config] file created at: 2017/07/05 06:47:14
I170705 06:47:14.053046 1 util/log/clog.go:1007 [config] running on machine: RO-DB-003
I170705 06:47:14.053046 1 util/log/clog.go:1007 [config] binary: CockroachDB CCL beta-20170413-dirty (linux amd64, built 2017/04/13 21:06:17, go1.8.1)
I170705 06:47:14.053046 1 util/log/clog.go:1007 [config] arguments: [cockroach start --insecure --advertise-host=10.135.15.39 --join=10.135.17.161:26257 --host=10.135.15.39]

bdarnell commented 7 years ago

You're running a pre-release beta; you should upgrade to version 1.0.2 or 1.0.3 (which will be released today).

I170628 12:25:49.455867 1 cli/start.go:390  CockroachDB CCL beta-20170413-dirty (linux amd64, built 2017/04/13 21:06:17, go1.8.1)

Looking at one node, we can see that things are stable for a while at 1.4GB RSS and ~80 goroutines, then both numbers rise until the process is unable to allocate more memory and dies:

  17625:I170705 02:54:07.068211 58 server/status/runtime.go:227  [n1] runtime stats: 1.4 GiB RSS, 79 goroutines, 95 MiB/143 MiB/312 MiB GO alloc/idle/total, 515 MiB/673 MiB CGO alloc/total, 1251.06cgo/sec, 0.41/0.02 %(u/s)time, 0.00 %gc (8x)
  17636:I170705 02:54:17.065483 58 server/status/runtime.go:227  [n1] runtime stats: 1.4 GiB RSS, 79 goroutines, 155 MiB/84 MiB/312 MiB GO alloc/idle/total, 517 MiB/673 MiB CGO alloc/total, 1193.73cgo/sec, 0.45/0.01 %(u/s)time, 0.00 %gc (7x)
  17677:I170705 02:54:27.080110 58 server/status/runtime.go:227  [n1] runtime stats: 1.6 GiB RSS, 81 goroutines, 366 MiB/93 MiB/509 MiB GO alloc/idle/total, 518 MiB/673 MiB CGO alloc/total, 379210.44cgo/sec, 1.56/0.06 %(u/s)time, 0.00 %gc (16x)
  17689:I170705 02:54:37.355539 58 server/status/runtime.go:227  [n1] runtime stats: 1.8 GiB RSS, 130 goroutines, 459 MiB/216 MiB/732 MiB GO alloc/idle/total, 520 MiB/673 MiB CGO alloc/total, 34755.38cgo/sec, 0.82/0.32 %(u/s)time, 0.00 %gc (3x)
  17703:I170705 02:54:48.846425 58 server/status/runtime.go:227  [n1] runtime stats: 1.8 GiB RSS, 248 goroutines, 320 MiB/356 MiB/732 MiB GO alloc/idle/total, 520 MiB/673 MiB CGO alloc/total, 37.21cgo/sec, 0.14/0.50 %(u/s)time, 0.00 %gc (0x)
  17739:I170705 02:54:59.430478 58 server/status/runtime.go:227  [n1] runtime stats: 1.8 GiB RSS, 1067 goroutines, 317 MiB/332 MiB/732 MiB GO alloc/idle/total, 520 MiB/673 MiB CGO alloc/total, 33.17cgo/sec, 0.27/0.44 %(u/s)time, 0.00 %gc (1x)
  17747:I170705 02:55:07.268946 58 server/status/runtime.go:227  [n1] runtime stats: 1.8 GiB RSS, 1242 goroutines, 314 MiB/340 MiB/732 MiB GO alloc/idle/total, 520 MiB/673 MiB CGO alloc/total, 136.74cgo/sec, 0.25/0.50 %(u/s)time, 0.00 %gc (0x)
  17759:I170705 02:55:17.490386 58 server/status/runtime.go:227  [n1] runtime stats: 1.8 GiB RSS, 1227 goroutines, 327 MiB/331 MiB/732 MiB GO alloc/idle/total, 520 MiB/673 MiB CGO alloc/total, 107.53cgo/sec, 0.17/0.51 %(u/s)time, 0.00 %gc (0x)
  17771:I170705 02:55:28.918161 58 server/status/runtime.go:227  [n1] runtime stats: 1.8 GiB RSS, 1313 goroutines, 385 MiB/273 MiB/732 MiB GO alloc/idle/total, 520 MiB/673 MiB CGO alloc/total, 65.07cgo/sec, 0.19/0.48 %(u/s)time, 0.00 %gc (0x)
  17811:I170705 02:55:37.120617 58 server/status/runtime.go:227  [n1] runtime stats: 1.8 GiB RSS, 1523 goroutines, 398 MiB/257 MiB/732 MiB GO alloc/idle/total, 520 MiB/673 MiB CGO alloc/total, 8.35cgo/sec, 0.28/0.48 %(u/s)time, 0.00 %gc (1x)

It looks like there are some large queries being run:

I170705 02:54:24.942502 10831076 sql/mon/mem_usage.go:509  [client=10.135.42.140:13569,user=root,n1] txn: memory usage increases to 8.0 MiB (+119)
I170705 02:54:25.337281 10830043 sql/mon/mem_usage.go:509  [client=10.135.42.140:13565,user=root,n1] sql: memory usage increases to 128 MiB (+10240)
I170705 02:54:25.395227 10830043 sql/mon/mem_usage.go:509  [client=10.135.42.140:13565,user=root,n1] txn: memory usage increases to 129 MiB (+896112)
I170705 02:54:25.445460 10831076 sql/mon/mem_usage.go:509  [client=10.135.42.140:13569,user=root,n1] txn: memory usage increases to 16 MiB (+119)
I170705 02:54:25.514976 134 vendor/google.golang.org/grpc/clientconn.go:806  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 10.135.1.165:26257: getsockopt: connection refused"; Reconnecting to {10.135.1.165:26257 <nil>}
I170705 02:54:25.893212 10831076 sql/mon/mem_usage.go:509  [client=10.135.42.140:13569,user=root,n1] txn: memory usage increases to 32 MiB (+1080)
I170705 02:54:26.262442 10830043 sql/mon/mem_usage.go:509  [client=10.135.42.140:13565,user=root,n1] sql: memory usage increases to 256 MiB (+10240)
I170705 02:54:26.494414 134 vendor/google.golang.org/grpc/clientconn.go:806  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 10.135.1.165:26257: getsockopt: connection refused"; Reconnecting to {10.135.1.165:26257 <nil>}
I170705 02:54:26.673534 10830043 sql/mon/mem_usage.go:509  [client=10.135.42.140:13565,user=root,n1] txn: memory usage increases to 256 MiB (+165)
I170705 02:54:26.891802 10831076 sql/mon/mem_usage.go:509  [client=10.135.42.140:13569,user=root,n1] txn: memory usage increases to 64 MiB (+119)

How much memory is available to these processes? Is it possible they're swapping?

mattiasmak commented 7 years ago

Ok, I see... There's actually not much going on. Mostly simple writes and then one query that runs once every 6 hours...

Perhaps this query (still rather simple) is what's causing it. I will investigate it. But I'm having a hard time believing this could crash the servers.

Are there any tools that I can use in order to pinpoint the problem?

SELECT r.*, v.name as venue_name, w.os as os, CAST(r.duration_signal AS INT) as duration_signal_int 
FROM raw_visits r
INNER JOIN venues v ON r.venue_id=v.id
INNER JOIN clients c ON v.client_id=c.id
INNER JOIN devices d ON r.device_id=d.id
INNER JOIN whitelist w ON d.prefix=w.prefix
WHERE r.first_signal>='${startdate}' AND c.id = 1
bdarnell commented 7 years ago

A 5-table join could definitely run into memory issues, depending on how much data is in those tables. The EXPLAIN command can show you the query plan and might help identify opportunities for improvement. Our query planner is not yet very smart and sometimes you have to help it by reordering things or giving it other hints.
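
For reference, here is a sketch of what that might look like for the query in question (table and column names are taken from the query posted above; the date literal is only an example standing in for the real startdate parameter):

-- Sketch: prefix the reported query with EXPLAIN to see the chosen plan.
EXPLAIN SELECT r.*, v.name as venue_name, w.os as os, CAST(r.duration_signal AS INT) as duration_signal_int
FROM raw_visits r
INNER JOIN venues v ON r.venue_id=v.id
INNER JOIN clients c ON v.client_id=c.id
INNER JOIN devices d ON r.device_id=d.id
INNER JOIN whitelist w ON d.prefix=w.prefix
WHERE r.first_signal>='2017-07-15' AND c.id = 1;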

How much memory is available to these processes? Is it possible they're swapping?

What about this question?

vivekmenezes commented 7 years ago

We're implementing a scheme for offloading to external storage for queries that use up a lot of memory: https://github.com/cockroachdb/cockroach/issues/15206

mattiasmak commented 7 years ago

Each node in the cluster has 2GB of memory. Will add some monitoring on memory usage. Will also check the query plan. Thanks.

mattiasmak commented 7 years ago

I created a new cluster with the newest version of CockroachDB. The same thing happens.

Can you guide me on how to solve this? This time I noticed that node 3 of the cluster had its memory usage build up until it crashed. The two other nodes were stable until node 3 crashed.

I attach the query plan and the chart of the memory usage.

Besides the query with joins, the system runs two more statements: one UPSERT and one INSERT. These two are called perhaps 5-10 times per second.

What I'm asking is how to solve this mystery. We migrated the database from MySQL to CockroachDB and there were no issues of this kind on MySQL.

(attachments: memory usage graph and query plan)
petermattis commented 7 years ago

@RaduBerinde Can you give some guidance on how to debug the memory issue going on with this query?

mattiasmak commented 7 years ago

I can't be sure it's this query, but the rest is just simple inserts. Table 'raw_visits' has 700,000 rows and table 'devices' has 300,000 rows. The rest are pretty small, 10-1,000 rows.

Do you think this query can do this? It's an export-data query that runs once every hour... However, the graph does indicate that memory rises once every hour. But why doesn't it release the memory once the query is done?

mattiasmak commented 7 years ago

I attach screenshots of the three nodes' CPU and memory usage over the last 24 hours. It looks like node 3 runs all the queries, then it crashes, node 1 takes over, and the same pattern repeats.

Node 1 (screenshot attached)

Node 2 (screenshot attached)

Node 3 (screenshot attached)
mattiasmak commented 7 years ago

I made it open: http://67.207.72.227:8080/

RaduBerinde commented 7 years ago

A large query would not explain the large and increasing number of goroutines (1500 from a baseline of 80). I would get a snapshot of the goroutines from a host while its memory usage is significantly elevated (this can be done from the page at host-ip:8080/debug; grep for "goroutines").

We cannot do this from our side unless we have public IPs for all the hosts and the server.remote_debugging.mode cluster setting is set to 'any' (it can be set from a SQL session using SET CLUSTER SETTING server.remote_debugging.mode = 'any'). Note that opening this up has security implications.
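
For reference, a sketch of the statements involved (the 'local' value used to restrict debugging again afterwards is an assumption; check the accepted values for your version):

-- Allow remote access to the debug pages (note the security implications mentioned above).
SET CLUSTER SETTING server.remote_debugging.mode = 'any';

-- Assumed way to restrict the debug pages again once the investigation is done.
SET CLUSTER SETTING server.remote_debugging.mode = 'local';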

mattiasmak commented 7 years ago

No problem. This is a "near production" database, so I don't mind opening it up for a while.

The three nodes' IPs:

46.101.157.129 46.101.183.149 207.154.201.151

Load balancer: 67.207.72.227

RaduBerinde commented 7 years ago

Great, thanks! Can you tell me which is which (the admin UI shows internal IPs)? Are they listed in the order .4.109, .48.94, .53.95?

RaduBerinde commented 7 years ago

The goroutine count isn't very high on any of the hosts right now. I looked at the goroutines, and there is a DistSQL query that seems to be stuck (could be #14948).

goroutine-207.154.201.151.txt

I enabled trace.debug.enable and will look at it a bit later, when hopefully the memory usage and goroutine count have gone up again.
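
For anyone following along, this is roughly what enabling that looks like from a SQL session (a sketch, assuming trace.debug.enable is a boolean cluster setting, as the name suggests):

-- Enable collection of request traces for the debug pages; set back to false when finished.
SET CLUSTER SETTING trace.debug.enable = true;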

RaduBerinde commented 7 years ago

@mattiasmak - from the client that is running this workload, can you tell if a query never returns? If a query gets stuck, would we still get a new query 6 hours later? (that would explain the increasing usage).

RaduBerinde commented 7 years ago

I tried to run the query above with a few recent startdates and it completes fine (and it doesn't seem to leave any goroutines around).

RaduBerinde commented 7 years ago

Looking at the goroutines more closely, I don't think this is #14948. I don't see any outbox stuck trying to flush. I see some goroutines stuck on sending to RowChannels and others stuck on receiving, which is strange.

mattiasmak commented 7 years ago

Right now memory on 46.101.183.149 is crazy high. Are there things I should do? I don't want to touch anything at the moment.

tbg commented 7 years ago

@mattiasmak it's quite early here, but perhaps you can capture ./cockroach debug zip for the offending node? Note that that zip file may contain semi-sensitive information, so you may want to share it privately.

tbg commented 7 years ago

Actually, I just took a look, and at least the memory stats in the UI don't look horrible. Can you point me at what looks out of the ordinary?

tbg commented 7 years ago

What I am seeing is that the goroutine count seems to have been slowly increasing over the last week. That is unexpected. There are definitely things in there that have been running for more than a day. @RaduBerinde, looks like your kind of jam?

(screenshots attached: goroutine count graphs over the last week)

tbg commented 7 years ago

(screenshot attached)

tbg commented 7 years ago

(five screenshots attached)

mattiasmak commented 7 years ago
(screenshot attached, taken 2017-07-16 16:13)
tbg commented 7 years ago

@RaduBerinde Note the sql txn implicit span -- it gets a bunch of leases, then heads into DistSQL and never comes back:


SELECT raw_visits.*, venues."name" AS
venue_name, whitelist.os AS os,
CAST(raw_visits.duration_signal AS INT) AS duration_signal_int
FROM raw_visits, venues, devices, whitelist
WHERE (((raw_visits.device_id = devices.id) AND
(whitelist.prefix = devices.prefix)) AND
(raw_visits.venue_id = venues.id)) AND
(raw_visits.first_signal >= '2017-07-15')

requests.txt

tbg commented 7 years ago

I reproduced that the query isn't coming back:

$ ./cockroach sql --insecure --host 46.101.183.149 <<EOF
SET DATABASE=RO;SELECT raw_visits.*, venues."name" AS venue_name, whitelist.os AS os, CAST(raw_visits.duration_signal AS INT) AS duration_signal_int FROM raw_visits, venues, devices, whitelist WHERE (((raw_visits.device_id = devices.id) AND (whitelist.prefix = devices.prefix)) AND (raw_visits.venue_id = venues.id)) AND (raw_visits.first_signal >= '2017-07-15');
EOF

If you can afford to try it out, try set cluster setting sql.defaults.distsql = 'off' and see whether the query works (it may instead run very slowly and perhaps overload the cluster). If it does and you want to keep DistSQL off for now until we fix the bug, give the cluster a rolling restart to get rid of the stuck operations. Then, see if you still have memory problems (perhaps what I found is one bug, but there may be others!).
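
A sketch of the workaround and how to undo it later (the 'on' value for re-enabling distributed execution is an assumption; check the accepted values for your version):

-- Workaround: route queries through the non-distributed execution engine.
SET CLUSTER SETTING sql.defaults.distsql = 'off';

-- Assumed way to re-enable distributed execution once a fixed version is deployed.
SET CLUSTER SETTING sql.defaults.distsql = 'on';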

I'm confident that with the information above, @RaduBerinde can work out what's going wrong.

tbg commented 7 years ago

Ok, I have this reproduced locally. @RaduBerinde, will send that your way.

tbg commented 7 years ago

Btw, @mattiasmak, I ran the query locally with DistSQL=off and it's not pretty. It eats north of 1GB of RAM. So you probably don't even want to try that.

mattiasmak commented 7 years ago

Ok. I hope you guys find something useful in this. Do you know at this point whether I can adjust the query to avoid crashing the servers?

RaduBerinde commented 7 years ago

Thanks Tobi!

This is a plan that gets stuck (diagram attached).

I believe that we have a deadlock between the two hashjoiner shards. The hash joiners read from both sides (alternating), so if one side is stalled when receiving, that stalls the entire hash joiner. Now suppose that for devices we get a batch of rows that all go to the left, and that for whitelist we get a batch of rows that all go to the right. If these batches are big enough to fill the RowChannel buffers, we could be stuck with the left hashjoiner waiting for a result from whitelist and the right hashjoiner waiting for a result from devices (and the table reader stuck as well). I have confirmed from the goroutine stacks that table/join readers are stuck trying to Push and hash joiners are stuck trying to Recv.

It's harder to see how this happened with the 1.0 hashjoiner implementation which always reads from a fixed side first, but it's possible (one hashjoiner needs to be in the probe stage and one in the buffering stage).

The fix for this case is to change the hash joiner to read concurrently from both sides and buffer as necessary. However, this problem is more general and will require fixes for ordered synchronizers and mergejoiners; I will file a separate "master" issue for these.

mattiasmak commented 7 years ago

I have a question about the cluster we run. Did you do something that made performance worse? Now I can't run the query at all. Can you instruct me how to switch back from the debugging setup? Or was nothing changed? (Perhaps this should be discussed in another place.)

RaduBerinde commented 7 years ago

No, I don't think we changed anything. You could try restarting the nodes. We are working on the problem we found here (it's my highest priority) and we hope to have it fixed soon. Sorry, I don't think there is a workaround (other than disabling distsql, which makes the queries slower).

RaduBerinde commented 7 years ago

Fixed by #17328.