
CrateDB is a distributed and scalable SQL database for storing and analyzing massive amounts of data in near real-time, even with complex queries. It is PostgreSQL-compatible, and based on Lucene.
https://cratedb.com/product
Apache License 2.0

Statements stuck in sys.jobs #6383

Closed. tellezb closed this issue 6 years ago

tellezb commented 7 years ago

Environment: Crate 1.1.6, Amazon Linux AMI 4.9.38-16.33.amzn1.x86_64, 6 AWS t2.small nodes in 3 availability zones

Problem description: I posted this issue on Slack and was instructed to create an issue here.

Seeing quite a few (over 100) entries in sys.jobs for statements with old start dates ... up to a week ago. The application which started the query is definitely not running and I don't seem to be able to KILL the job either (nothing happens).

Currently there are 102 jobs. Killing a couple of them using KILL didn't change anything (the jobs still show up in sys.jobs):

[screenshot: sys.jobs entries]
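
For reference, a minimal sketch of the kind of statements involved (the job id below is just a placeholder, not a real id from this cluster):

SELECT id, started, stmt FROM sys.jobs ORDER BY started ASC LIMIT 10;
KILL '00000000-0000-0000-0000-000000000000';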

I've tried running a few of the queries found in sys.jobs and they all run in under a second. Nothing noteworthy is in the logs. Node CPU looks normal: [screenshot: node CPU usage]

I haven't restarted the nodes, in an attempt to preserve this state. What else should I look for? Thanks.

mfussenegger commented 7 years ago

Thanks for reporting

I remember we fixed an issue a while ago where entries were not properly cleared from sys.jobs. Would it be possible to upgrade to the latest version? (Currently 2.1.)

tellezb commented 7 years ago

The queries vary and are against a few different tables. None of them have GROUP BY, some of them have ORDER BY, some of them have filters, none of them filter by pk.

339 entries in sys.operations: [screenshot: sys.operations entries]

Unfortunately, I cannot update the cluster to 2.1 quite yet.

rps-v commented 7 years ago

We are also seeing stuck jobs on simple SELECT statements. We can't even KILL them; it has no effect. The only solution seems to be to restart the cluster. This issue has been present on almost every version; we are now on 2.1.6 and still get them.

mfussenegger commented 7 years ago

@rps-v

Could you:

a) also post the output of select name, _node['name'], count(*) from sys.operations group by 1, 2;
b) confirm that on the client side no queries get stuck
c) provide a heap dump of one of the nodes which contain stuck operations?

rps-v commented 7 years ago

@mfussenegger

Unfortunately the cluster was restarted recently; however, I will report back when jobs get stuck again.

rps-v commented 7 years ago

@mfussenegger

Well, that did not take long 😞. One of the nodes in the cluster just ran out of heap (from 4gb to 7.9gb in 15 seconds, which seems way too fast and more like a bug).

The cluster became unresponsive until the node was restarted, so I could not run the sys.operations query. Unfortunately I could not generate a heap dump either, because this is the production cluster and I had to fix it ASAP. However, I sent you the logs via email.

Thank you !!

rps-v commented 7 years ago

@mfussenegger

Here is the result for the sys.operations query:

[screenshot: sys.operations query result]

mfussenegger commented 7 years ago

One of the nodes in the cluster just ran out of heap (from 4gb to 7.9gb in 15 seconds, which seems way too fast and more like a bug)

Going from 4gb to 7.9gb in 15 sec is not uncommon. But the logs indicate that the garbage collection couldn't free any memory:

[gc][old][151891][74] duration [7.4s], collections [1]/[7.4s], total [7.4s]/[4.5m], memory [7.9gb]->[7.9gb]/[7.9gb], all_pools {[young] [532.5mb]->[532.5mb]/[532.5mb]}{[survivor] [64.4mb]->[64.1mb]/[66.5mb]} {[old] [7.3gb]->[7.3gb]/[7.3gb]}

That's definitely an issue that should be caught by the circuit breaker.

I think this might be a different problem - maybe you could open up a separate issue for this?

Here is the result for the sys.operations query:

Any chance to get a heapdump from node s0?

rps-v commented 7 years ago

@mfussenegger

I tried to get a heap dump using jmap, but it was running really slow, and since this happened on the production cluster I had to abort it and restart the node. I will try again if queries get stuck.

Now everything seems fine (I did change the "write.wait_for_active_shards" setting to 1 for a few tables, since we were getting some errors about not enough active shard copies).
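
For reference, a change like that can be applied per table roughly as follows (the table name is a placeholder; the setting name follows the CrateDB docs):

ALTER TABLE my_table SET ("write.wait_for_active_shards" = 1);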

Edit: Created a separate issue #6472

abrenneke commented 6 years ago

I currently have 719 rows stuck in sys.jobs, most from 4 days ago. CPU usage is 0, so it's not like they're doing anything.

mfussenegger commented 6 years ago

@SneakyMax

could you also report:

abrenneke commented 6 years ago

Yeah.

The result of the command above gives me:

name | _node['name'] | count(*)
distributed merge | James Tiberius Kirk | 715
distributing collect | James Tiberius Kirk | 715
collect | James Tiberius Kirk | 667
fetchContext | James Tiberius Kirk | 666
mergeOnHandler | S'chn T'gai Spock | 1
fetchContext | S'chn T'gai Spock | 166
nested-loop | S'chn T'gai Spock | 715
collect | S'chn T'gai Spock | 168
nl-merge | S'chn T'gai Spock | 1381

I'd give you crate logs for the errors too, but unfortunately our crate logs are full of thousands and thousands of lines of

[2017-11-20T22:11:13,253][WARN ][o.e.d.r.RestController   ] Content type detection for rest requests is deprecated. Specify the content type using the [Content-Type] header.

rps-v commented 6 years ago

@mfussenegger I finally managed to get a heap dump. Sent to your email.

mikethebeer commented 6 years ago

@SneakyMax Thx for the update.

I'd give you crate logs for the errors too, but unfortunately our crate logs are full of thousands and thousands of lines of

[2017-11-20T22:11:13,253][WARN ][o.e.d.r.RestController   ] Content type detection for rest requests is deprecated. Specify the content type using the [Content-Type] header.

Are you using the crate python driver? This deprecation warning is fixed in the python driver version >= 0.20.1. I really recommend updating to the latest driver, since this deprecation warning seems to flood your logs, which is not good for performance.

Regarding the issue of stuck queries listed in sys.jobs/sys.operations: could you check whether you get any disconnect warnings for participating nodes in the cluster? A possible cause could be network partitions, or a node hanging (maybe just because of the flooded logs?!)

abrenneke commented 6 years ago

@mikethebeer We're using the node-crate driver. It looks like the Content-Type header was added Oct 11th; we just hadn't upgraded yet.

It would be hilarious if this turned out to be caused by flooded logs. I'll see if I can find anything after upgrading the driver.

mikethebeer commented 6 years ago

@SneakyMax ok, keep us informed if you see stuck queries again after the driver update. This was just the first assumption that came to mind, since I have the feeling that certain nodes become unresponsive and their queries then get stuck.

mikethebeer commented 6 years ago

@SneakyMax Another parameter you could look at is the number of shards in the cluster. If I remember correctly, there was an issue with hanging queries in sys.jobs some time ago where a customer had far too many shards assigned in the cluster. Could you also check the number of shards and replicas configured for your tables? Please also send us the number of nodes in your cluster so that we can compare the two.
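
Something along these lines should show the distribution (just a sketch, reusing the sys.shards columns already quoted in this thread):

-- shards per table (primaries and replicas combined)
SELECT schema_name, table_name, count(*) AS num_shards
FROM sys.shards
GROUP BY 1, 2
ORDER BY 3 DESC;

-- shards per node, to spot uneven allocation
SELECT _node['name'] AS node_name, count(*) AS num_shards
FROM sys.shards
GROUP BY 1;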

seut commented 6 years ago

@tellezb @rps-v @SneakyMax is this still an issue for you?

abrenneke commented 6 years ago

@seut Yeah, we currently have 39 statements in sys.jobs and none of them should be there. Crate 2.2.3.

abrenneke commented 6 years ago

re: @mikethebeer

rps-v commented 6 years ago

@seut Yes, we also have around 10 jobs stuck which can't be killed (on a 5 node cluster running Crate 2.1.6).

Here are the statements from those jobs:

SELECT id, severity, description, passed FROM sys.checks WHERE passed = FALSE ORDER BY severity DESC, id

select schema_name || '.' || table_name, min_lucene_version from sys.shards where min_lucene_version not like '6.%.%' order by 1

mxm commented 6 years ago

Thanks for updating the issue again with some more information. Previously we thought this could be related to more complex queries, but it seems to also affect simple queries on sys tables, which led me to check some of the job log code and how the entries are cleaned up.

While looking through the code that is executed before the job log entry is removed, I found that we don't remove the entry in case of channel errors, which can occur while sending the REST response. This could be the cause of this issue, because all the clients used here are REST based.

I've opened a PR (#6718), so this should land soon, and you can then test whether the issue still occurs.

mfussenegger commented 6 years ago

Closing this as we believe it's been fixed with 2.2.6.

Please re-open if anyone encounters this in 2.2.6+

abrenneke commented 6 years ago

We've upgraded now and set stats.enabled to true; will let you know if it happens again.

abrenneke commented 6 years ago

@mfussenegger we've been running 2.3 for a week now and no extra entries in sys.jobs!

rps-v commented 6 years ago

We have also updated to 2.3.2 and still see a few stuck jobs which cannot be killed (a few UPDATE statements).

mfussenegger commented 6 years ago

We have also updated to 2.3.2 and still see a few stuck jobs which cannot be killed (a few UPDATE statements).

Were the UPDATE statements executed while nodes were being restarted? We have a fix for that case in 2.3.3. There is also a fix in 2.3.3 that should prevent queries on sys.checks or sys.shards from getting stuck.

If it's neither of those, a heap dump would be very valuable for us to investigate further.

rps-v commented 6 years ago

The stuck jobs are from today, but the nodes haven't been restarted since about 2 days ago (a few crashed). The jobs are not from any sys table, so it shouldn't be that.

Is there any way to determine which node the stuck queries executed on?

mfussenegger commented 6 years ago

Is there any way to determine which node the stuck queries executed on?

The sys.operations table contains _node['name'] and _node['id'] columns. You can join sys.jobs and sys.operations on jobs.id = operations.job_id.
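
A minimal sketch of that join, assuming a version with outer-join support (column list trimmed to the relevant bits):

SELECT j.id, j.stmt, j.started, o.name, o._node['name'] AS node_name
FROM sys.jobs j
LEFT JOIN sys.operations o ON o.job_id = j.id
ORDER BY j.started;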

rps-v commented 6 years ago

Something is strange... those 5 stuck UPDATE statements do not have entries in sys.operations (_node is null via JOIN). Also saw a stuck job from today with the following statement (could be the Admin UI ?):

SELECT table_name, schema_name, format('%s.%s', schema_name, table_name) AS fqn, _node['id'] AS node_id, state, routing_state, relocating_node, count(*), "primary", sum(num_docs), avg(num_docs), sum(size) FROM sys.shards GROUP BY table_name, schema_name, fqn, node_id, state, routing_state, relocating_node, "primary"

mfussenegger commented 6 years ago

Something is strange... those 5 stuck UPDATE statements do not have entries in sys.operations (_node is null via JOIN).

Can you show us the exact UPDATE statements? There are different execution plans for UPDATE statements. If the UPDATE has a WHERE pk_column = ? clause, it's executed in a way that doesn't result in sys.operations entries.
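
For illustration only (table and column names are made up), the two shapes would be roughly:

-- matches a single primary key: uses the direct per-document path,
-- so it leaves no entries in sys.operations
UPDATE my_table SET col1 = 'x' WHERE id = 'some-pk-value';

-- filters on a non-primary-key column: runs as a distributed query,
-- so entries for it are expected to show up in sys.operations
UPDATE my_table SET col1 = 'x' WHERE col2 = 'y';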

Also saw a stuck job from today with the following statement (could be the Admin UI ?):

This is one of the queries I was referring to that shouldn't become stuck anymore in 2.3.3

rps-v commented 6 years ago

Ah, I see.

All the UPDATE queries look like this:

UPDATE "tracknamic"."vehicle" SET "group_id"=? , "make"=? , "model"=? , ...  WHERE  id = '6fe97252-deac-468b-8412-3364cbb36405'

mfussenegger commented 6 years ago

@rps-v

I've got a few more questions regarding the stuck update statements:

1) Are there any errors in the logs?
2) What kind of client are you using? Is it Postgres-based, and are you by chance using prepared statements?
3) Do you know if the client received responses for these statements, or did the client get stuck / run into a timeout?

rps-v commented 6 years ago

@mfussenegger

  1. No errors in logs (on all nodes)
  2. We are using the HTTP client (node-crate)
  3. Unfortunately I do not have that information

mfussenegger commented 6 years ago

@rps-v

So far I'm not able to reproduce this, nor can I see anything in the code that looks suspicious.

Has the number of stuck entries grown since you initially reported this?

but the nodes haven't been restarted since about 2 days ago (a few crashed).

Did the crashes happen around the time when the queries got stuck?

rps-v commented 6 years ago

@mfussenegger

Since I reported this there have been no other stuck UPDATE queries; however, there were some from sys tables (which you said is fixed in 2.3.3). No, the crashes occurred 2 days before the queries got stuck.

We will try to update to 2.3.3 and report if it happens again. I think the issue can be closed for now.

mfussenegger commented 6 years ago

k thx, let us know once you've upgraded.