Thanks for reporting. I remember that we fixed something along the lines of entries not properly clearing from sys.jobs / sys.operations a while ago. Would it be possible to upgrade to the latest version (currently 2.1)?
The queries vary and are against a few different tables. None of them have GROUP BY, some of them have ORDER BY, some of them have filters, none of them filter by pk.
339 entries in sys.operations
Unfortunately, I cannot update the cluster to 2.1 quite yet.
We are also seeing stuck jobs on simple SELECT statements. We can't even KILL them; it has no effect. The only solution seems to be to restart the cluster. This issue has been present on almost every version; we are now on 2.1.6 and still get them.
@rps-v could you:
a) also post the output of select name, _node['name'], count(*) from sys.operations group by 1, 2;
b) confirm that on the client side no queries get stuck
c) provide a heap dump of one of the nodes that contains stuck operations?
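A quick way to spot the oldest entries is to sort sys.jobs by start time (a minimal sketch, assuming the standard id, stmt and started columns):

SELECT id, stmt, started
FROM sys.jobs
ORDER BY started ASC
LIMIT 20;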
@mfussenegger
Unfortunately the cluster was restarted recently; I will report back when jobs get stuck again.
@mfussenegger
Well, that did not take long 😞. One of the nodes in the cluster just ran out of heap (it went from 4gb to 7.9gb in 15 seconds, which seems way too fast and more like a bug).
The cluster became unresponsive until the node was restarted, so I could not run the sys.operations query. Unfortunately I could not generate a heap dump because this was on the production cluster and I had to fix it ASAP. However, I sent you the logs via email.
Thank you !!
@mfussenegger Here is the result for the sys.operations query:
One of the nodes in the cluster just ran out of heap (from 4gb to 7.9gb in 15 seconds, which seems way too fast and more like a bug
Going from 4gb to 7.9gb in 15 sec is not uncommon. But the logs indicate that the garbage collection couldn't free any memory:
[gc][old][151891][74] duration [7.4s], collections [1]/[7.4s], total [7.4s]/[4.5m], memory [7.9gb]->[7.9gb]/[7.9gb], all_pools {[young] [532.5mb]->[532.5mb]/[532.5mb]}{[survivor] [64.4mb]->[64.1mb]/[66.5mb]} {[old] [7.3gb]->[7.3gb]/[7.3gb]}
That's definitely an issue that should be caught by the circuit breaker.
I think this might be a different problem - maybe you could open up a separate issue for this?
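As an aside, per-node heap pressure can also be watched from SQL while this is happening; a minimal sketch, assuming the sys.nodes heap columns:

SELECT name, heap['used'], heap['max'], heap['free']
FROM sys.nodes
ORDER BY heap['used'] DESC;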
Here is the result for the sys.operations query:
Any chance to get a heapdump from node s0?
@mfussenegger
I tried to get a heapdump using jmap, but it was running really slow, and since this happened on the production cluster I had to abort it and restart the node. I will try again if queries get stuck.
Now everything seems fine (I did change the "write.wait_for_active_shards" setting to 1 for a few tables since we were getting some errors about not enough active shard copies).
Edit: Created a separate issue #6472
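For anyone else hitting the "not enough active shard copies" errors, the setting change mentioned above would look roughly like this (my_table is a placeholder and the exact setting key may vary between versions):

ALTER TABLE my_table SET ("write.wait_for_active_shards" = 1);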
I currently have 719 rows stuck in sys.jobs, most from 4 days ago. CPU usage is 0, so it's not like they're doing anything.
@SneakyMax
could you also report:
select name, _node['name'], count(*) from sys.operations group by 1, 2;
Yeah. The stuck queries are all along the lines of:
SELECT stuff FROM ( SELECT stuff FROM table WHERE stuff GROUP BY stuff ) AS e INNER JOIN other_table WHERE stuff ORDER BY stuff
The result of the command above gives me:
name                 | _node['name']       | count(*)
---------------------+---------------------+---------
distributed merge    | James Tiberius Kirk | 715
distributing collect | James Tiberius Kirk | 715
collect              | James Tiberius Kirk | 667
fetchContext         | James Tiberius Kirk | 666
mergeOnHandler       | S'chn T'gai Spock   | 1
fetchContext         | S'chn T'gai Spock   | 166
nested-loop          | S'chn T'gai Spock   | 715
collect              | S'chn T'gai Spock   | 168
nl-merge             | S'chn T'gai Spock   | 1381
I'd give you crate logs for the errors too, but unfortunately our crate logs are full of thousands and thousands of lines of
[2017-11-20T22:11:13,253][WARN ][o.e.d.r.RestController ] Content type detection for rest requests is deprecated. Specify the content type using the [Content-Type] header.
@mfussenegger I finally managed to get a heap dump. Sent to your email.
@SneakyMax Thx for the update.
I'd give you crate logs for the errors too, but unfortunately our crate logs are full of thousands and thousands of lines of
[2017-11-20T22:11:13,253][WARN ][o.e.d.r.RestController ] Content type detection for rest requests is deprecated. Specify the content type using the [Content-Type] header.
Are you using the crate python driver? This deprecation warning is fixed in python driver versions >= 0.20.1. I really recommend updating to the latest driver, since this deprecation warning seems to flood your logs, which is not good for performance.
Regarding the stuck queries listed in sys.jobs/operations: could you check whether you get any disconnect warnings for participating nodes in the cluster? A possible cause could be that you are running into network partitions, or that a node hangs (maybe just because of the flooded logs?!)
@mikethebeer We're using the node-crate driver. It looks like the Content-Type header was added Oct 11th; we just hadn't upgraded yet.
It would be hilarious if this really was because of flooded logs. I'll see if I can find anything after upgrading the driver.
@SneakyMax ok, keep us informed if you see stuck queries again after the driver update. This was just the first assumption that came to mind, since I have the feeling that certain nodes become unresponsive and so their queries get stuck.
@SneakyMax Another parameter that you could look at is the number of shards in the cluster. If I remember correctly, there was an issue with hanging queries in sys.jobs some time ago. In that scenario a customer had far too many shards assigned in the cluster.
Could you also check the number of shards and replicas you have configured for your tables? Also send us the number of nodes in your cluster so that we can compare the two.
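A sketch of how to pull those numbers; column names may differ slightly between CrateDB versions:

-- configured shards and replicas per user table
SELECT table_schema, table_name, number_of_shards, number_of_replicas
FROM information_schema.tables
WHERE table_schema NOT IN ('sys', 'information_schema', 'pg_catalog', 'blob');

-- total allocated shards and number of nodes in the cluster
SELECT count(*) FROM sys.shards;
SELECT count(*) FROM sys.nodes;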
@tellezb @rps-v @SneakyMax is this still an issue for you?
@seut Yeah, currently have 39 statements in sys.jobs, none of them should be there. Crate 2.2.3.
re: @mikethebeer
@seut Yes, we also have around 10 jobs stuck which can't be killed (on a 5 node cluster running Crate 2.1.6).
Here are the statements from those jobs:
SELECT id, severity, description, passed FROM sys.checks WHERE passed = FALSE ORDER BY severity DESC, id
select schema_name || '.' || table_name, min_lucene_version from sys.shards where min_lucene_version not like '6.%.%' order by 1
Thanks for updating the issue again with some more information. Previously we thought this could be related to more complex queries, but it seems to also affect simple queries on sys tables, which led me to check some of the job log code and how the entries are cleaned up.
While looking through the code that is executed before the job log entry is removed, I found that we don't remove the entry in case of channel errors, which can occur while sending the REST response. This could be the cause of this issue, because all the clients used here are REST based.
I've opened a PR (#6718), so this should land soon and you can test whether the issue still occurs.
Closing this as we believe it's been fixed with 2.2.6.
Please re-open if anyone encounters this in 2.2.6+
We've upgraded now and set stats.enabled to true; will let you know if it happens again.
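For reference, stats collection can be toggled at runtime; a minimal sketch:

SET GLOBAL PERSISTENT stats.enabled = true;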
@mfussenegger we've been running 2.3 for a week now and no extra entries in sys.jobs!
We have also updated to 2.3.2 and still see a few stuck jobs which cannot be killed (a few UPDATE statements).
We have also updated to 2.3.2 and still see a few stuck jobs which cannot be killed (a few UPDATE statements).
Were the UPDATE statements executed while nodes were being restarted? We have a fix for that case in 2.3.3.
And there is also a fix in 2.3.3 that should prevent queries on sys.checks or sys.shards from getting stuck.
If it's neither of that a heap dump would be very valuable in order for us to investigate further.
The stuck jobs are from today, but the nodes haven't been restarted since about 2 days ago (a few crashed).
The jobs are not from any sys table, so it shouldn't be that.
Is there any way to determine which node the stuck queries executed on?
Is there any way to determine which node the stuck queries executed on?
The sys.operations table contains _node['name'] and _node['id'] columns. You can join sys.jobs and sys.operations on jobs.id = operations.job_id.
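A minimal sketch of such a join (it assumes the columns described above; an outer join keeps jobs that have no operations, which then show up with a NULL node):

SELECT j.id, j.stmt, o.name AS operation, o._node['name'] AS node_name
FROM sys.jobs j
LEFT JOIN sys.operations o ON o.job_id = j.id
ORDER BY j.started;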
Something is strange... those 5 stuck UPDATE statements do not have entries in sys.operations (_node is null via JOIN).
Also saw a stuck job from today with the following statement (could be the Admin UI ?):
SELECT table_name, schema_name, format('%s.%s', schema_name, table_name) AS fqn, _node['id'] AS node_id, state, routing_state, relocating_node, count(*), "primary", sum(num_docs), avg(num_docs), sum(size) FROM sys.shards GROUP BY table_name, schema_name, fqn, node_id, state, routing_state, relocating_node, "primary"
Something is strange... those 5 stuck UPDATE statements do not have entries in sys.operations (_node is null via JOIN).
Can you show us the exact update statements? There are different execution plans for UPDATE statements. If the UPDATE has a WHERE pk_column = ? filter, it's executed in a way that doesn't result in sys.operations entries.
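For illustration (table and column names are hypothetical): an update filtering on the primary key takes the direct-lookup plan and leaves no sys.operations entries, while one filtering on a regular column presumably goes through a collect phase and should show up there:

UPDATE my_table SET status = 'done' WHERE id = 'some-pk-value';    -- pk lookup, no sys.operations entries
UPDATE my_table SET status = 'done' WHERE status = 'pending';      -- non-pk filter, visible in sys.operations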
Also saw a stuck job from today with the following statement (could be the Admin UI ?):
This is one of the queries I was referring to that shouldn't become stuck anymore in 2.3.3
Ah, I see.
All the UPDATE queries are like this:
UPDATE "tracknamic"."vehicle" SET "group_id"=? , "make"=? , "model"=? , ... WHERE id = '6fe97252-deac-468b-8412-3364cbb36405'
@rps-v
I've got a few more questions regarding the stuck update statements:
1) Are there any errors in the logs?
2) What kind of client are you using - is it postgres based, and are you by chance using prepared statements?
3) Do you know if the client received responses for these statements, or did the client get stuck / run into a timeout?
@mfussenegger
@rps-v
So far I'm not able to reproduce this, nor can I see anything in the code that looks suspicious.
Did the number of stuck entries grow since you initially reported this?
but the nodes haven't been restarted since about 2 days ago (a few crashed).
Did the crashes happen around the time when the queries got stuck?
@mfussenegger
Since I reported this, no other stuck UPDATE queries; however, there were some from sys tables (which you said is fixed in 2.3.3).
No, the crashes occurred 2 days before the queries got stuck.
We will try to update to 2.3.3 and report if it happens again. I think the issue can be closed for now.
k thx, let us know once you've upgraded.
Environment:
Crate 1.1.6
Amazon Linux AMI 4.9.38-16.33.amzn1.x86_64
6 AWS t2.small nodes, in 3 availability zones
Problem description: I posted this issue on Slack and was instructed to create an issue here.
Seeing quite a few (over 100) entries in sys.jobs for statements with old start dates ... up to a week ago. The application which started the query is definitely not running and I don't seem to be able to KILL the job either (nothing happens).
Currently 102 jobs. Killing a couple of them using KILL didn't change anything (the job still shows up in sys.jobs).
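For reference, the KILL attempts were along these lines (the UUID is a placeholder taken from sys.jobs.id):

KILL '3bfe38b0-5cc6-44d0-8eec-5e637dfbaba0';   -- kill a single job by its id
KILL ALL;                                      -- or kill everything that is still registered as running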
I've tried running a few of the queries found in sys.jobs and they all run in under a second. Nothing noteworthy is in the logs. Node CPU looks normal.
I haven't restarted the nodes in an attempt to maintain this state. What else should I look for? Thanks.