elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Leaking file handles? #5189

Closed · consulthys closed this issue 10 years ago

consulthys commented 10 years ago

I know this issue has been raised a couple of times in the past, but none of those reports has led to a solution for me so far. I've also seen similar issues being raised in the forums.

We're running ES 0.90.7 on a two-node EC2 cluster with the following settings:

We have five different indices (13GB, 7GB, 4GB, 22MB, 4MB) and run about 150K indexing operations per day (mostly new docs, but also some updates). Depending on the day, we run between 60K and 120K search operations.

Here is a recurring pattern that has been happening consistently over the past couple of months and which forces us to restart the cluster every 3-4 days or so. The charts below show the evolution of the heap size, the process CPU and the open file handles from one restart (~12PM on Feb 15th) to the next one (~8PM on Feb 19th). Initially, everything runs smoothly. Node 2 was the master in that run and was the one processing the ActiveMQ river.

Then on Feb 18th around 12PM (orange vertical bar), something starts happening. From that point on, the same things keep recurring on the master node:

[screenshot: capture d'écran 2014-02-20 à 05 26 24]

Everything keeps running more or less smoothly for a day or so, up until the file handles are exhausted. Just before that happens, we restart the problematic node (green vertical bar), the second node becomes the master, and the same pattern repeats on that node.

Another thing worth noting is that about 90% of the open file handles are marked (deleted), and 99% of those belong to the biggest index, i.e. the 13GB one.

We're going to upgrade to 0.90.11 soon to see if Lucene 4.6.1 is of any help, but in the meantime we'd appreciate any kind of insight as to why this is happening.

spinscale commented 10 years ago

Can you run a hot_threads operation when the CPU is high and attach the output? It will show which Elasticsearch Java classes use the most CPU. Do you hit your system with scan search queries? That could be a reason for open file handles on deleted files.
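
For reference, here is a minimal sketch of what a scan search looks like with the 0.90.x Java API (the index name and the processing loop are made up for illustration); each open scroll keeps a searcher, and therefore the segment files it references, pinned until the scroll is exhausted or times out:

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.search.SearchType;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.index.query.QueryBuilders;

public class ScanSearchExample {
    public static void scanAll(Client client) {
        SearchResponse scroll = client.prepareSearch("my_index")   // hypothetical index name
                .setSearchType(SearchType.SCAN)                    // scan: no scoring, hits fetched via scroll
                .setScroll(TimeValue.timeValueMinutes(1))          // the searcher stays open for this long
                .setQuery(QueryBuilders.matchAllQuery())
                .setSize(100)
                .execute().actionGet();

        while (true) {
            scroll = client.prepareSearchScroll(scroll.getScrollId())
                    .setScroll(TimeValue.timeValueMinutes(1))
                    .execute().actionGet();
            if (scroll.getHits().getHits().length == 0) {
                break;                                             // the scroll context is released once it expires
            }
            // process the hits here
        }
    }
}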

I don't know anything about the ActiveMQ river. Does this also happen when it is switched off? Or does it also happen on a node when the river is running on the other one?

consulthys commented 10 years ago

Thanks for your answer Alexander. We've been monitoring hot threads regularly, but nothing has struck us so far. We will look again very closely over the next 2-3 days and collect samples regularly throughout the critical time period.

We do not use scan search queries, only the default one (query_then_fetch). That said, since most of our queries return no docs (i.e. size=0) and are mostly run for faceting purposes, we could switch to search_type=count for those; that will probably improve things a bit.
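
For illustration, a minimal sketch of such a facet-only request switched to search_type=count, using the 0.90.x Java API (the index, facet and field names are made up):

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.search.SearchType;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.facet.FacetBuilders;

public class CountSearchExample {
    public static SearchResponse facetOnly(Client client) {
        return client.prepareSearch("my_index")               // hypothetical index name
                .setSearchType(SearchType.COUNT)              // skips the fetch phase entirely
                .setQuery(QueryBuilders.matchAllQuery())
                .addFacet(FacetBuilders.termsFacet("by_tag")  // hypothetical facet name
                        .field("tag"))                        // hypothetical field name
                .execute().actionGet();
    }
}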

We cannot switch off the river, unfortunately. It does, however, seem to consistently happen on the node which is running the river (whether that node is the master or not).

s1monw commented 10 years ago

Just a blind guess, but this looks like a big merge - do you have stats on used disk space during that period? Do you have any non-standard configuration, and are you doing anything special during that time?

consulthys commented 10 years ago

We have about 250GB of available disk space on those machines, and ES is the only thing running on them. So, subtracting system files and our 25GB of indexed data, that leaves about 200GB of free space.

We have a pretty standard configuration; we have only tweaked the heap allotted to the filter and fielddata caches to better handle our situation. Aside from that, nothing else is running.

I'll watch the disk space and other processes again the next time it happens and report the findings here.

bleskes commented 10 years ago

As an experiment - is it possible for you to add a third node (potentially on the same server as one of the current nodes), turn off its master and data roles, and force the river to be allocated to it? This would separate the river functionality from the other nodes and allow us to see whether the problem is river-specific.

The config on the river node (in the elasticsearch.yml) should be something like

node.master: false
node.data: false

node.river: rabbitmq

On the other two nodes, add the following to the elasticsearch.yml:

node.river: _none_

You will have to restart the nodes for this to take effect.

consulthys commented 10 years ago

That's indeed one of the measures we were planning to put in place. We will try that as well. Thanks guys!!

bleskes commented 10 years ago

Just note that this is a temporary solution for research purposes, as it makes the river node a single point of failure - the river will never be re-assigned upon a failure of that node.

consulthys commented 10 years ago

Quick note to point out that we've taken two measures so far:

Both of these measures have had no effect so far, as the same pattern is repeating.

We have yet to try the experiment @bleskes brought up above, but it's not trivial to do in production with minimal interruption. Stay tuned...

s1monw commented 10 years ago

any updates on this one?

consulthys commented 10 years ago

Nope, we haven't been able to run the experiment described by @bleskes yet, but we will soon.

One thing I'm eager to discover is whether

  1. the GC starts going south (and the CPU as well) because the file handles start leaking, or
  2. the file handles start leaking because the GC starts going south (and the CPU as well).

Judging by the above charts, it looks like the GC starts misbehaving (i.e. the heap can no longer be reclaimed properly) before the file handle count starts increasing, but nothing is clear-cut.

consulthys commented 10 years ago

Here's another interesting capture from yesterday, when we reindexed 10M+ documents. The red node is initially the master and runs the river. We let it proceed until the heap reached 99%, at which point we restarted it (green vertical line) and the blue node became master.

What we can see is that both nodes were actively indexing documents, yet the open file handle count only increased on the master.

Also, when the blue node became master and picked up the river, its file handle count increased much more quickly, probably because its heap was already around 70% and kept increasing steadily up to 99%, at which point we restarted it and the red node became master again.

The dashed gray lines show that whenever the file handle count crosses a certain threshold and keeps increasing, the CPU goes way up. Or maybe, as commented earlier, it's the other way around, i.e. whenever the CPU goes way up (because the heap is stressed and GC can't do its job), the file handles are no longer recycled and keep increasing.

It also looks like the heap on the master node never has as much leeway as the heap on the worker node, i.e. no "sawtooth" pattern.

[screenshot]

dadoonet commented 10 years ago

I took a quick look at the ActiveMQ river plugin and it looks like it creates a new bulk request for every new message: https://github.com/domdorn/elasticsearch-river-activemq/blob/master/src/main/java/org/elasticsearch/river/activemq/ActiveMQRiver.java#L263

Looks bad to me.

Maybe I misread the code, though.
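
For context, here is a minimal sketch of the batching one would normally expect instead, i.e. accumulating several queue messages into a single bulk request (the message list, index and type names are hypothetical; this is not the river's actual code):

import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.Client;

import java.util.List;

public class BatchedIndexer {
    public static void indexBatch(Client client, List<String> jsonDocs) {
        BulkRequestBuilder bulk = client.prepareBulk();   // one bulk request for the whole batch
        for (String json : jsonDocs) {
            bulk.add(client.prepareIndex("my_index", "my_type").setSource(json));
        }
        if (bulk.numberOfActions() > 0) {
            BulkResponse response = bulk.execute().actionGet();
            if (response.hasFailures()) {
                // handle or log the failed items here
            }
        }
    }
}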

consulthys commented 10 years ago

Thank you David, I appreciate your input. That may well be a good point. However, the ActiveMQ river plugin was forked from the official RabbitMQ river plugin, which does pretty much the same thing: https://github.com/elasticsearch/elasticsearch-river-rabbitmq/blob/master/src/main/java/org/elasticsearch/river/rabbitmq/RabbitmqRiver.java#L323

We'll soon know whether that's an issue; the third node has been prepared and the experiment is in progress. Stay tuned...

consulthys commented 10 years ago

We've run the experiment that @bleskes described above. The problem doesn't seem to be tied to the river (Node 3 in green), as the count of (deleted) open file handles still increases on the master node (Node 2 in blue).

I'm curious what makes the master node special, i.e. why the file handles keep leaking on it but not on the worker node (Node 1 in red).

Another interesting piece of info is that the deleted docs ratio on both nodes is between 30% and 35%, which seems like a lot. The HQ plugin hints at slow IO when that ratio goes over 25%.

We're also making heavy use of the filter and fielddata caches (both set to use only 30% of the heap), but I'm not yet sure if or how this could be an issue.

I appreciate any further insights. Thanks.

[screenshot: capture d'écran 2014-03-17 à 12 30 15]

bleskes commented 10 years ago

@consulthys thx for the input. This indeed suggests the river is not the issue. I wonder if you have some background process that continuously checks some master-level info (i.e., cluster metadata) such as cluster health. Maybe that one doesn't properly close its connections.

consulthys commented 10 years ago

We don't have anything but ES running on those nodes. We do check the cluster health remotely via HTTP (mostly to produce the graphs above), but the vast majority of the (deleted) open file handles are not socket handles; they are handles to Lucene index files (lots of cfs, fdt, etc.) and other Lucene segment files (doc, tim, pos, pay, etc.).

consulthys commented 10 years ago

To shed some more light on the file handles I mentioned earlier, here is a quick breakdown of the current set of 12,194 open (deleted) file handles that we have right now:

It looks somewhat like an old Lucene bug (https://issues.apache.org/jira/browse/LUCENE-2762), but I'd assume that one is fixed by now. I have yet to dive into the ES code that uses IndexReader/Writer/Searcher, but I'd appreciate any kind of insight regarding this.
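
For what it's worth, here is a minimal, self-contained Lucene 4.x sketch (not taken from our setup; the path and field name are made up) of the mechanism we seem to be seeing: a reader that is never closed keeps handles to segment files even after a merge has deleted them from disk:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

import java.io.File;

public class DeletedButOpenDemo {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(new File("/tmp/demo-index"));   // hypothetical path
        IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(Version.LUCENE_46, new StandardAnalyzer(Version.LUCENE_46)));

        Document doc = new Document();
        doc.add(new TextField("body", "hello world", Field.Store.NO));
        writer.addDocument(doc);
        writer.commit();                 // first segment
        writer.addDocument(doc);
        writer.commit();                 // second segment

        // Open a reader on the current segments and "forget" to close it.
        DirectoryReader reader = DirectoryReader.open(dir);

        // Force a merge: the two old segments are merged into one and their files
        // are removed from the directory, but the still-open reader keeps file
        // handles to them, so lsof reports them as open (deleted) until
        // reader.close() is called.
        writer.forceMerge(1);
        writer.commit();

        // reader.close();   // <-- without this, the handles stay around
        writer.close();
    }
}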

Also, could it be that our filter cache is not sized adequately (currently 30% of a 9GB heap)? When I look at the charts below, which show the evolution of the filter and fielddata cache sizes during the experiment, it looks like the filter cache could use some more space. I'm not sure if this has any impact at all, but I'm just pointing it out.

[screenshot: capture d'écran 2014-03-18 à 05 15 47]

bleskes commented 10 years ago

@consulthys thx for the info. Although we typically only discuss issues here (and answer more general questions on the mailing list), I figured you've earned the question :) - look at the number of evictions from your filter cache and at your query performance (do you see spikes?) to decide whether you want to give more space to your filter cache.

That said - let's go back to the file handles. It's not uncommon to have so many file handles, and that's OK. What I don't understand is why the master node seems to have its file handles grow during indexing and then just drop (same goes for the high CPU usage). In the last image you posted, both nodes look the same. I would be interested in the same kind of analysis of the master node when its count is unusually high. Also, can you give the output of GET _nodes/hot_threads at the same time? This may give us insight into what it is actually doing.

consulthys commented 10 years ago

Thank you @bleskes.

I'm OK with having a high file handle count on any node too, as long as we know that it will decrease at some point. The thing is that it just never decreases. When you see a drop, it's because we had to restart the node. So far, we've basically never seen the file handle count decrease at all on the master; it is always steadily increasing, and it is only a matter of time until the resources (CPU, RAM, file handles, etc.) get exhausted. Even though a couple of thousand new documents are being indexed every couple of seconds, I'd still expect the file handle count to grow and shrink as needed, I'd expect to see the count shrink eventually, and most of all, I'd expect the (deleted) file handles to really be released at some point. This tells me that something within ES (or Lucene) must still be referencing those files, and that's the big unknown for me so far.

I'm going to dig up some more info (hot threads et al) and get back here as soon as I have something.

consulthys commented 10 years ago

Moreover, if you think I should move this to the mailing list until we figure out whether this is an issue or not, please let me know. My gut tells me that there's something here, and I can hardly believe I'm the only one witnessing this kind of behavior.

bleskes commented 10 years ago

We have all the context here already, so I think we can keep it here. Let me know when you have news.

consulthys commented 10 years ago

As discussed, I'm providing some new insights on the hot threads running on the master node. What I did was call _nodes/hot_threads on the master node every minute from the beginning of the last run until we had to restart the master a couple of minutes ago (i.e. a time span of ~34h). That gives us ~5.3K samples to work with. From those, I filtered out all threads whose CPU usage is below 10%, as well as the threads from the [management] pool, as they are not that interesting. That leaves us with 83 hot-thread samples whose CPU usage ranges from 10% to 80%+; let's call them "very hot threads". Let me know if you need all 5K samples, but at a quick glance, I think 95% of them are just noise.

What we can see is that the very hot threads appearing most often come (unsurprisingly) from the following pools:

Note: For the latter group, you'll see I've renamed the index on which the merge operation is being carried out to "biggest_index", "2nd_biggest_index" and "3rd_biggest_index", because the real index names as such don't really mean anything without more business context.

You can find the gist here: https://gist.github.com/consulthys/49fd0fb6ad71222f4a75

Let's see if we can find any meaningful nugget of information in there.

Also, for reference purposes, here is a discussion I found on the mailing list that deals with a very similar topic, but has no answers so far: https://groups.google.com/forum/?fromgroups#!searchin/elasticsearch/open$20file$20handle/elasticsearch/i-qpbSj1Qrc/R84KR50d7OwJ

consulthys commented 10 years ago

To put the hot_threads output I shared earlier into perspective with the stats charts, I've also added to the chart the thread pools we're interested in (bulk, refresh, etc.) and their usage percentage, along the same timeline as the other charts. The result is shown below. What is immediately apparent is that the refresh and Lucene merge threads get extra busy at the end (far right), which ultimately led to a restart of the master node (in red) since the heap was exhausted.
[screenshot: capture d'écran 2014-03-24 à 10 19 23]

I'm also attaching a zoom on the last 90 minutes of hot-threads activity:
[screenshot: capture d'écran 2014-03-24 à 10 30 36]

bleskes commented 10 years ago

I've looked at the hot threads dumps and also at the latest charts (where the filter evictions + CPU are interesting). It seems you are using a plugin to add update-by-query functionality: https://github.com/yakaz/elasticsearch-action-updatebyquery . If I read the code correctly, that one runs all of its searches on the primary shards, which puts those under heavy load. This makes me wonder whether the issue is not so much which node is master, but which node holds the primary shards. As soon as the node is restarted, the shards on the other node are promoted to primary, which would explain the pattern you see.

As an experiment - can you disable that plugin and see if this helps?

consulthys commented 10 years ago

Thanks for your input. We're indeed using the updatebyquery plugin (mainly because of https://github.com/elasticsearch/elasticsearch/issues/1607 and https://github.com/elasticsearch/elasticsearch/issues/2230). Unfortunately, we cannot disable that plugin in production.

On "master vs primary", I think that would make perfect sense. Most of the time, all primary shards are on the master since we're only bringing down one node at a time and waiting for full cluster recovery before restarting the second node. Doing so will always put primary shards on the master, indeed. But, I do remember having restarted both nodes at the same time and seeing the primary shards being distributed among both nodes. When that happened (see chart below), the "open files count" for the worker node wasn't flat as in the latter chart above, but instead the count was also increasing as on the master.

[screenshot: capture d'écran 2014-03-24 à 11 37 37]

So, we're zeroing in on something here. It definitely looks like "master node or not" is not the real issue, but "primary shard or not" is, especially in the context of using the updatebyquery plugin, which does all its work on the primary shards. Maybe @martijnvg and potentially @ofavre could shed some light on this.

As a side note, do you happen to have any insider info on the timeline for completing #2230?

consulthys commented 10 years ago

Looking at the source code of TransportShardUpdateByQueryAction.java, I'm wondering whether the index searcher/reader is being properly closed/released in all situations, as has been pointed out in other discussions.
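
To make the concern concrete, here is a minimal sketch (plain Lucene, not the plugin's actual code) of the acquire/release discipline that has to hold on every code path; if release() is skipped on any branch, the underlying segment files stay open:

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.SearcherManager;
import org.apache.lucene.search.TopDocs;

public class SearcherDiscipline {
    public static int countAll(SearcherManager manager) throws Exception {
        IndexSearcher searcher = manager.acquire();
        try {
            TopDocs hits = searcher.search(new MatchAllDocsQuery(), 1);
            return hits.totalHits;        // an early return is fine: the finally block still runs
        } finally {
            manager.release(searcher);    // must happen on every path, even when nothing matches
        }
    }
}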

martijnvg commented 10 years ago

@consulthys Yes, there might be a bug there. I'm not sure yet; it has been a long time since I wrote that code and a lot has changed since then (in the ES code base). The searcher should be released once the SearchContext is released. I'm rebasing the updatebyquery branch now and I'll double-check whether the searcher is released properly.

The query is executed only on the primary shards, so that the re-index operations can happen immediately on the primary shards and then be sent to the replicas without executing the query there as well.

The reason that update by query has never been pushed into ES is that an update by query operation can be long-running, and there is no way of finding out whether an update by query is running and then cancelling it.

ofavre commented 10 years ago

@martijnvg You'd be better off working from the plugin, as it has already been rebased a few times and a few additions have been made to your original code. Thanks!

martijnvg commented 10 years ago

The update by query code doesn't close the SearchContext when no documents on a shard match the query (which is likely to happen) or when no docs at all match the specified query.

In the TransportShardUpdateByQueryAction#doExecuteInternal() method, in the following if statement:

if (docsToUpdateCount == 0) {
   ....
}

The following statement should be added in order to close the search context and release any resources associated with it:

searchContext.release();
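
For clarity, here is a sketch of how that early-exit branch might look with the fix in place; it is based only on the if statement quoted above, and the rest of the branch body is elided/hypothetical:

if (docsToUpdateCount == 0) {
    // No documents on this shard match the query: release the SearchContext
    // before bailing out, otherwise its searcher keeps (deleted) segment
    // files open indefinitely.
    searchContext.release();
    // ... respond with an empty result for this shard (elided) ...
}

Wrapping the whole method body in a try/finally that releases the context would be an even more robust variant.
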
consulthys commented 10 years ago

@martijnvg and @ofavre I've pulled the 1.4 branch of the updatebyquery plugin, applied the proposed one-liner, repackaged the plugin and restarted both nodes. I'm monitoring right now and we'll see how it goes today, but one great piece of news is that I can already see the count of deleted file handles decrease and go down to 0, which it never ever did before. The open files count is currently the same on both nodes and steady at around 2K after 3 hours of operation.

@bleskes Thanks for showing the direction ;)

martijnvg commented 10 years ago

@consulthys Good to hear, keep us updated.

bleskes commented 10 years ago

Glad to hear it all works now. I'm closing this ticket, as this is not an ES issue. I suggest that if something pops up you open a new issue on the plugin repo. Cheers.

consulthys commented 10 years ago

Thanks to everyone who contributed. Much appreciated! The devil is in the details; as it turns out, a single missing one-liner was causing all this.

@bleskes next time I'll make sure to post to the mailing list first, thanks for your understanding ;)

s1monw commented 10 years ago

Thanks for your patience and for bringing up the issue!!