Closed dennisgorelik closed 8 years ago
@dennisgorelik Sorry for replying so late. The percolator uses a special in-memory index and this index doesn't support doc values. The doc values workaround I shared in #6806 is incorrect.
In theory doc values support could be added to this in-memory index. What is the reason behind using doc values in the percolator? The performance should be better since #7081 got in.
@martijnvg The only reason why we even tried doc values was your recommendation (that now you are saying is incorrect).
The performance should be better since #7081 got in.
Yes, percolator's performance improved ~33% when we migrated from ElasticSearch 1.2.2 to ElasticSearch 1.4.3.
We are still looking for further performance improvements and better stability of the ElasticSearch percolator under high load. Currently, too much stress (multiple queues) over a long period causes ElasticSearch to stop producing results entirely. I guess the solution is adding multiple ElasticSearch nodes.
http://stackoverflow.com/questions/28864184/how-to-improve-percolator-performance-in-elasticsearch
@dennisgorelik The percolator is CPU intensive and adding more nodes will help to distribute the percolator load better. How many queries do you have stored and how many percolate requests are you executing concurrently?
We have about 160,000 queries stored. We created 2 shards on our JobIndex that we run the percolator against. We have up to 4 concurrent percolate requests. In addition, there can be other ElasticSearch queries (e.g. searches coming from our web server plus some other batch searches). All of that runs on a "Dual Processor Octo Core Xeon 2650 - 2GHz+HT" - that's 32 CPU threads. We have a single-node ElasticSearch. We use ES_HEAP_SIZE=6g. In total, ElasticSearch consumes up to 12GB on our server.
Occasionally (about once every few days) ElasticSearch enters a "stupor" mode where it consumes all available CPU (e.g. 90%), with the following errors in the logs:
[2015-03-24 00:08:01,508][WARN ][monitor.jvm ] [Cancer] [gc][old][416159][31731] duration [34s], collections [1]/[34.5s], total [34s]/[5.6h], memory [5.7gb]->[4.1gb]/[5.8gb], all_pools {[young] [1.4gb]->[11.4mb]/[1.4gb]}{[survivor] [153.4mb]->[0b]/[191.3mb]}{[old] [4.1gb]->[4.1gb]/[4.1gb]}
An ElasticSearch reboot helps.
Any suggestions about how we can prevent such an ElasticSearch stupor?
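For what it's worth, the warning line above already contains the diagnosis. A small sketch (Python, parsing the exact log line quoted above) that pulls out the old-generation pool figures:

```python
import re

# The GC warning line from the Elasticsearch log quoted above
line = ("[gc][old][416159][31731] duration [34s], collections [1]/[34.5s], "
        "total [34s]/[5.6h], memory [5.7gb]->[4.1gb]/[5.8gb], all_pools "
        "{[young] [1.4gb]->[11.4mb]/[1.4gb]}{[survivor] [153.4mb]->[0b]/[191.3mb]}"
        "{[old] [4.1gb]->[4.1gb]/[4.1gb]}")

# Extract the old-generation pool figures: before -> after / max
m = re.search(r"\{\[old\] \[([\d.]+)gb\]->\[([\d.]+)gb\]/\[([\d.]+)gb\]\}", line)
before, after, maximum = map(float, m.groups())

# The 34s collection freed essentially nothing and the pool sits at its
# ceiling: a classic sign the live data set no longer fits the heap.
print(before, after, maximum)
```

Here the old generation stays at 4.1gb out of 4.1gb after a full collection, which is why the JVM keeps re-collecting and burning CPU.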
There can be many reasons why you're running low on heap space and seeing long GCs. It doesn't have to be percolator related. Can you check the node stats API to see if something else is using up the heap space?
@martijnvg What kind of node stats are you asking for? Something like this?
curl -X GET "http://elasticsearchserver:9200/_nodes/stats/jvm?pretty=true"
{
"cluster_name" : "elasticsearch",
"nodes" : {
.....
"jvm" : {
"timestamp" : 1427258626635,
"uptime_in_millis" : 55962324,
"mem" : {
"heap_used_in_bytes" : 3744527576,
"heap_used_percent" : 59,
"heap_committed_in_bytes" : 6241845248,
"heap_max_in_bytes" : 6241845248,
"non_heap_used_in_bytes" : 1371365808,
"non_heap_committed_in_bytes" : 1805742080,
"pools" : {
"young" : {
"used_in_bytes" : 119134544,
"max_in_bytes" : 1605304320,
"peak_used_in_bytes" : 1605304320,
"peak_max_in_bytes" : 1605304320
},
"survivor" : {
"used_in_bytes" : 637400,
"max_in_bytes" : 200605696,
"peak_used_in_bytes" : 200605696,
"peak_max_in_bytes" : 200605696
},
"old" : {
"used_in_bytes" : 3624755632,
"max_in_bytes" : 4435935232,
"peak_used_in_bytes" : 3721644784,
"peak_max_in_bytes" : 4435935232
}
}
},
"threads" : {
"count" : 449,
"peak_count" : 451
},
"gc" : {
"collectors" : {
"young" : {
"collection_count" : 51319,
"collection_time_in_millis" : 1035200
},
"old" : {
"collection_count" : 966,
"collection_time_in_millis" : 896951
}
}
},
"buffer_pools" : {
"direct" : {
"count" : 585,
"used_in_bytes" : 53303365,
"total_capacity_in_bytes" : 53303365
},
"mapped" : {
"count" : 669,
"used_in_bytes" : 9273195645,
"total_capacity_in_bytes" : 9273195645
}
}
}
}
}
}
This is node stats API output during normal (healthy) operation. Should I collect the same information when my ElasticSearch node experiences problems?
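The fields that matter most in that response can be extracted programmatically for periodic logging. A minimal sketch; the sample dict mirrors the response shape above, while in practice the JSON would come from the `/_nodes/stats/jvm` request shown earlier:

```python
def heap_summary(stats):
    """Return (node_id, heap_used_percent, old_gen_fill_ratio) per node."""
    out = []
    for node_id, node in stats["nodes"].items():
        mem = node["jvm"]["mem"]
        old = mem["pools"]["old"]
        out.append((node_id,
                    mem["heap_used_percent"],
                    old["used_in_bytes"] / old["max_in_bytes"]))
    return out

# Trimmed sample mirroring the healthy node stats shown above
sample = {
    "nodes": {
        "node1": {
            "jvm": {
                "mem": {
                    "heap_used_percent": 59,
                    "pools": {
                        "old": {"used_in_bytes": 3624755632,
                                "max_in_bytes": 4435935232},
                    },
                },
            },
        },
    },
}

for node_id, pct, old_fill in heap_summary(sample):
    # An old generation that stays near its maximum between collections is
    # an early warning that the next full GC may not keep up.
    print(node_id, pct, round(old_fill, 2))
```

Tracking the old-generation fill ratio over time (rather than one snapshot) is what makes the difference between "healthy" and "about to enter stupor" visible.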
@dennisgorelik Yes, the node stats API, but preferably all the information, captured while you're seeing the JVM GC warnings. The following request should include all information and show sizes in a human-readable way:
curl -XGET "http://elasticsearchserver:9200/_nodes/stats?pretty=true&all&human"
What can also be helpful when JVM problems occur is a JVM heap dump, which you can share with me privately.
@martijnvg 1) When our ElasticSearch server hangs again, we'll send you the full node stats API response (based on your command). 2) How do I produce a JVM heap dump for ElasticSearch on Windows? I tried:
jmap -dump:live,format=b,file=heap.bin 2400
and got:
2400: Access is denied
2400 is the PID of elasticsearch-service-x64.exe on our server.
@martijnvg Our ElasticSearch server entered the half-dead state again. Basic search queries were taking ~15s (instead of the usual 0 to 30 ms). Percolation timed out. Removing almost all load from the ElasticSearch server for several minutes did NOT help. Only an ElasticSearch reboot helped it recover.
Prior to entering the half-dead state, ElasticSearch had been running percolation non-stop for ~6 hours.
See the full node stats API result for our ElasticSearch in the half-dead state:
{
"cluster_name" : "elasticsearch",
"nodes" : {
"gzJr6VoQSGK7mTr7MjfYGQ" : {
"timestamp" : 1427771500284,
.....
"indices" : {
"docs" : {
"count" : 2397045,
"deleted" : 400923
},
"store" : {
"size" : "8.9gb",
"size_in_bytes" : 9651063464,
"throttle_time" : "1.2m",
"throttle_time_in_millis" : 73703
},
"indexing" : {
"index_total" : 792427,
"index_time" : "39.5m",
"index_time_in_millis" : 2371584,
"index_current" : 0,
"delete_total" : 1917090,
"delete_time" : "2.2m",
"delete_time_in_millis" : 135614,
"delete_current" : 0,
"noop_update_total" : 0,
"is_throttled" : false,
"throttle_time" : "0s",
"throttle_time_in_millis" : 0
},
"get" : {
"total" : 4777,
"get_time" : "4.1s",
"time_in_millis" : 4192,
"exists_total" : 4777,
"exists_time" : "4.1s",
"exists_time_in_millis" : 4192,
"missing_total" : 0,
"missing_time" : "0s",
"missing_time_in_millis" : 0,
"current" : 0
},
"search" : {
"open_contexts" : 18,
"query_total" : 9285622,
"query_time" : "2.9d",
"query_time_in_millis" : 253663579,
"query_current" : 8,
"fetch_total" : 8472373,
"fetch_time" : "5.6h",
"fetch_time_in_millis" : 20164442,
"fetch_current" : 0
},
"merges" : {
"current" : 0,
"current_docs" : 0,
"current_size" : "0b",
"current_size_in_bytes" : 0,
"total" : 4148,
"total_time" : "39.2m",
"total_time_in_millis" : 2353027,
"total_docs" : 22699407,
"total_size" : "15.2gb",
"total_size_in_bytes" : 16400873095
},
"refresh" : {
"total" : 44893,
"total_time" : "22.2m",
"total_time_in_millis" : 1332971
},
"flush" : {
"total" : 3138,
"total_time" : "2.2m",
"total_time_in_millis" : 137586
},
"warmer" : {
"current" : 0,
"total" : 87566,
"total_time" : "10s",
"total_time_in_millis" : 10033
},
"filter_cache" : {
"memory_size" : "33.4mb",
"memory_size_in_bytes" : 35027716,
"evictions" : 20064952
},
"id_cache" : {
"memory_size" : "0b",
"memory_size_in_bytes" : 0
},
"fielddata" : {
"memory_size" : "33.9mb",
"memory_size_in_bytes" : 35613020,
"evictions" : 0
},
"percolate" : {
"total" : 1577725,
"get_time" : "20.4d",
"time_in_millis" : 1769513471,
"current" : 27,
"memory_size_in_bytes" : -1,
"memory_size" : "-1b",
"queries" : 366200
},
"completion" : {
"size" : "0b",
"size_in_bytes" : 0
},
"segments" : {
"count" : 186,
"memory" : "20.4mb",
"memory_in_bytes" : 21488188,
"index_writer_memory" : "1.7mb",
"index_writer_memory_in_bytes" : 1863332,
"index_writer_max_memory" : "595.2mb",
"index_writer_max_memory_in_bytes" : 624184520,
"version_map_memory" : "24.8kb",
"version_map_memory_in_bytes" : 25396,
"fixed_bit_set" : "0b",
"fixed_bit_set_memory_in_bytes" : 0
},
"translog" : {
"operations" : 3439,
"size" : "20.6kb",
"size_in_bytes" : 21152
},
"suggest" : {
"total" : 0,
"time" : "0s",
"time_in_millis" : 0,
"current" : 0
},
"query_cache" : {
"memory_size" : "0b",
"memory_size_in_bytes" : 0,
"evictions" : 0,
"hit_count" : 0,
"miss_count" : 0
}
},
"os" : {
"timestamp" : 1427771501003,
"uptime" : "9.4m",
"uptime_in_millis" : 568846,
"cpu" : {
"sys" : 0,
"user" : 22,
"idle" : 77,
"usage" : 22,
"stolen" : 0
},
"mem" : {
"free" : "909.6mb",
"free_in_bytes" : 953839616,
"used" : "31gb",
"used_in_bytes" : 33370476544,
"free_percent" : 3,
"used_percent" : 96,
"actual_free" : "1002.3mb",
"actual_free_in_bytes" : 1051021312,
"actual_used" : "30.9gb",
"actual_used_in_bytes" : 33273294848
},
"swap" : {
"used" : "30.5gb",
"used_in_bytes" : 32848777216,
"free" : "16.4gb",
"free_in_bytes" : 17689870336
}
},
"process" : {
"timestamp" : 1427771501018,
"open_file_descriptors" : 5960,
"cpu" : {
"percent" : 0,
"sys" : "5.2m",
"sys_in_millis" : 312838,
"user" : "7.3s",
"user_in_millis" : 7335,
"total" : "5.3m",
"total_in_millis" : 320173
},
"mem" : {
"resident" : "2.1gb",
"resident_in_bytes" : 2335469568,
"share" : "-1b",
"share_in_bytes" : -1,
"total_virtual" : "1.8gb",
"total_virtual_in_bytes" : 1965916160
}
},
"jvm" : {
"timestamp" : 1427771501018,
"uptime" : "6.5d",
"uptime_in_millis" : 568852806,
"mem" : {
"heap_used" : "5.6gb",
"heap_used_in_bytes" : 6104360960,
"heap_used_percent" : 97,
"heap_committed" : "5.8gb",
"heap_committed_in_bytes" : 6241845248,
"heap_max" : "5.8gb",
"heap_max_in_bytes" : 6241845248,
"non_heap_used" : "568.1mb",
"non_heap_used_in_bytes" : 595737264,
"non_heap_committed" : "1.1gb",
"non_heap_committed_in_bytes" : 1272410112,
"pools" : {
"young" : {
"used" : "1.4gb",
"used_in_bytes" : 1605304320,
"max" : "1.4gb",
"max_in_bytes" : 1605304320,
"peak_used" : "1.4gb",
"peak_used_in_bytes" : 1605304320,
"peak_max" : "1.4gb",
"peak_max_in_bytes" : 1605304320
},
"survivor" : {
"used" : "60.2mb",
"used_in_bytes" : 63160368,
"max" : "191.3mb",
"max_in_bytes" : 200605696,
"peak_used" : "191.3mb",
"peak_used_in_bytes" : 200605696,
"peak_max" : "191.3mb",
"peak_max_in_bytes" : 200605696
},
"old" : {
"used" : "4.1gb",
"used_in_bytes" : 4435934656,
"max" : "4.1gb",
"max_in_bytes" : 4435935232,
"peak_used" : "4.1gb",
"peak_used_in_bytes" : 4435935232,
"peak_max" : "4.1gb",
"peak_max_in_bytes" : 4435935232
}
}
},
"threads" : {
"count" : 471,
"peak_count" : 474
},
"gc" : {
"collectors" : {
"young" : {
"collection_count" : 556417,
"collection_time" : "2.1h",
"collection_time_in_millis" : 7611784
},
"old" : {
"collection_count" : 34576,
"collection_time" : "5.5h",
"collection_time_in_millis" : 20125610
}
}
},
"buffer_pools" : {
"direct" : {
"count" : 1127,
"used" : "61.9mb",
"used_in_bytes" : 64991049,
"total_capacity" : "61.9mb",
"total_capacity_in_bytes" : 64991049
},
"mapped" : {
"count" : 766,
"used" : "8.9gb",
"used_in_bytes" : 9608816056,
"total_capacity" : "8.9gb",
"total_capacity_in_bytes" : 9608816056
}
}
},
"thread_pool" : {
"percolate" : {
"threads" : 32,
"queue" : 0,
"active" : 27,
"rejected" : 0,
"largest" : 32,
"completed" : 85311
},
"bench" : {
"threads" : 0,
"queue" : 0,
"active" : 0,
"rejected" : 0,
"largest" : 0,
"completed" : 0
},
"listener" : {
"threads" : 10,
"queue" : 0,
"active" : 0,
"rejected" : 0,
"largest" : 10,
"completed" : 32291
},
"index" : {
"threads" : 32,
"queue" : 0,
"active" : 0,
"rejected" : 0,
"largest" : 32,
"completed" : 84
},
"refresh" : {
"threads" : 8,
"queue" : 0,
"active" : 0,
"rejected" : 0,
"largest" : 8,
"completed" : 44873
},
"suggest" : {
"threads" : 0,
"queue" : 0,
"active" : 0,
"rejected" : 0,
"largest" : 0,
"completed" : 0
},
"generic" : {
"threads" : 2,
"queue" : 0,
"active" : 0,
"rejected" : 0,
"largest" : 7,
"completed" : 69713
},
"warmer" : {
"threads" : 3,
"queue" : 0,
"active" : 0,
"rejected" : 0,
"largest" : 4,
"completed" : 42844
},
"search" : {
"threads" : 96,
"queue" : 0,
"active" : 8,
"rejected" : 0,
"largest" : 96,
"completed" : 20079486
},
"flush" : {
"threads" : 5,
"queue" : 0,
"active" : 0,
"rejected" : 0,
"largest" : 5,
"completed" : 43953
},
"optimize" : {
"threads" : 0,
"queue" : 0,
"active" : 0,
"rejected" : 0,
"largest" : 0,
"completed" : 0
},
"management" : {
"threads" : 5,
"queue" : 22,
"active" : 5,
"rejected" : 0,
"largest" : 5,
"completed" : 344629
},
"get" : {
"threads" : 32,
"queue" : 0,
"active" : 0,
"rejected" : 0,
"largest" : 32,
"completed" : 4778
},
"merge" : {
"threads" : 5,
"queue" : 0,
"active" : 0,
"rejected" : 0,
"largest" : 5,
"completed" : 46789
},
"bulk" : {
"threads" : 32,
"queue" : 0,
"active" : 0,
"rejected" : 0,
"largest" : 32,
"completed" : 42574
},
"snapshot" : {
"threads" : 0,
"queue" : 0,
"active" : 0,
"rejected" : 0,
"largest" : 0,
"completed" : 0
}
},
"network" : {
"tcp" : {
"active_opens" : 125109,
"passive_opens" : 1728257,
"curr_estab" : 706,
"in_segs" : 911340183,
"out_segs" : 1034136815,
"retrans_segs" : 767107,
"estab_resets" : 429327,
"attempt_fails" : 128973,
"in_errs" : 928,
"out_rsts" : 413521
}
},
"fs" : {
"timestamp" : 1427771501018,
"total" : {
"total" : "744.9gb",
"total_in_bytes" : 799925071872,
"free" : "510.4gb",
"free_in_bytes" : 548089225216,
"available" : "510.4gb",
"available_in_bytes" : 548089225216,
"disk_reads" : 50198038,
"disk_writes" : 131904978,
"disk_io_op" : 182103016,
"disk_read_size" : "1.1gb",
"disk_read_size_in_bytes" : 1199022592,
"disk_write_size" : "298.8mb",
"disk_write_size_in_bytes" : 313397248,
"disk_io_size" : "1.4gb",
"disk_io_size_in_bytes" : 1512419840,
"disk_queue" : "0"
},
"data" : [ {
"path" : "F:\\elasticsearch-1.4.3\\data\\elasticsearch\\nodes\\0",
"mount" : "F:\\",
"dev" : "F:\\",
"total" : "744.9gb",
"total_in_bytes" : 799925071872,
"free" : "510.4gb",
"free_in_bytes" : 548089225216,
"available" : "510.4gb",
"available_in_bytes" : 548089225216,
"disk_reads" : 50198038,
"disk_writes" : 131904978,
"disk_io_op" : 182103016,
"disk_read_size" : "1.1gb",
"disk_read_size_in_bytes" : 1199022592,
"disk_write_size" : "298.8mb",
"disk_write_size_in_bytes" : 313397248,
"disk_io_size" : "1.4gb",
"disk_io_size_in_bytes" : 1512419840,
"disk_queue" : "0"
} ]
},
"transport" : {
"server_open" : 13,
"rx_count" : 6,
"rx_size" : "1.3kb",
"rx_size_in_bytes" : 1356,
"tx_count" : 6,
"tx_size" : "1.3kb",
"tx_size_in_bytes" : 1356
},
"http" : {
"current_open" : 38,
"total_opened" : 5428
},
"breakers" : {
"request" : {
"limit_size_in_bytes" : 2496738099,
"limit_size" : "2.3gb",
"estimated_size_in_bytes" : 32880,
"estimated_size" : "32.1kb",
"overhead" : 1.0,
"tripped" : 0
},
"fielddata" : {
"limit_size_in_bytes" : 3745107148,
"limit_size" : "3.4gb",
"estimated_size_in_bytes" : 35616584,
"estimated_size" : "33.9mb",
"overhead" : 1.03,
"tripped" : 0
},
"parent" : {
"limit_size_in_bytes" : 4369291673,
"limit_size" : "4gb",
"estimated_size_in_bytes" : 35649464,
"estimated_size" : "33.9mb",
"overhead" : 1.0,
"tripped" : 0
}
}
}
}
}
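As a quick sanity check, the `jvm.uptime_in_millis` and `gc.collectors` figures in the stats above can be combined into a single GC-overhead ratio. A sketch, with the numbers copied from the output above:

```python
# Rough health check from the gc/uptime numbers in the stats above:
# what fraction of the JVM's life has been spent collecting garbage?
uptime_ms = 568852806    # jvm.uptime_in_millis (~6.5 days)
young_gc_ms = 7611784    # gc.collectors.young.collection_time_in_millis (~2.1h)
old_gc_ms = 20125610     # gc.collectors.old.collection_time_in_millis (~5.5h)

gc_fraction = (young_gc_ms + old_gc_ms) / uptime_ms
print(round(gc_fraction * 100, 1))  # percent of wall time spent in GC
```

Roughly 5% of total wall time spent in GC, with 5.5 hours of it in stop-the-world old-generation collections, is consistent with a heap that is too small for the working set; the 97% `heap_used_percent` in the same output points the same way.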
@dennisgorelik I see that you have 366200 queries stored in total. That by itself may take a big part of your JVM heap space. What kind of queries are you storing in the percolator? (Anything with a wildcard?) I assume that during these 6 hours you also added new percolator queries to ES?
The percolator parses the queries and keeps the parsed variants (Lucene queries) in JVM heap space. Some queries use more memory than others, but regardless of that a node can only hold a finite number of queries, so at some point you need to add another node. In your case, giving your current node more JVM heap space would also help.
@martijnvg 1) We have 150,909 abandoned job percolators (left over after we moved job percolators into a percolatejobs index that is separate from the underlying jobs index). We do not run any queries against these abandoned percolators. Do they still use ElasticSearch memory?
a node can only hold a finite amount of queries
How many percolator queries can an ElasticSearch node hold? Or, to put it another way: how much memory does a typical percolator query (without wildcards) take from the JVM heap?
2) We do not have any wildcard queries at all. Here is a typical example of the percolator queries we have:
"_source":{"query" : { "filtered" : { "query" : { "query_string" : { "fields" : ["JobTitle","JobDescription","CompanyName","Salary"], "query" : "\"QA.QC\" AND MANUFACTURE", "analyzer" : "pjfqueryanalyzer" } }, "filter" : { "and" : [{ "geo_distance" : { "distance" : "25mi", "Location" : {"lat":40.506772,"lon":-74.265423} }}] } } } , "SystemUpdateDate": 1406362400409}
3) Yes, during the 6 hours of percolation some percolator queries were added, but not many: probably fewer than 100 new queries. Overall, the ElasticSearch server had been running for 6 days since we last had to reboot it.
@dennisgorelik I see you have swap enabled. Do you have mlockall enabled as well? You can check with:
GET _nodes/process
If not, then this could be the cause of your slow garbage collections. See http://www.elastic.co/guide/en/elasticsearch/reference/current/setup-configuration.html#setup-configuration-memory for more details.
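A quick way to audit this across a cluster is to inspect the `mlockall` flag in the `_nodes/process` response. A sketch; the sample dict stands in for a live call to that endpoint:

```python
def nodes_without_mlockall(process_stats):
    """Return the ids of nodes where the memory lock did not take effect."""
    return [node_id
            for node_id, node in process_stats["nodes"].items()
            if not node["process"]["mlockall"]]

# Trimmed sample mirroring a GET _nodes/process response
sample = {
    "cluster_name": "elasticsearch",
    "nodes": {
        "XF2RV7JLSvOVpK7QAVPv9Q": {"process": {"mlockall": False}},
    },
}

# A node listed here despite bootstrap.mlockall: true in elasticsearch.yml
# means the lock failed (e.g. missing memlock ulimit on Linux, or the
# platform does not support it).
print(nodes_without_mlockall(sample))
```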
@clintongormley
We are running ElasticSearch on Windows Server 2012. Unfortunately, the article you suggested only describes how to disable memory swapping on Linux.
Do you know how to disable swap for ElasticSearch on Windows? That question has been left unanswered on StackOverflow for 8 months: http://stackoverflow.com/questions/25297682/how-to-avoid-elasticsearch-process-being-swapped-on-windows-servers
In elasticsearch.yml we set:
bootstrap.mlockall: true
However, under Windows ElasticSearch ignores that setting and reports mlockall=false:
GET _nodes/process
{
"cluster_name": "elasticsearch",
"nodes": {
"XF2RV7JLSvOVpK7QAVPv9Q": {
"name": "Armand Martel",
"transport_address": "inet[/xxx.xxx.xxx.xxx:9300]",
"host": "sv7731",
"ip": "xxx.xxx.xxx.xxx",
"version": "1.4.3",
"build": "36a29a7",
"http_address": "inet[/xxx.xxx.xxx.xxx:9200]",
"process": {
"refresh_interval_in_millis": 1000,
"id": 24116,
"max_file_descriptors": -1,
"mlockall": false
}
}
}
}
@dennisgorelik we're working on it for windows (see https://github.com/elastic/elasticsearch/pull/9186) but you can always disable the page file on windows for now.
@clintongormley 1) Do you mean disable the page file for all programs in Windows? http://lifehacker.com/5426041/understanding-the-windows-pagefile-and-why-you-shouldnt-disable-it
2) When do you think VirtualLock will be supported on Windows? A couple of months? A year?
1) Do you mean disable the page file for all programs in Windows? http://lifehacker.com/5426041/understanding-the-windows-pagefile-and-why-you-shouldnt-disable-it
Yes. If all you've got running is Elasticsearch, then the advice from that article doesn't apply.
When do you think VirtualLock will be supported on Windows? A couple of months? A year?
Hopefully soon. We've been waiting for feedback from Windows experts, but I think we now have enough info to proceed.
@clintongormley Aside from ElasticSearch, we are running other programs on the same server (IIS, SQL Server). It probably does not make sense to run a dedicated ElasticSearch-only server on the Windows platform. If/when we move ElasticSearch onto a separate server, it will be a Linux server (less frequent reboots, no licensing fees, better configuration support in ElasticSearch).
Thank you for your reply, and we're looking forward to VirtualLock support on Windows.
Disabling the page file and setting mlockall: true in the yml doesn't fix the issue on Windows. mlockall is still reported as false (/_nodes/process) and performance is still poor. The swap space used is basically the entire remaining memory of the node (50%).
We moved our ElasticSearch job percolation functionality from the Windows server to an ElasticSearch cluster on two Linux VPSes (3GB RAM + 2GB RAM). Percolation performance improved, at a fraction of the hosting price (relative to the price of the dedicated Windows server). We can also increase cluster performance simply by adding more nodes to our ElasticSearch cluster.
@FrederickManley mlockall is linux only. See https://github.com/elastic/elasticsearch/pull/9186 for support for something similar coming in Windows
The original issue appears to have been fixed. Closing.
Summary
When we specify doc_values=true on a geo property in our "jobs" index, the percolator stops working properly and is not able to find queries that limit results by geo distance. The percolator is still able to find queries that are not limited by distance.
We were able to reproduce that issue on both ElasticSearch 1.2.2 and ElasticSearch 1.4.3. We did not try any other versions.
We were able to reproduce that issue in both percolate and mpercolate (batch percolate). See also: https://github.com/elasticsearch/elasticsearch/issues/6806
Below I’m going to demonstrate how to:
1) Create the index with doc_values. That way the percolator is not able to find my alert: the alert searches for jobs that are within 50 miles, but cannot find the job item we created that is ~10 miles away.
2) Re-create the same index, but without doc_values. This time the percolator is able to find the same alert.
1) Reproducing the bug (doc_values = true)
With "doc_value" percolator can NOT find local alert. Delete old index (if any)
Create new “jobs” index
Add one query (job alert) into the jobs percolator.
Percolate a job that matches the alert we inserted above. It cannot find any matching queries because of the bug in how ElasticSearch handles doc_values=true.
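For reference, a mapping along these lines is what triggers the bug. This is a sketch only: the field name Location is taken from the query example earlier in the thread, not from the original reproduction steps, and the exact doc values syntax varies across 1.x versions:

```json
{
  "mappings": {
    "job": {
      "properties": {
        "Location": {
          "type": "geo_point",
          "doc_values": true
        }
      }
    }
  }
}
```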
2) Correct behavior (doc_values is not specified)
As you can see, when doc_values is not specified the percolator works correctly.