Closed dennisgorelik closed 8 years ago
@dennisgorelik Sorry for replying so late. The percolator uses a special in-memory index and this index doesn't support doc values. The doc values workaround I shared in #6806 is incorrect.
In theory doc values support could be added to this in-memory index. What is the reason behind using doc values in the percolator? The performance should be better since #7081 got in.
@martijnvg The only reason why we even tried doc values was your recommendation (that now you are saying is incorrect).
The performance should be better since #7081 got in.
Yes, percolator's performance improved ~33% when we migrated from ElasticSearch 1.2.2 to ElasticSearch 1.4.3.
We are still looking for further performance improvements and better stability of the ElasticSearch percolator under high load. Currently, too much stress (multiple queues) over a long period causes ElasticSearch to stop producing results entirely. I guess the solution is adding multiple ElasticSearch nodes.
http://stackoverflow.com/questions/28864184/how-to-improve-percolator-performance-in-elasticsearch
@dennisgorelik The percolator is CPU intensive and adding more nodes will help to distribute the percolator load better. How many queries do you have stored and how many percolate requests are you executing concurrently?
We have about 160,000 queries stored. We created 2 shards on our JobIndex that we run the percolator against. We have up to 4 concurrent percolate requests. In addition, there can be other ElasticSearch queries (e.g. searches coming from our web server plus some other batch searches). All of that runs on a "Dual Processor Octo Core Xeon 2650 - 2GHz+HT" - that's 32 CPU threads. We have a single-node ElasticSearch. We use ES_HEAP_SIZE=6g. In total, ElasticSearch consumes up to 12GB on our server.
Occasionally (about once every few days) ElasticSearch enters a "stupor" mode where it consumes all available CPU (e.g. 90%), with the following errors in the logs:
[2015-03-24 00:08:01,508][WARN ][monitor.jvm ] [Cancer] [gc][old][416159][31731] duration [34s], collections [1]/[34.5s], total [34s]/[5.6h], memory [5.7gb]->[4.1gb]/[5.8gb], all_pools {[young] [1.4gb]->[11.4mb]/[1.4gb]}{[survivor] [153.4mb]->[0b]/[191.3mb]}{[old] [4.1gb]->[4.1gb]/[4.1gb]}
An ElasticSearch reboot helps.
Any suggestions about how we can prevent such an ElasticSearch stupor?
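For what it's worth, the warning line above already contains the diagnosis. A small sketch (Python, parsing the exact log line quoted above) that pulls out the old-generation pool figures:

```python
import re

# The GC warning line from the Elasticsearch log quoted above
line = ("[gc][old][416159][31731] duration [34s], collections [1]/[34.5s], "
        "total [34s]/[5.6h], memory [5.7gb]->[4.1gb]/[5.8gb], all_pools "
        "{[young] [1.4gb]->[11.4mb]/[1.4gb]}{[survivor] [153.4mb]->[0b]/[191.3mb]}"
        "{[old] [4.1gb]->[4.1gb]/[4.1gb]}")

# Extract the old-generation pool figures: before -> after / max
m = re.search(r"\{\[old\] \[([\d.]+)gb\]->\[([\d.]+)gb\]/\[([\d.]+)gb\]\}", line)
before, after, maximum = map(float, m.groups())

# The 34s collection freed essentially nothing and the pool sits at its
# ceiling: a classic sign the live data set no longer fits the heap.
print(before, after, maximum)
```

Here the old generation stays at 4.1gb out of 4.1gb after a full collection, which is why the JVM keeps re-collecting and burning CPU.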
There can be many reasons why you're running low on heap space and seeing long GCs. It doesn't have to be percolator related. Can you check the node stats API to see if something else is using up the heap space?
@martijnvg What kind of node stats are you asking for? Something like this?
curl -X GET "http://elasticsearchserver:9200/_nodes/stats/jvm?pretty=true"
{
"cluster_name" : "elasticsearch",
"nodes" : {
.....
"jvm" : {
"timestamp" : 1427258626635,
"uptime_in_millis" : 55962324,
"mem" : {
"heap_used_in_bytes" : 3744527576,
"heap_used_percent" : 59,
"heap_committed_in_bytes" : 6241845248,
"heap_max_in_bytes" : 6241845248,
"non_heap_used_in_bytes" : 1371365808,
"non_heap_committed_in_bytes" : 1805742080,
"pools" : {
"young" : {
"used_in_bytes" : 119134544,
"max_in_bytes" : 1605304320,
"peak_used_in_bytes" : 1605304320,
"peak_max_in_bytes" : 1605304320
},
"survivor" : {
"used_in_bytes" : 637400,
"max_in_bytes" : 200605696,
"peak_used_in_bytes" : 200605696,
"peak_max_in_bytes" : 200605696
},
"old" : {
"used_in_bytes" : 3624755632,
"max_in_bytes" : 4435935232,
"peak_used_in_bytes" : 3721644784,
"peak_max_in_bytes" : 4435935232
}
}
},
"threads" : {
"count" : 449,
"peak_count" : 451
},
"gc" : {
"collectors" : {
"young" : {
"collection_count" : 51319,
"collection_time_in_millis" : 1035200
},
"old" : {
"collection_count" : 966,
"collection_time_in_millis" : 896951
}
}
},
"buffer_pools" : {
"direct" : {
"count" : 585,
"used_in_bytes" : 53303365,
"total_capacity_in_bytes" : 53303365
},
"mapped" : {
"count" : 669,
"used_in_bytes" : 9273195645,
"total_capacity_in_bytes" : 9273195645
}
}
}
}
}
}
This is node stats API output during normal (healthy) operation. Should I collect the same information when my ElasticSearch node experiences problems?
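The fields that matter most in that response can be extracted programmatically for periodic logging. A minimal sketch; the sample dict mirrors the response shape above, while in practice the JSON would come from the `/_nodes/stats/jvm` request shown earlier:

```python
def heap_summary(stats):
    """Return (node_id, heap_used_percent, old_gen_fill_ratio) per node."""
    out = []
    for node_id, node in stats["nodes"].items():
        mem = node["jvm"]["mem"]
        old = mem["pools"]["old"]
        out.append((node_id,
                    mem["heap_used_percent"],
                    old["used_in_bytes"] / old["max_in_bytes"]))
    return out

# Trimmed sample mirroring the healthy node stats shown above
sample = {
    "nodes": {
        "node1": {
            "jvm": {
                "mem": {
                    "heap_used_percent": 59,
                    "pools": {
                        "old": {"used_in_bytes": 3624755632,
                                "max_in_bytes": 4435935232},
                    },
                },
            },
        },
    },
}

for node_id, pct, old_fill in heap_summary(sample):
    # An old generation that stays near its maximum between collections is
    # an early warning that the next full GC may not keep up.
    print(node_id, pct, round(old_fill, 2))
```

Tracking the old-generation fill ratio over time (rather than one snapshot) is what makes the difference between "healthy" and "about to enter stupor" visible.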
@dennisgorelik Yes, the node stats API, but preferably all the information, captured while you're seeing the JVM GC warnings. The following request should include all information and show sizes in a human-readable way:
curl -XGET "http://elasticsearchserver:9200/_nodes/stats?pretty=true&all&human"
What can also be helpful when JVM problems occur is a JVM heap dump, which you can share with me privately.
@martijnvg 1) When our ElasticSearch server hangs again, we'll send you the full node stats API response (based on your command). 2) How do I produce a JVM heap dump for ElasticSearch on Windows? I tried:
jmap -dump:live,format=b,file=heap.bin 2400
and got:
2400: Access is denied
2400 is the PID of elasticsearch-service-x64.exe on our server.
@martijnvg Our ElasticSearch server entered the half-dead state again. Basic search queries were taking ~15s (instead of the usual 0 to 30 ms). Percolation timed out. Removing almost all load from the ElasticSearch server for several minutes did NOT help. Only an ElasticSearch reboot helped it recover.
Prior to entering the half-dead state, ElasticSearch had been running percolation non-stop for ~6 hours.
See the full node stats API result for our ElasticSearch in the half-dead state:
{
"cluster_name" : "elasticsearch",
"nodes" : {
"gzJr6VoQSGK7mTr7MjfYGQ" : {
"timestamp" : 1427771500284,
.....
"indices" : {
"docs" : {
"count" : 2397045,
"deleted" : 400923
},
"store" : {
"size" : "8.9gb",
"size_in_bytes" : 9651063464,
"throttle_time" : "1.2m",
"throttle_time_in_millis" : 73703
},
"indexing" : {
"index_total" : 792427,
"index_time" : "39.5m",
"index_time_in_millis" : 2371584,
"index_current" : 0,
"delete_total" : 1917090,
"delete_time" : "2.2m",
"delete_time_in_millis" : 135614,
"delete_current" : 0,
"noop_update_total" : 0,
"is_throttled" : false,
"throttle_time" : "0s",
"throttle_time_in_millis" : 0
},
"get" : {
"total" : 4777,
"get_time" : "4.1s",
"time_in_millis" : 4192,
"exists_total" : 4777,
"exists_time" : "4.1s",
"exists_time_in_millis" : 4192,
"missing_total" : 0,
"missing_time" : "0s",
"missing_time_in_millis" : 0,
"current" : 0
},
"search" : {
"open_contexts" : 18,
"query_total" : 9285622,
"query_time" : "2.9d",
"query_time_in_millis" : 253663579,
"query_current" : 8,
"fetch_total" : 8472373,
"fetch_time" : "5.6h",
"fetch_time_in_millis" : 20164442,
"fetch_current" : 0
},
"merges" : {
"current" : 0,
"current_docs" : 0,
"current_size" : "0b",
"current_size_in_bytes" : 0,
"total" : 4148,
"total_time" : "39.2m",
"total_time_in_millis" : 2353027,
"total_docs" : 22699407,
"total_size" : "15.2gb",
"total_size_in_bytes" : 16400873095
},
"refresh" : {
"total" : 44893,
"total_time" : "22.2m",
"total_time_in_millis" : 1332971
},
"flush" : {
"total" : 3138,
"total_time" : "2.2m",
"total_time_in_millis" : 137586
},
"warmer" : {
"current" : 0,
"total" : 87566,
"total_time" : "10s",
"total_time_in_millis" : 10033
},
"filter_cache" : {
"memory_size" : "33.4mb",
"memory_size_in_bytes" : 35027716,
"evictions" : 20064952
},
"id_cache" : {
"memory_size" : "0b",
"memory_size_in_bytes" : 0
},
"fielddata" : {
"memory_size" : "33.9mb",
"memory_size_in_bytes" : 35613020,
"evictions" : 0
},
"percolate" : {
"total" : 1577725,
"get_time" : "20.4d",
"time_in_millis" : 1769513471,
"current" : 27,
"memory_size_in_bytes" : -1,
"memory_size" : "-1b",
"queries" : 366200
},
"completion" : {
"size" : "0b",
"size_in_bytes" : 0
},
"segments" : {
"count" : 186,
"memory" : "20.4mb",
"memory_in_bytes" : 21488188,
"index_writer_memory" : "1.7mb",
"index_writer_memory_in_bytes" : 1863332,
"index_writer_max_memory" : "595.2mb",
"index_writer_max_memory_in_bytes" : 624184520,
"version_map_memory" : "24.8kb",
"version_map_memory_in_bytes" : 25396,
"fixed_bit_set" : "0b",
"fixed_bit_set_memory_in_bytes" : 0
},
"translog" : {
"operations" : 3439,
"size" : "20.6kb",
"size_in_bytes" : 21152
},
"suggest" : {
"total" : 0,
"time" : "0s",
"time_in_millis" : 0,
"current" : 0
},
"query_cache" : {
"memory_size" : "0b",
"memory_size_in_bytes" : 0,
"evictions" : 0,
"hit_count" : 0,
"miss_count" : 0
}
},
"os" : {
"timestamp" : 1427771501003,
"uptime" : "9.4m",
"uptime_in_millis" : 568846,
"cpu" : {
"sys" : 0,
"user" : 22,
"idle" : 77,
"usage" : 22,
"stolen" : 0
},
"mem" : {
"free" : "909.6mb",
"free_in_bytes" : 953839616,
"used" : "31gb",
"used_in_bytes" : 33370476544,
"free_percent" : 3,
"used_percent" : 96,
"actual_free" : "1002.3mb",
"actual_free_in_bytes" : 1051021312,
"actual_used" : "30.9gb",
"actual_used_in_bytes" : 33273294848
},
"swap" : {
"used" : "30.5gb",
"used_in_bytes" : 32848777216,
"free" : "16.4gb",
"free_in_bytes" : 17689870336
}
},
"process" : {
"timestamp" : 1427771501018,
"open_file_descriptors" : 5960,
"cpu" : {
"percent" : 0,
"sys" : "5.2m",
"sys_in_millis" : 312838,
"user" : "7.3s",
"user_in_millis" : 7335,
"total" : "5.3m",
"total_in_millis" : 320173
},
"mem" : {
"resident" : "2.1gb",
"resident_in_bytes" : 2335469568,
"share" : "-1b",
"share_in_bytes" : -1,
"total_virtual" : "1.8gb",
"total_virtual_in_bytes" : 1965916160
}
},
"jvm" : {
"timestamp" : 1427771501018,
"uptime" : "6.5d",
"uptime_in_millis" : 568852806,
"mem" : {
"heap_used" : "5.6gb",
"heap_used_in_bytes" : 6104360960,
"heap_used_percent" : 97,
"heap_committed" : "5.8gb",
"heap_committed_in_bytes" : 6241845248,
"heap_max" : "5.8gb",
"heap_max_in_bytes" : 6241845248,
"non_heap_used" : "568.1mb",
"non_heap_used_in_bytes" : 595737264,
"non_heap_committed" : "1.1gb",
"non_heap_committed_in_bytes" : 1272410112,
"pools" : {
"young" : {
"used" : "1.4gb",
"used_in_bytes" : 1605304320,
"max" : "1.4gb",
"max_in_bytes" : 1605304320,
"peak_used" : "1.4gb",
"peak_used_in_bytes" : 1605304320,
"peak_max" : "1.4gb",
"peak_max_in_bytes" : 1605304320
},
"survivor" : {
"used" : "60.2mb",
"used_in_bytes" : 63160368,
"max" : "191.3mb",
"max_in_bytes" : 200605696,
"peak_used" : "191.3mb",
"peak_used_in_bytes" : 200605696,
"peak_max" : "191.3mb",
"peak_max_in_bytes" : 200605696
},
"old" : {
"used" : "4.1gb",
"used_in_bytes" : 4435934656,
"max" : "4.1gb",
"max_in_bytes" : 4435935232,
"peak_used" : "4.1gb",
"peak_used_in_bytes" : 4435935232,
"peak_max" : "4.1gb",
"peak_max_in_bytes" : 4435935232
}
}
},
"threads" : {
"count" : 471,
"peak_count" : 474
},
"gc" : {
"collectors" : {
"young" : {
"collection_count" : 556417,
"collection_time" : "2.1h",
"collection_time_in_millis" : 7611784
},
"old" : {
"collection_count" : 34576,
"collection_time" : "5.5h",
"collection_time_in_millis" : 20125610
}
}
},
"buffer_pools" : {
"direct" : {
"count" : 1127,
"used" : "61.9mb",
"used_in_bytes" : 64991049,
"total_capacity" : "61.9mb",
"total_capacity_in_bytes" : 64991049
},
"mapped" : {
"count" : 766,
"used" : "8.9gb",
"used_in_bytes" : 9608816056,
"total_capacity" : "8.9gb",
"total_capacity_in_bytes" : 9608816056
}
}
},
"thread_pool" : {
"percolate" : {
"threads" : 32,
"queue" : 0,
"active" : 27,
"rejected" : 0,
"largest" : 32,
"completed" : 85311
},
"bench" : {
"threads" : 0,
"queue" : 0,
"active" : 0,
"rejected" : 0,
"largest" : 0,
"completed" : 0
},
"listener" : {
"threads" : 10,
"queue" : 0,
"active" : 0,
"rejected" : 0,
"largest" : 10,
"completed" : 32291
},
"index" : {
"threads" : 32,
"queue" : 0,
"active" : 0,
"rejected" : 0,
"largest" : 32,
"completed" : 84
},
"refresh" : {
"threads" : 8,
"queue" : 0,
"active" : 0,
"rejected" : 0,
"largest" : 8,
"completed" : 44873
},
"suggest" : {
"threads" : 0,
"queue" : 0,
"active" : 0,
"rejected" : 0,
"largest" : 0,
"completed" : 0
},
"generic" : {
"threads" : 2,
"queue" : 0,
"active" : 0,
"rejected" : 0,
"largest" : 7,
"completed" : 69713
},
"warmer" : {
"threads" : 3,
"queue" : 0,
"active" : 0,
"rejected" : 0,
"largest" : 4,
"completed" : 42844
},
"search" : {
"threads" : 96,
"queue" : 0,
"active" : 8,
"rejected" : 0,
"largest" : 96,
"completed" : 20079486
},
"flush" : {
"threads" : 5,
"queue" : 0,
"active" : 0,
"rejected" : 0,
"largest" : 5,
"completed" : 43953
},
"optimize" : {
"threads" : 0,
"queue" : 0,
"active" : 0,
"rejected" : 0,
"largest" : 0,
"completed" : 0
},
"management" : {
"threads" : 5,
"queue" : 22,
"active" : 5,
"rejected" : 0,
"largest" : 5,
"completed" : 344629
},
"get" : {
"threads" : 32,
"queue" : 0,
"active" : 0,
"rejected" : 0,
"largest" : 32,
"completed" : 4778
},
"merge" : {
"threads" : 5,
"queue" : 0,
"active" : 0,
"rejected" : 0,
"largest" : 5,
"completed" : 46789
},
"bulk" : {
"threads" : 32,
"queue" : 0,
"active" : 0,
"rejected" : 0,
"largest" : 32,
"completed" : 42574
},
"snapshot" : {
"threads" : 0,
"queue" : 0,
"active" : 0,
"rejected" : 0,
"largest" : 0,
"completed" : 0
}
},
"network" : {
"tcp" : {
"active_opens" : 125109,
"passive_opens" : 1728257,
"curr_estab" : 706,
"in_segs" : 911340183,
"out_segs" : 1034136815,
"retrans_segs" : 767107,
"estab_resets" : 429327,
"attempt_fails" : 128973,
"in_errs" : 928,
"out_rsts" : 413521
}
},
"fs" : {
"timestamp" : 1427771501018,
"total" : {
"total" : "744.9gb",
"total_in_bytes" : 799925071872,
"free" : "510.4gb",
"free_in_bytes" : 548089225216,
"available" : "510.4gb",
"available_in_bytes" : 548089225216,
"disk_reads" : 50198038,
"disk_writes" : 131904978,
"disk_io_op" : 182103016,
"disk_read_size" : "1.1gb",
"disk_read_size_in_bytes" : 1199022592,
"disk_write_size" : "298.8mb",
"disk_write_size_in_bytes" : 313397248,
"disk_io_size" : "1.4gb",
"disk_io_size_in_bytes" : 1512419840,
"disk_queue" : "0"
},
"data" : [ {
"path" : "F:\\elasticsearch-1.4.3\\data\\elasticsearch\\nodes\\0",
"mount" : "F:\\",
"dev" : "F:\\",
"total" : "744.9gb",
"total_in_bytes" : 799925071872,
"free" : "510.4gb",
"free_in_bytes" : 548089225216,
"available" : "510.4gb",
"available_in_bytes" : 548089225216,
"disk_reads" : 50198038,
"disk_writes" : 131904978,
"disk_io_op" : 182103016,
"disk_read_size" : "1.1gb",
"disk_read_size_in_bytes" : 1199022592,
"disk_write_size" : "298.8mb",
"disk_write_size_in_bytes" : 313397248,
"disk_io_size" : "1.4gb",
"disk_io_size_in_bytes" : 1512419840,
"disk_queue" : "0"
} ]
},
"transport" : {
"server_open" : 13,
"rx_count" : 6,
"rx_size" : "1.3kb",
"rx_size_in_bytes" : 1356,
"tx_count" : 6,
"tx_size" : "1.3kb",
"tx_size_in_bytes" : 1356
},
"http" : {
"current_open" : 38,
"total_opened" : 5428
},
"breakers" : {
"request" : {
"limit_size_in_bytes" : 2496738099,
"limit_size" : "2.3gb",
"estimated_size_in_bytes" : 32880,
"estimated_size" : "32.1kb",
"overhead" : 1.0,
"tripped" : 0
},
"fielddata" : {
"limit_size_in_bytes" : 3745107148,
"limit_size" : "3.4gb",
"estimated_size_in_bytes" : 35616584,
"estimated_size" : "33.9mb",
"overhead" : 1.03,
"tripped" : 0
},
"parent" : {
"limit_size_in_bytes" : 4369291673,
"limit_size" : "4gb",
"estimated_size_in_bytes" : 35649464,
"estimated_size" : "33.9mb",
"overhead" : 1.0,
"tripped" : 0
}
}
}
}
}
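As a quick sanity check, the `jvm.uptime_in_millis` and `gc.collectors` figures in the stats above can be combined into a single GC-overhead ratio. A sketch, with the numbers copied from the output above:

```python
# Rough health check from the gc/uptime numbers in the stats above:
# what fraction of the JVM's life has been spent collecting garbage?
uptime_ms = 568852806    # jvm.uptime_in_millis (~6.5 days)
young_gc_ms = 7611784    # gc.collectors.young.collection_time_in_millis (~2.1h)
old_gc_ms = 20125610     # gc.collectors.old.collection_time_in_millis (~5.5h)

gc_fraction = (young_gc_ms + old_gc_ms) / uptime_ms
print(round(gc_fraction * 100, 1))  # percent of wall time spent in GC
```

Roughly 5% of total wall time spent in GC, with 5.5 hours of it in stop-the-world old-generation collections, is consistent with a heap that is too small for the working set; the 97% `heap_used_percent` in the same output points the same way.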
@dennisgorelik I see that you have 366200 queries stored in total. That by itself may take a big part of your JVM heap space. What kind of queries are you storing in the percolator? (Anything with a wildcard?) I assume that during these 6 hours you also added new percolator queries to ES?
The percolator parses the queries and keeps the parsed variants (Lucene queries) in JVM heap space. Some queries use more memory than others, but regardless of that a node can only hold a finite number of queries, so at some point you need to add another node. In your case, giving your current node more JVM heap space would also help.
@martijnvg 1) We have 150,909 abandoned job percolators (left over after we moved job percolators into a percolatejobs index that is separate from the underlying jobs index). We do not run any queries against these abandoned percolators. Do they still use ElasticSearch memory?
a node can only hold a finite amount of queries
How many percolator queries can an ElasticSearch node hold? Or, to put it another way: how much memory does a typical percolator query (without wildcards) take from the JVM heap?
2) We do not have any wildcard queries at all. Here is a typical example of the percolator queries we have:
"_source":{"query" : { "filtered" : { "query" : { "query_string" : { "fields" : ["JobTitle","JobDescription","CompanyName","Salary"], "query" : "\"QA.QC\" AND MANUFACTURE", "analyzer" : "pjfqueryanalyzer" } }, "filter" : { "and" : [{ "geo_distance" : { "distance" : "25mi", "Location" : {"lat":40.506772,"lon":-74.265423} }}] } } } , "SystemUpdateDate": 1406362400409}
3) Yes, during the 6 hours of percolation some percolator queries were added, but not many: probably fewer than 100 new queries. Overall, the ElasticSearch server had been running for 6 days since we last had to reboot it.
@dennisgorelik I see you have swap enabled. Do you have mlockall enabled as well? You can check with:
GET _nodes/process
If not, then this could be the cause of your slow garbage collections. See http://www.elastic.co/guide/en/elasticsearch/reference/current/setup-configuration.html#setup-configuration-memory for more details.
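A quick way to audit this across a cluster is to inspect the `mlockall` flag in the `_nodes/process` response. A sketch; the sample dict stands in for a live call to that endpoint:

```python
def nodes_without_mlockall(process_stats):
    """Return the ids of nodes where the memory lock did not take effect."""
    return [node_id
            for node_id, node in process_stats["nodes"].items()
            if not node["process"]["mlockall"]]

# Trimmed sample mirroring a GET _nodes/process response
sample = {
    "cluster_name": "elasticsearch",
    "nodes": {
        "XF2RV7JLSvOVpK7QAVPv9Q": {"process": {"mlockall": False}},
    },
}

# A node listed here despite bootstrap.mlockall: true in elasticsearch.yml
# means the lock failed (e.g. missing memlock ulimit on Linux, or the
# platform does not support it).
print(nodes_without_mlockall(sample))
```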
@clintongormley
We are running ElasticSearch on Windows Server 2012. Unfortunately, the article you suggested only describes how to disable memory swapping on Linux.
Do you know how to disable swap for ElasticSearch on Windows? That question has been left unanswered on StackOverflow for 8 months: http://stackoverflow.com/questions/25297682/how-to-avoid-elasticsearch-process-being-swapped-on-windows-servers
In elasticsearch.yml we set:
bootstrap.mlockall: true
However, under Windows ElasticSearch ignores that setting and reports mlockall=false:
GET _nodes/process
{
"cluster_name": "elasticsearch",
"nodes": {
"XF2RV7JLSvOVpK7QAVPv9Q": {
"name": "Armand Martel",
"transport_address": "inet[/xxx.xxx.xxx.xxx:9300]",
"host": "sv7731",
"ip": "xxx.xxx.xxx.xxx",
"version": "1.4.3",
"build": "36a29a7",
"http_address": "inet[/xxx.xxx.xxx.xxx:9200]",
"process": {
"refresh_interval_in_millis": 1000,
"id": 24116,
"max_file_descriptors": -1,
"mlockall": false
}
}
}
}
@dennisgorelik we're working on it for windows (see https://github.com/elastic/elasticsearch/pull/9186) but you can always disable the page file on windows for now.
@clintongormley 1) Do you mean disable the page file for all programs in Windows? http://lifehacker.com/5426041/understanding-the-windows-pagefile-and-why-you-shouldnt-disable-it
2) When do you think VirtualLock will be supported on Windows? A couple of months? A year?
1) Do you mean disable the page file for all programs in Windows? http://lifehacker.com/5426041/understanding-the-windows-pagefile-and-why-you-shouldnt-disable-it
Yes. If all you've got running is Elasticsearch, then the advice from that article doesn't apply.
When do you think VirtualLock will be supported on Windows? A couple of months? A year?
Hopefully soon. We've been waiting for feedback from Windows experts, but I think we now have enough info to proceed.
@clintongormley Aside from ElasticSearch, we are running other programs on the same server (IIS, SQL Server). It probably does not make sense to run a dedicated ElasticSearch-only server on the Windows platform. If/when we move ElasticSearch onto a separate server, it will be a Linux server (less frequent reboots, no licensing fees, better configuration support in ElasticSearch).
Thank you for your reply, and we're looking forward to VirtualLock support on Windows.
Disabling the page file and setting mlockall: true in the yml doesn't fix the issue on Windows. mlockall is still reported as false (/_nodes/process) and performance is still poor. The swap space used is basically the entire remaining memory of the node (50%).
We moved our ElasticSearch job percolation functionality from the Windows server to an ElasticSearch cluster on two Linux VPSes (3GB RAM + 2GB RAM). Percolation performance improved, at a fraction of the hosting price (relative to the price of the dedicated Windows server). We can also increase cluster performance simply by adding more nodes to our ElasticSearch cluster.
@FrederickManley mlockall is linux only. See https://github.com/elastic/elasticsearch/pull/9186 for support for something similar coming in Windows
The original issue appears to have been fixed. Closing.
Summary
When we specify doc_values=true on a geo property in our "jobs" index, the percolator stops working properly and is not able to find queries that limit results by geo distance. The percolator is still able to find queries that are not limited by distance.
We were able to reproduce that issue on both ElasticSearch 1.2.2 and ElasticSearch 1.4.3. We did not try any other versions.
We were able to reproduce that issue in both percolate and mpercolate (batch percolate). See also: https://github.com/elasticsearch/elasticsearch/issues/6806
Below I’m going to demonstrate how to:
1) Create the index with doc_values. That way the percolator is not able to find my alert: the alert searches for jobs that are within 50 miles, but cannot find the job item we created that is ~10 miles away.
2) Re-create the same index, but without doc_values. This time the percolator is able to find the same alert.
1) Reproducing the bug (doc_values = true)
With "doc_value" percolator can NOT find local alert. Delete old index (if any)
Create new “jobs” index
Add one query (job alert) into the jobs percolator.
Percolate a job that matches the alert we inserted above. It cannot find any matching queries because of the bug in how ElasticSearch handles doc_values=true.
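For reference, a mapping along these lines is what triggers the bug. This is a sketch only: the field name Location is taken from the query example earlier in the thread, not from the original reproduction steps, and the exact doc values syntax varies across 1.x versions:

```json
{
  "mappings": {
    "job": {
      "properties": {
        "Location": {
          "type": "geo_point",
          "doc_values": true
        }
      }
    }
  }
}
```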
2) Correct behavior (doc_values is not specified)
As you can see, when doc_values is not specified the percolator works correctly.