Can you please provide more information that would help us to reproduce the problem you are facing? It would help to get the exact query you are running (including the indices that are hit), what your data looks like, possibly the mapping, and what your indexing process looks like.
Ideally, can you provide a minimal example? Just a few documents that, if indexed, will exhibit the mentioned problem?
@cbuescher That's the problem: we are unable to reproduce it, though we are worried it will occur again.
We were targeting 2 types in the query, e.g. http://node01:9200/my_index/my_type,my_other_type/_search
What we observed in the screenshots above was that it would randomly return incorrect totals for the same query executed multiple times.
With the explain option set in the query we saw that 1 particular node would always return the wrong totals, so we restarted this node, but the issue continued.
Only when we had restarted all nodes did the problem disappear.
We don't know if it was some corruption in the shards of 1 node or something else. As mentioned, we didn't see the same issue using the _msearch and _count APIs with the same query.
@cbuescher Would you be able to suggest what additional information we should collect, in order to be able to reproduce the error if we face the same issue again?
As mentioned earlier, anything that helps reproduce this on our side: what does your data look like, what mappings are you using, what's your ES setup, how do you index, and what does the query that exhibits this problem look like?
The information provided here should give you some idea of the query, mappings, data, indexing and ES setup; please note that the examples provided here have been greatly simplified.
Example query: the query listed below exhibited the issue while the problem was present, and the same query is now producing the expected correct result (after a rolling restart of the cluster resolved the issue):
http://node01:9200/my_index/product/_search
{ "size": 10, "query": { "ids": { "values": [183307729] } } }
mappings for the type “product” :
mapping.txt
What the data looks like: product.txt
Indexing process and the ES setup:
The elasticsearch.yml file (attached here as elasticsearch.txt) is from one of the nodes in the 9-node cluster.
Indexing to the cluster is performed using the HTTP bulk API with a default batch size of 1,000 items and a peak rate of around 2,000 documents per second.
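(For illustration, a minimal sketch of what one of those bulk requests could look like; the file name and field values are placeholders using the simplified names from this issue, not the real mapping or data:)
curl -s -H 'Content-Type: application/x-ndjson' -XPOST 'http://node01:9200/_bulk' --data-binary @batch.ndjson
# where batch.ndjson holds up to 1,000 action/source pairs, one per line, e.g.:
# {"index":{"_index":"my_index","_type":"product","_id":"183307729"}}
# {"ProductId":183307729,"LastUpdated":"2017-07-01T00:00:00Z"}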
The issue has occurred again, after only 1 month of uptime.
** Correction - _msearch does exhibit the same issue. We use our userIds as the preference with _msearch, and while one might consistently work... another will consistently fail.
Also, adding this preference to the _search query, e.g. /product/_search?preference=2906, results in one preference returning correct results consistently and another consistently failing.
We had to perform a rolling restart of all nodes again to fix the issue.
@cbuescher Is there anything more we can check if/when it happens again? APIs to find the shards a specific preference targets etc., to see if we can narrow down which shards/nodes are causing the issue?
Sorry for the delay, just looking into this again, this seems really odd but hard to debug without reproduction.
Just to clarify a bit:
- For a query like { "size": 10, "query": { "ids": { "values": [183307729] } } } you would expect to get back exactly one document and a hits.total of 1? But you are getting more than that?
- Is the ProductId the same as the document id you index the main document with?
- When this happens, does it happen for all ids or just very few? As long as you don't restart, does the problem reproduce with that same id?
- What happens if, instead of using the ids query or a terms query on the _id field, you use the main document ProductId?
My suspicion is this might have something to do with the nested documents and merging, but have no good idea so far what is happening or how to explain this.
Another question I forgot to ask: What do you need the result of hits.total for? You mentioned _count as not exhibiting the same behaviour; is that still true as far as you know?
Let me answer your questions in turn:
1) Yes, for a query like { "size": 10, "query": { "ids": { "values": [183307729] } } } we expect to get back one document and a hits.total of 1. But when this issue occurs we get an incorrect hits.total value. Sometimes it returns 1 (the correct value), other times a much higher value; this behaviour feels like a round-robin of the shards.
2) ProductId and the document _id are the same.
3) When this happens the problem occurs for all ids we've randomly spot checked.
4) We don't currently index the ProductId field.
5) We use hits.total inside one of our web applications to make a business logic decision; the _count API still returns the correct count even when the incident occurs.
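(For reference, a spot check of this kind might look like the following pair of requests, using the simplified names from this issue; while the incident is occurring, _search returns the suspect hits.total whereas _count stays correct:)
curl -s -XPOST 'http://node01:9200/my_index/product/_search' -d '{"size":10,"query":{"ids":{"values":[183307729]}}}'
curl -s -XPOST 'http://node01:9200/my_index/product/_count' -d '{"query":{"ids":{"values":[183307729]}}}'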
Thanks for the feedback, this is indeed a weird behaviour. Just a few follow up questions:
- Next time this happens, can you run the query with the explain parameter set to true and post the output so we see the actual Lucene query that gets executed?
- Can you run java -cp lib/lucene-core-6.4.1.jar org.apache.lucene.index.CheckIndex -fast <Path-To-Index> in your Elasticsearch home directory, where Path-To-Index would be the absolute path to the index segments in your data directory, so usually something like data/nodes/0/indices/<indexUUID>/<shardNo>/index/. The indexUUID can be obtained by running the _cat/indices endpoint, and you should try all available shards on that node and see if the tool detects any problems.
One thing I forgot mentioning about CheckIndex: since it's a very low-level Lucene tool that comes without warranty, you should make a backup of your index before running it. More in the JavaDocs.
I changed the title to make the issue a bit more specific; still not sure if this is a bug or not, but let's treat it as one for now.
No, we migrated from 1.X to 5.X using the _reindex API into new indices.
All 9 nodes in the cluster are running 5.3.1
We will perform this test when the incident occurs again; it's happened twice in the last 7 days.
Good information on checking the index for corruption, we'll give this a go after making a backup.
Experiencing the same issue again,
Please see the queries and their results listed below. Would this information help?
1) http://node01:9200/my_index/my_product/_search?preference=3677 { "size": 10, "query": {"ids": { "values": [105226165] }}}
This query consistently returns incorrect hits.total value.
2) Executing the same query using the Validate API with the explain parameter set to true:
http://node01:9200/my_index/my_product/_validate/query?explain=true {"query": {"ids": { "values": [105226165] }} }
Result: {"valid": true, "_shards": { "total": 1,"successful": 1,"failed": 0}, "explanations": [ { "index": "my_index", "valid": true, "explanation": "+_uid:my_product#105226165 #(#_type:my_product)" }]}
3) Executing the same query using the Validate API with the explain parameter set to true and rewrite set to true:
http://node01:9200/my_index/my_product/_validate/query?explain=true&rewrite=true {"query": {"ids": { "values": [105226165] }}}
Result: {"valid": true,"_shards": {"total": 1,"successful": 1,"failed": 0}, "explanations": [ { "index": "my_index", "valid": true, "explanation": "+ConstantScore(_uid:my_product#105226165) #(ConstantScore(_type:my_product))^0.0" }]}
Hi, another question that might narrow down my initial suspicion that this has something to do with the nested "SellerProducts" documents. For the above document in question (105226165), do you know if and how many nested documents it currently has? Maybe it had some that got deleted? Does this number come close to the wrong hits.total you are seeing? This is a wild guess, but since the queries run on the "_uid" field it might be worth trying to narrow things down here a little.
Thank you for your response,
It has multiple nested documents, but hits.total and the number of nested documents are not the same. These numbers are way off and don't come close (hits.total was 290 whereas the expected number was 1).
@JagathJayasinghe thanks for your patience and bearing with me. I talked to @martijnvg, who is more familiar with nested docs, and we are still leaning in that direction with a hypothesis of what might be wrong.
A few more questions:
- The mapping that you shared is for type product, but in the validate request you query type my_product. Is that the same mapping?
Sorry that we're still poking around in the dark a little on this; there's something fishy here but it's hard to put a finger on it.
Thanks,
It may be possible for us to try closing/opening the index, but it would require a maintenance window as there are several
Okay, so that's not practical then; also this was only out of interest, no concrete hint in mind.
I talked to @jimczi in the meantime regarding the explain output you posted and he thinks it looks okay (the nested filter is omitted because there is a filter on _type). Another thing to recheck, just to be sure:
When the issue first happened we did use the _search_shards API with a successful call and an unsuccessful call of the same query in the hopes of identifying a faulty node/shard. We thought we had identified the faulty node through the order they were listed in the responses. I think we restarted the first one that was out of order in the faulty call… but it didn’t fix the problem so we went and did a rolling restart of all nodes. Which we have been doing since.
Are there other APIs we should use to help us identify the nodes/shards a certain preference is targeting next time this happens?
Thank you.
When the issue first happened we did use the _search_shards API with a successful call and an unsuccessful call of the same query in the hopes of identifying a faulty node/shard.
As far as I can see, if you use _search_shards with a fixed preference=XYZ value, you should get the node/shard that would be selected first when you run an actual search with that preference value. So if your ids query behaves weirdly with preference=x but works well with preference=y, that should tell you which shards on which nodes are used, and you can check whether restarting only those helps (at least for that id; docs with other ids might live on another shard).
Also, setting "explain": true in your search request should tell you _shard and _node for every hit, so if you use that with a good and a bad preference parameter you should see a difference.
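(A minimal sketch of those two checks, assuming the simplified index/type names and the preference values already mentioned in this issue:)
# Which shard copies a given preference value routes to:
curl -s 'http://node01:9200/my_index/_search_shards?preference=2906'
# The same ids query with explain enabled, so each hit reports its _shard and _node:
curl -s -XPOST 'http://node01:9200/my_index/my_type,my_other_type/_search?preference=2906&explain=true' -d '{"query":{"ids":{"values":[111637091]}}}'
# Repeat with a "good" preference (e.g. 2904) and compare the _shard/_node values.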
Btw. how many shards and replicas are configured for the index in question? I thought I'd already asked that and assumed a "standard" setup, but I just rechecked and couldn't find anything.
The issue has happened again. So we ran our query with 2 preferences:
Preference 2904 works and correctly returns "total": 1
Query: POST my_index/my_type%2Cmy_other_type/_search?preference=2904&explain=true { "query": { "ids": { "values": [111637091] }}}
Response: Response.txt
The source has 43 nested documents of 1 type and 9 nested documents of another type.
Preference 2906 fails and returns "total": 595
Query: POST my_index/my_type%2Cmy_other_type/_search?preference=2906&explain=true {"query": { "ids": {"values": [111637091] } }}
Response: Response.txt
We tried just restarting the node identified by the failed query, but this did not fix the issue. While the node was recovering, the query switched to another node, but once recovery finished it switched back to the original node and continued to fail. We restarted the node it had switched to and tried again… but it still failed. Then we restarted the final node holding shard 0, and again it still failed. So we restarted all nodes to clear the issue.
Please note that my_type and my_other_type used in this simplified query/response version are using the same "product" mapping.
Thank you.
@JagathJayasinghe can you share how many shards and replicas are configured for the index in question? Apologies if this is already somewhere in the ticket, I couldn't find anything.
There are 6 shards and 2 replicas.
Thanks, let's take a step back here. I didn't focus so far on the fact that you are querying across two types; can you elaborate on this a bit more? This might come into play with the nested docs. As far as I can see in this issue, you have one index (here "my_index", probably a different name in prod) and two types ("my_type", "my_other_type"). The mapping you provided in https://github.com/elastic/elasticsearch/issues/25603#issuecomment-315796289 is the mapping for one type ("product").
The source has 43 nested documents of 1 type and 9 nested documents of another type.
Thank you for the response. Please see explanations below in turn,
Yes, the mapping for both types is exactly the same. The document Ids should be unique across both types.
So there are 43 nested documents of nestedType and 9 of otherNestedType
Yes, there is 1 parent document with id "111637091"; in this case it is of type my_other_type. It has 2 nested document properties:
"nestedTypeProp": { "dynamic": "false", "type": "nested", "properties": {…} },
"otherNestedTypeProp": { "dynamic": "false", "type": "nested", "properties": {…} }
Preference 2906 fails and returns "total": 595 Query: POST my_index/my_type%2Cmy_other_type/_search?preference=2906&explain=true {"query": { "ids": {"values": [111637091] } }}
would you be able to run the query without the type filter, i.e.:
POST my_index/_search?preference=2906&explain=true
and share the result?
The problem is this issue is not occurring at the moment on our production cluster, but it just so happens that today I noticed we were experiencing the same issue on our Dev cluster, so I have been playing with the various APIs to see if I can find out anything new. Again querying for a single Id.
Running the query with &explain=true only gives us the explanation from the node that successfully returned the 1 hit. response.txt Whether run with or without the type filters, I was getting the same hits.total of 598.
Using the ?preference=_shards:n parameter I ran the query against each of the 9 shards we have on 3 nodes in Dev individually.
Node1 correctly returned 1 hit and total:1 for 1 of its shards, and 0 hits and total:0 for the other shards on that node.
Node2 returned no hits but totals of 95 for all of its 3 shards.
Node3 returned no hits but totals of 104 for all of its 3 shards.
(1) + (95 × 3) + (104 × 3) => 598
Again I tried the _count API, but with ?preference=_shards:n for each individual shard, and it still returns the correct counts for all shards.
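(For completeness, a sketch of that per-shard comparison, assuming the 9-shard dev index and a placeholder document id:)
for SHARD in $(seq 0 8); do
  echo "shard $SHARD"
  # _search (default size 10) shows the bogus hits.total on the bad shards...
  curl -s -XPOST "http://node01:9200/my_index/_search?preference=_shards:$SHARD" -d '{"query":{"ids":{"values":[105226165]}}}'
  # ...while _count against the same shard keeps returning the correct count.
  curl -s -XPOST "http://node01:9200/my_index/_count?preference=_shards:$SHARD" -d '{"query":{"ids":{"values":[105226165]}}}'
done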
Is there something more we can try to find out what's going on in the 2 problem nodes we currently have in our Dev cluster?
Incidentally, our Dev cluster is running v5.2.2, whereas our LIVE cluster is v5.3.1.
@dannymurphygfk, just to make sure, because @JagathJayasinghe was talking about 6 shards and 2 replicas before, I assume this was production. How's the situation on the dev cluster that you mentioned now? How many of those 9 shards are primaries, how is their distribution over the three nodes?
@cbuescher Yes Jagath was talking about the Production Cluster. Our Dev cluster has 9 shards 0 replicas, 3 shards on each node.
One thing I haven't asked at all so far is how you index your documents. Do you use any customized routing by any chance? How about updates to parent documents, e.g. when nested documents are deleted or added?
@cbuescher We use the Bulk indexing API to index and no we don't use any customized routing.
When nested documents are added/updated/deleted the entire parent document is involved? We are using the nested model, NOT the parent-child relationship.
I think we might have confused you by previously calling our nested documents 'types' or something?
When nested documents are added/updated/deleted the entire parent document is involved? We are using the nested model, NOT the parent-child relationship.
Yes, understood. I just wanted to exclude the chance of weird custom routing issues from the list of possible causes.
We are not restarting the nodes on our Dev cluster as we are hoping it can help us troubleshoot the problem. Don't know if it will help much, but we noticed that we can create a new index with the default settings for shards etc., and if we query the index even before adding any documents we get bogus hit totals?
{ "acknowledged": true, "shards_acknowledged": true }
{ "took": 3, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 294, "max_score": 0, "hits": [] } }
{ "took": 2, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 407, "max_score": 0, "hits": [] } }
Like I said our Dev cluster is still currently in this state so if there are any other APIs we can call to help troubleshoot it please let us know.
we can create a new index, with the default settings for shards
Thanks, it's great if you can keep the dev cluster in the "problematic" state, and the fact that this reproduces without any data points in a different direction than my previous suspicion about nested documents, I think, although that behaviour on a completely empty index is super strange as well. So just to sum up and double check, correct me if I'm wrong:
A few questions again:
And thanks for sticking with this puzzling thing for so long.
We have 3 nodes in the Dev cluster
The new index defaulted to 5 shards and 1 replica
Yes, we are performing an empty search and yes we see total hits coming back.
Yes I ran 2 searches because I was seeing 2 different counts coming back alternately. (this morning this search is alternating between totals 199 and 398)
Running the search with preference=_shards I get
{ "took": 11, "timed_out": false, "_shards": { "total": 1, "successful": 1, "failed": 0 }, "hits": { "total": 0, "max_score": null, "hits": [] } }
{ "took": 1, "timed_out": false, "_shards": { "total": 1, "successful": 1, "failed": 0 }, "hits": { "total": 0, "max_score": null, "hits": [] } }
{ "took": 1, "timed_out": false, "_shards": { "total": 1, "successful": 1, "failed": 0 }, "hits": { "total": 0, "max_score": null, "hits": [] } }
{ "took": 3, "timed_out": false, "_shards": { "total": 1, "successful": 1, "failed": 0 }, "hits": { "total": 95, "max_score": 0, "hits": [] } }
{ "took": 4, "timed_out": false, "_shards": { "total": 1, "successful": 1, "failed": 0 }, "hits": { "total": 104, "max_score": 0, "hits": [] } }
And the distribution of the shards is as follows:
Actually, when I rerun the search with the _shards preference again, it's different shards that return the wrong totals.
Can you explain how I check the empty index shards for corruption?
Can you explain how I check the empty index shards for corruption?
To check index shards for corruption you can use a command line tool built into Lucene and run it on the Lucene index that contains the files for that shard. For that, go to the affected node and change to the ES home directory (depending on your installation). There should be a lib folder that contains lucene-core-6.4.2.jar (the version might be different, this is for 5.3.3). Next you need to know which directory the shard data is in. Figure out the index id of the affected index using the _cat/indices API (e.g. it's rX1iHVUXSw6z7nYNSWzQ3A for a test on my local machine). The directory you want to check then is $YOUR_DATA_DIR/nodes/$NODE/indices/$INDEX-ID/$SHARD-ID/index, where $YOUR_DATA_DIR is the data directory configured in your ES config file, $NODE should be 0 if you are only running one node on each machine, and $SHARD-ID is the shard number you want to check (e.g. 4 in the above example where shard 4 seems to be doing quirky things).
It seems like you need to close the index before running the checker, otherwise you get a LockObtainFailedException. Then run:
java -cp ./lib/lucene-core-6.4.2.jar org.apache.lucene.index.CheckIndex -fast $YOUR_DATA_DIR/nodes/$NODE/indices/$INDEX-ID/$SHARD-ID/index
The tool should give you a report telling you if there are any issues. I'd try running it at least on both nodes that contain shard 4 of the above empty index. Now that the issue also reproduces on your dev environment, can you also close the "real" test index, run the checker on all of its shards, and report if it turns up anything of interest? I also asked initially whether the problem with the wrong counts persists after closing/re-opening an index; it would be interesting to try that on the dev-cluster index now as well.
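(A rough sketch of running the checker over every shard of one index on a node; the data directory and the index UUID below are assumptions to adjust to your own setup, and as mentioned above the index should be closed and backed up first:)
INDEX_UUID=rX1iHVUXSw6z7nYNSWzQ3A   # placeholder, take the real value from _cat/indices
DATA_DIR=/var/lib/elasticsearch     # placeholder, whatever path.data points to on this node
curl -s -XPOST 'http://node01:9200/my_index/_close'
for SHARD_DIR in "$DATA_DIR"/nodes/0/indices/"$INDEX_UUID"/*/index; do
  echo "Checking $SHARD_DIR"
  java -cp ./lib/lucene-core-6.4.2.jar org.apache.lucene.index.CheckIndex -fast "$SHARD_DIR"
done
curl -s -XPOST 'http://node01:9200/my_index/_open'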
I ran the tool against all shards for the empty index... all report "No problems were detected with this index."
Closing the empty index and reopening didn't fix the issue either.
I even ran the tool on the "real" test index against all of its shards; it logs how it checks each segment etc., but the outcome was still "no problems were detected with this index".
I don't see anything else remarkable in the report but have attached the output from node one just in case. node1Shards.txt
Closing this index and reopening it didn't fix the issue either.
I dunno if we already mentioned it before but specifying the "size" appears to have an impact on the results. If we specify size 0 we get the correct totals... but if we specify a size > 0 we get incorrect totals, both for our "real" index and our empty index.
e.g. with the empty index: POST my_index/_search { "size": 0 } vs POST my_index/_search { "size": 1 }
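(The two requests side by side, for anyone trying to reproduce this; only the size value differs:)
curl -s -XPOST 'http://node01:9200/my_index/_search' -d '{"size": 0}'   # correct hits.total
curl -s -XPOST 'http://node01:9200/my_index/_search' -d '{"size": 1}'   # bogus hits.total while the cluster is in the bad state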
Hi @dannymurphygfk,
now that you are able to reproduce the issue on an empty index, would it be possible to provide us with a copy of the empty index shards? I'd like to see if I can recreate the problem locally using your faulty empty index shards. I think it should work if you zip the empty index's data directory on each of the three nodes of your test cluster and attach it to this issue. I'm not at all sure this will allow me to reproduce the issue, but I think it's worth a try. Let me know if this is possible for you.
As requested the copy of the shards from the empty index. my_index.zip
Thanks for the copy of the shards. I imported them into a local three node 5.2.2 cluster (I think that's the version you're running in Dev), but unfortunately I'm not able to reproduce the issue like this. So the "bad" state that your cluster is in when returning the wrong "hits.total" is probably an in-process thing. This would also explain why closing/opening the index doesn't make the issue disappear, but a restart of the nodes does.
I dunno if we already mentioned it before but specifying the "size" appears to have an impact on the results. If we specify size 0 we get the correct totals... but if we specify a size > 0 we get incorrect totals, both for our "real" index and our empty index.
I think this is the most interesting hint we currently have, but I don't have an idea yet how to dig into this behaviour further. Thanks for bearing with me, let me know if you have any new information on your side.
Sorry for the long silence here, just wanted to let you know I'm still curious about this but currently have no good ideas of how to proceed without being able to reproduce this on our side. But it's still on the radar, let me know if you have anything new or you don't see the issue any more (maybe in a different environment)...
Hi @cbuescher ,
We have upgraded to v5.6.4 last week so will let you know if it happens again on this version.
After posting yesterday about having upgraded... unfortunately we noticed the same problem happening again. It took a full cluster rolling restart to sort it.
So we noticed the issue 6 days ago and performed a rolling restart to clear it. The issue has been noticed again just now, so we are yet again performing a rolling restart to clear it.
Just in case this has any bearing, we do have 2 clusters, a Head and a Tail. Both clusters' data gets indexed by the same process using the bulk indexing APIs.
The issue is only happening on our Head cluster.
What is different about our Head cluster is:
- Data older than 6 months is removed periodically.
- Mappings are slightly different as the use case for querying each cluster is slightly different.
- We use custom Native scripts for filtering and sorting.
Also, this might have some bearing: we periodically notice 1 or more nodes on our Head cluster consuming too much memory and crashing, or getting stuck in a broken state and requiring manual intervention to recover. This is also something we are investigating.
These might be interesting differences. Can you share a bit more, e.g. about how the deletion happens (e.g. with what frequency, any queries/scripts involved)? Same for the mappings, what are those slight differences? As for the scripted filtering/sorting, do you only use these on the Head cluster? If so, can you also share those?
The deletions are done through a Cron job that runs daily at 9pm.
/usr/bin/curl --silent -X POST -d '{"query":{"range":{"LastUpdated":{"lt":"now-180d/d"}}}}' http://node1:9200/my_index/my_type,my_other_type/_delete_by_query 2>&1 |/usr/bin/logger -t cleanup
The differences in the mappings are mostly to avoid indexing properties that we don't need to query by in either cluster, e.g. in our Head cluster we index a property which we later use in the filtering/sorting scripts:
"MySortingMetaProperty":{ "type":"keyword" }
This field is not indexed in the Tail cluster.
Yes the scripts are only used by our Head cluster, they are quite involved though... not sure I can share them in their entirety.
E.g. some of the scripts use "MySortingMetaProperty". This property contains an array of slugs (CSV) of the important information in the nested documents we spoke about above. Our script parses this information and processes it to provide a sort value for the main document.
Just noticed: while the cluster is in this bad state, if I query for an Id that does NOT exist I also get a bogus total value.
POST my_index/my_type%2Cmy_other_type/_search?preference=2906 {"query": { "ids": {"values": [1] } }}
There are no documents of my_type or my_other_type with Id 1, yet it still returned a total of 296.
I don't know if it will help much, but it looks like this issue may be happening more often than we originally thought.
We set up an hourly scheduled task to query 3 or 4 Ids, including the id 1 which doesn't exist. It queries the same ids with 20 different preference numbers. The idea was to identify early when this issue begins.
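(Roughly, such an hourly check might look like the sketch below; the id list, the preference range and the grep-based extraction of hits.total are placeholders for illustration, not the exact script:)
for ID in 1 111637091 239563172; do
  for PREF in $(seq 2900 2919); do
    TOTAL=$(curl -s -XPOST "http://node01:9200/my_index/my_type,my_other_type/_search?preference=$PREF" \
      -d "{\"query\":{\"ids\":{\"values\":[$ID]}}}" | grep -o '"hits":{"total":[0-9]*' | grep -o '[0-9]*$')
    echo "$(date) id=$ID preference=$PREF hits.total=$TOTAL"
  done
done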
What we experienced over the xmas period is as follows:
15th Dec - Product id 1 started returning bogus totals. Later that day 1 node (node 5) was restarted for other reasons and it appeared the issue was fixed, until
17th Dec - Again product id 1 started returning bogus totals. 2 nodes (nodes 4 and 5) were restarted for other reasons and again it appeared the issue was fixed, until
21st Dec - Again product id 1 started returning bogus totals.
23rd Dec - 1 node (node 4) went down and was restarted - this appeared to fix the issue, until
25th Dec - Again product id 1 started returning bogus totals.
2 nodes (node 4 and 9) have been restarted since but the issue persisted this time and is still present.
Right now we are looking at a workaround for the broken feature that relied on this value being correct, to avoid having to perform time-consuming rolling restarts every time... but elsewhere in our application it will still be reporting the incorrect totals for us.
Given we are not restarting the nodes immediately due to this issue, I have more time to investigate it if you have any more ideas?
What I have noticed today is. I can query for an Id
POST my_index/my_type,my_other_type/_search?preference=2904
{
"query": {
"ids": {
"values": [239563172]
}
}
}
Which returns incorrect totals.
I can query for another Id
POST my_index/my_type,my_other_type/_search?preference=2904
{
"query": {
"ids": {
"values": [225238857]
}
}
}
Correctly returns 1
I can query for both Ids
POST my_index/my_type,my_other_type/_search?preference=2904
{
"query": {
"ids": {
"values": [239563172, 225238857]
}
}
}
Correctly returns 2!
Using the profile parameter I ran the query with incorrect totals and the one with both ids. The profile results at the top level look the same... querying the same nodes and shards? It looks like all 6 shards are participating in the query?
"profile": {
"shards": [{
"id": "[-aSRX8nEQdO_augx-5OAFw][my_index][4]",
"searches": [...],
"aggregations": [...]
}, {
"id": "[-aSRX8nEQdO_augx-5OAFw][my_index][5]",
"searches": [...],
"aggregations": [...]
}, {
"id": "[QmSRDHkyQeaeZU8sAkRwcw][my_index][2]",
"searches": [...],
"aggregations": [...]
}, {
"id": "[RBL1o_YbSvOk88d1agNmrw][my_index][1]",
"searches": [...],
"aggregations": [...]
}, {
"id": "[RBL1o_YbSvOk88d1agNmrw][my_index][3]",
"searches": [...],
"aggregations": [...]
}, {
"id": "[uUJfY22pSp2y8bFAsBbFIQ][my_index][0]",
"searches": [...],
"aggregations": [...]
}
]
}
Elasticsearch version: 5.3.1
Elasticsearch setup: 9 node cluster
Plugins installed: repository-s3, x-pack
JVM version: openjdk version "1.8.0_131"
OS version: Ubuntu Server 16.04
Machine specifications: 8x CPU, 8GB RAM, 150GB SSD, 10Gbps Network
Description of the problem including expected versus actual behavior: Elasticsearch returns an incorrect hits.total value even though the results themselves seem correct.
After stripping the query back considerably we see that the hits.total inconsistency happens when using the ids query below, no matter how many or which ids we use.
Steps to reproduce: We so far haven't been able to work out what triggers this particular bug.
Example query and results as shown in the attached image below.
A rolling restart of the cluster seems to resolve the issue; I'm not sure if this is due to an in-process bug or a corrupt index.
We would like to find out whether this is a potential bug in Elasticsearch, or whether anyone else has seen this happening, as we've had no luck finding what is causing it.