elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

terms facet gives wrong count with n_shards > 1 #1305

Closed jmchambers closed 9 years ago

jmchambers commented 13 years ago

I'm working with nested documents and have noticed that my faceted search interface is giving the wrong counts when I have more than one shard. To be more specific, I'm working with RDF triples (entity > attribute > value) and I'm nesting the attributes (called predicates in my example):

{
  "_id" : "512a2c022f0b4e3daa341e6c8bcf6c2f",
  "url": "http://dbpedia.org/resource/Alan_Shepard",
  "predicates": [
    {
      "type": "type",
      "string_value": ["thing", "person", "astronaut"]
    }, {
      "type": "label",
      "string_value": ["Alan Shepard"]
    }, {
      "type": "time in space",
      "float_value": [216.950]
    },
    ... lots more
  ]
}

I've created a shell script (https://gist.github.com/1196986) that recreates the problem with a fresh index. The created data set has these totals:

With only one shard the following query gives the correct counts no matter what the size parameter is set to:

{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "facets": {
    "type_counts": {
      "terms": {
        "field": "string_value",
        "size": 5
      },
      "nested": "predicates",
      "facet_filter": {
        "term": {
          "type": "type"
        }
      }
    }
  }
}

However, with more than one shard the size parameter affects the accuracy of the counts. If it is equal to or greater than the number of terms returned by the facet query (5 in this case) then it works fine. However, the terms at the bottom of the list start to display low counts as you reduce the size parameter:

With "size" : 4

With "size" : 3

With "size" : 2

So it looks like the sub-totals from some of the shards aren't being included for some reason. BTW, I'm on Ubuntu and the problem seems to affect all versions of ES I've tried (0.17.0, 0.17.1 and 0.17.6). Any ideas...?

P.S. absolutely loving ES - it's made my life a lot easier :)

losomo commented 12 years ago

+1 for this bug. I have reproduced the problem using just documents with one field. Complete test (as run on version 0.17.8), as a Perl script: https://gist.github.com/1292897 and as the generated shell test: https://gist.github.com/1292912. The first error can be seen on line 951. Expected:

"terms" : [
 {
 "count" : 10,
 "term" : "user 10"
 }
 ],

Got:

"terms" : [
 {
 "count" : 7,
 "term" : "user 9"
 }
 ],
losomo commented 12 years ago

After more experimenting I'd say that it's caused by naïve top-N facet merging. Something like Phase three here should be added:

http://wiki.apache.org/solr/DistributedSearchDesign#Phase_3:_REFINE_FACETS_.28only_for_faceted_search.29

Or something smarter: http://netcins.ceid.upatras.gr/papers/Klee_VLDB.pdf

kimchy commented 12 years ago

Right, the way top N facets work now is by getting the top N from each shard, and merging the results. This can give inaccurate results. The phase 3 thingy is not really a solution, will read the paper though :)
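
A hypothetical two-shard illustration of how the merge goes wrong (all term names and counts invented here), with a terms facet asking for size 2:

// shard 1 local top 2 ("user 8" has 5 hits here but is cut off):
[ { "term": "user 9",  "count": 7 }, { "term": "user 10", "count": 6 } ]

// shard 2 local top 2 ("user 9" has 3 hits here but is cut off):
[ { "term": "user 8",  "count": 6 }, { "term": "user 10", "count": 4 } ]

// naive merge of the shard results:
[ { "term": "user 10", "count": 10 }, { "term": "user 9", "count": 7 } ]

The true total for "user 9" is 7 + 3 = 10, so it is under-reported, and "user 8" (6 + 5 = 11 hits overall, the real top term) is missing from the merged top 2 entirely.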

losomo commented 12 years ago

Right. The "phase 3 thingy" only solves the second problem:

  1. The recall is not 100% (some values may be simply missing)
  2. The numbers are often lower than they should be.

But while the first problem can easily go unnoticed, the second one leads to quite a lousy user experience in very common faceting scenarios: the page displays the top 10 commenters with the number of their comments in parentheses, you click the last one showing 30 comments, and voilà, it shows her 48 comments. This is hard not to notice.

Meanwhile, a slow and not-always-working workaround: set size to what you want plus 150 (or another magical constant) when requesting the facets, and then keep only the top size facets on the client side.
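
For example, if you actually want the top 10 commenters, the padded request might look like this (the field name and the +150 padding are only for illustration):

{
  "size": 0,
  "query": { "match_all": {} },
  "facets": {
    "top_commenters": {
      "terms": {
        "field": "commenter",
        "size": 160
      }
    }
  }
}

Then keep only the first 10 entries of the returned terms array in the application.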

karussell commented 12 years ago

The phase 3 thingy is not really a solution

@kimchy: why do you think so? Isn't it a similar approach to query-then-fetch, only for facets and their counts instead of docs and their scores? (It could even be implemented in the second query to avoid extra traffic, no?)

Another workaround would be 'routing': route documents so that every facet value ends up on the same shard. But then the data needs to have a lot of facet values, none of them with too high a count, to avoid unbalanced shards...
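
A minimal sketch of that idea, assuming a hypothetical comment type whose commenter field is both the routed-on value and the faceted field (pre-2.0 mapping syntax with a _routing path):

{
  "mappings": {
    "comment": {
      "_routing": {
        "required": true,
        "path": "commenter"
      },
      "properties": {
        "commenter": { "type": "string", "index": "not_analyzed" }
      }
    }
  }
}

All documents sharing a commenter value then hash to the same shard, so that term's count can never be split across shards, at the cost of the balancing risk mentioned above.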

agnellvj commented 12 years ago

Is this on the roadmap? We discovered this same problem with regular fields and large datasets with multiple nodes and shards. This is problematic for us since the faceted counts can fluctuate wildly based on what filters we apply. Is there a workaround?

piskvorky commented 12 years ago

I still see this bug in 19.4, and the wrong counts are also a show-stopper for us.

Adding "150 or another magical constant" only helps for tiny datasets; does there exist a more robust workaround until this is fixed properly?

jmchambers commented 12 years ago

+1 for a better workaround or fix

The only completely robust workaround I know of is to limit yourself to one shard. Obviously that pretty much takes the 'elastic' out of 'elasticsearch', but if accurate counts are critical to your app and your index isn't too big you might get away with it...
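
For reference, the single-shard workaround is just an index-creation setting (replica count here is only an example):

{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}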

tarunjangra commented 12 years ago

Any update on this? I have an application where counts are critically important. You suggest limiting to one shard a particular entity type that has to undergo such queries. Does it make sense to route entities by entity type to corresponding shards?

karmi commented 12 years ago

The only completely robust workaround I know of is to limit yourself to one shard. Obviously that pretty much takes the 'elastic' out of 'elasticsearch' (...)

@jmchambers Not really -- you can slice the data into many one-shard indices, with a weekly (daily, hourly, ...) rotation. You can use aliases (or wildcards, ...) to query them in a reasonable way from your application. Would multiple one-shard indices work around the facet problem?
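
A rough sketch of that setup, assuming hypothetical weekly one-shard indices registered under a single alias via the _aliases API:

{
  "actions": [
    { "add": { "index": "events-2012-w40", "alias": "events" } },
    { "add": { "index": "events-2012-w41", "alias": "events" } }
  ]
}

Searches against the events alias (or a wildcard such as events-2012-*) then fan out to one shard per index.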

piskvorky commented 12 years ago

@karmi: only if the indices are large enough. To my knowledge, the scoring stats (IDF etc) are computed per-index (actually, per-shard by default). So comparing relevancy scores across indices is like comparing apples to oranges, even with dfs_query_then_fetch.

But once the indices are large enough, this doesn't matter anymore as the stats converge (assuming identical doc/word distribution in each index, for the stats to actually converge).

ajhalani commented 12 years ago

+1 for a fix. The workarounds (one shard, multiple one-shard indices, etc.) don't sound convincing.

Downchuck commented 12 years ago

I'm a bit confused on this issue: I have five shards and the top value occurs on each shard as #1, by a large margin. Still, when I facet, I get a 20% smaller number than when I simply query for the value directly.

jrydberg commented 12 years ago

@karmi

you can slice the data into many one-shard indices, with a weekly (daily, hourly, ...) rotation

Won't you run into the same problem when doing a faceted count over multiple indices?

giamma commented 11 years ago

Any news about this issue?

tgruben commented 11 years ago

Any hope of this being fixed in the near future?

danfairs commented 11 years ago

For what it's worth, the multiple index/single shard approach is working for us. Our application has a new index per week of data anyway, so it's not actually too painful for our current usage.

markwaddle commented 11 years ago

+1 for a fix. My application has 7M+ docs of varying sizes (a 600GB index) and growing, so one shard is not feasible. For my application I am willing to trade performance (and/or hardware resources) for accurate facet counts.

webmusing commented 11 years ago

+1 for a fix

The terms_stats facet seems to be affected by the same issue as well.

piskvorky commented 11 years ago

It seems there is no fix forthcoming; how about at least making the most common scenario less annoying/more palatable?

What I mean by that: when ES returns, let's say, 20 facet terms (with the wrong counts...), it could automatically run a second request against all shards asking for an accurate count for only these exact 20 terms, then add up those accurate counts and return only that as the result. Would that make sense?
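
In the meantime, that second pass can be done manually from the application: for each term returned by the facet, issue a cheap query that only counts matches (field and term names below are hypothetical):

{
  "size": 0,
  "query": {
    "term": { "commenter": "user 9" }
  }
}

hits.total in the response is an exact count, because every shard contributes to it.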

vincentpoon commented 11 years ago

+1 for a fix. Incorrect facet counts are resulting in a bad user experience.

songday commented 11 years ago

It seems this bug is not completely fixed.

I did a facet query sorted by count, and the results were right:

"terms" : [ {
    "term" : "AAA",
    "count" : 59,
    "total_count" : 59,
    "min" : 1.0,
    "max" : 54.0,
    "total" : 391.0,
    "mean" : 6.627118644067797
}, {
    "term" : "BBB",
    "count" : 55,
    "total_count" : 55,
    "min" : 1.0,
    "max" : 17.0,
    "total" : 154.0,
    "mean" : 2.8
} ]

But when sorting by total (same query), the results are not right:

"terms" : [ {
    "term" : "AAA",
    "count" : 56, //this is not right
    "total_count" : 56,
    "min" : 1.0,
    "max" : 54.0,
    "total" : 388.0,
    "mean" : 6.928571428571429
}, {
    "term" : "BBB",
    "count" : 56, //this is not right
    "total_count" : 56,
    "min" : 1.0,
    "max" : 17.0,
    "total" : 171.0,
    "mean" : 3.0535714285714284
} ]

tommymonk commented 11 years ago

+1

HeyBillFinn commented 11 years ago

Hi,

I am trying to run a search query using multiple facets, and I want the facet numbers to update in the same way as the site Zappos does. For example, when I search for Nike in the men's shoe section, I get facets for Brand and Color.


When I narrow my search by Brand (by selecting 'Nike'), the numbers within the Brand facet do not change, but the numbers within the Color facet change to reflect the narrowed search results.

I can widen my search by selecting 'Nike Action', and still no numbers within the Brand facet are updated, but numbers in the Color facet have been updated to reflect the additional results.


I can see the same expected results if I select a term within the Color facet:


I can think of two ways to do this using Elastic Search, and I'm looking for any guidance/suggestions as to the best way to implement this.

  1. Filtered query using fairly complex facet filters within each facet, including global = true flag.
  2. Top-level filter (which I understand does not affect the facet results) with slightly less complex facet filters within each facet

Which of the two options would perform better? Is there a better option that I'm not thinking of? I can add example JSON if it will help explain my thoughts.

Thanks!

Bill

danfairs commented 11 years ago

@finn1317, this is something you should ask on the elasticsearch mailing list. I don't think it's related to the issue being discussed here.

peakx commented 11 years ago

+1 for a fix.

SanderDemeester commented 11 years ago

Question: could this bug be related to the following result I'm getting? If I request the first JSON document where the value for field x is y, I get a result back. But when I request all distinct values for field x, I get a list of values back, yet 'y' is not included.

The query uses all_terms and I match using match_all.

kaliseo commented 11 years ago

+1

Same issue : ElasticSearch Version: 0.90.1

seti123 commented 11 years ago

I noticed wrong terms facet counts after updating documents (i.e. I put a new doc with the same _id). Using ES 0.90.2. _version counts up. A few docs got more than 11 versions, and the facet counter seems to count each version (at least it grows with the updates I do). When I query, I get only the latest doc (as far as I know ES holds only the last doc and no history -- on the other hand, I'd have liked to query for _version-1 to compare changes...).

My application reads the docs and, depending on other information, might write a new version to the index with additional information. But the terms facet counts then become completely wrong... (e.g. counting docs with the same title 11 times instead of 1 time).

Any advice?

spancer commented 11 years ago

+1

I got this issue on ElasticSearch Version: 0.90.0

swarzech commented 10 years ago

+1

I am using Version 0.90.3

thefastman commented 10 years ago

+1 still see this bug in ES version 0.90.5

lou commented 10 years ago

:+1:

fabien7337 commented 10 years ago

+1, it's really a big problem for a solution that claims to support "analytics" :)

lukas-vlcek commented 10 years ago

@zywx I think it depends. I found the following article quite interesting: "Why you don't want real-time analytics to be exact" by Mikio Braun. Although it does not discuss Elasticsearch in particular, it has interesting points. And depending on your particular use case for the terms facet, you can make a couple of tweaks to get more accurate or exact results from ES.

tarunjangra commented 10 years ago

@lukas-vlcek Thanks for sharing this article. It really makes sense when we're talking about big data.

kimchy commented 10 years ago

Btw, I think we can do better even with the current structure, by allowing clients to ask for more terms to be returned per shard than the final size returned. This should not be difficult.

Also, we could do exact counts, yet not exact ordering, potentially with another roundtrip. It will require more work on the search execution side.

We also plan to add support for more statistical models as part of our future aggregation model, which would allow getting results within a bounded margin of error.

fabien7337 commented 10 years ago

Yes, that's my problem.

I followed your suggestion: I run one search to order the elements by a facet, then a second search to get all the facets for just those elements. But it produces a bad sort... :(

When you say "we also plan", what is the roadmap?

mattweber commented 10 years ago

@kimchy I was thinking of something with statistical models as well. Maybe use the current method to stream the top N from each node with exact counts, plus a serialized Count-Min Sketch. Then, when merging the results from each node, if for some reason one of the terms was not in the top N on a given node, we could check that node's sketch for an approximate count for the given term. This could avoid another roundtrip to "refine" the counts, but I guess it depends on whether that roundtrip would be faster than sending the sketch...

fabien7337 commented 10 years ago

Otherwise, maybe the best way is to control the routing? Maybe the facet counts are correct if all documents for a given term end up on the same shard?

javanna commented 10 years ago

Have a look at #3821, which allows increasing the number of terms requested per shard in order to make the counts more accurate. It will be available in the upcoming 0.90.6.
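
If I understand the change correctly, the terms facet gains a per-shard size knob (documented as shard_size in the accuracy-control section linked later in this thread). Reusing the facet from the original report, the request would look something like:

{
  "size": 0,
  "query": { "match_all": {} },
  "facets": {
    "type_counts": {
      "terms": {
        "field": "string_value",
        "size": 5,
        "shard_size": 50
      },
      "nested": "predicates",
      "facet_filter": {
        "term": { "type": "type" }
      }
    }
  }
}

Each shard then returns its local top 50, and only the merged top 5 is sent back to the client, which makes the counts (though still not guaranteed exact) much more accurate.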

thanodnl commented 10 years ago

I'm also experiencing this problem, and reading through this thread it is clear that we are not the only ones.

Although the patch contributed above can produce more accurate values, I would really like the facet to be more transparent about how accurate the answer is. This is easily done in the merging phase by including the number of shards each term was reported by, almost the same as the _shards information you already get for the whole query:

"_shards": {
   "total": 10,
   "successful": 10,
   "failed": 0
},

By including an 'included_shards' field per term, the application can assess how accurate the numbers are. For example, for every term where included_shards equals _shards.total, the count is exact and can be shown to the user with confidence.

For terms where included_shards is lower than _shards.total, you can show the numbers as approximations (~9) or, if the numbers are really important, do an extra round trip from the application to count only those terms.
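
To make the proposal concrete, a hypothetical response fragment (the included_shards field does not exist in ES today; term names and counts are invented):

"facets": {
  "type_counts": {
    "_type": "terms",
    "terms": [
      { "term": "person", "count": 1200, "included_shards": 10 },
      { "term": "thing",  "count": 480,  "included_shards": 7 }
    ]
  }
}

With _shards.total = 10, the first count is exact while the second should be treated as a lower bound.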

piskvorky commented 10 years ago

@thanodnl An explicit API flag for when accurate counts are needed (=an extra round trip) makes more sense IMO. The included_shards sounds like an implementation detail, why bother with it? What's the use case?

Either you need accurate facet counts, or you don't care (true/false flag). And default=true sounds like the proper way to me.

lukas-vlcek commented 10 years ago

@piskvorky If I read it correctly, an extra round trip could give you exact counts, but it still does not guarantee that the returned terms are really "the top". Using what @thanodnl suggests, you could easily recognise this situation. Actually, I like @thanodnl's suggestion.

piskvorky commented 10 years ago

@lukas-vlcek I agree, neither is a full solution.

What I'm suggesting is that a simple flag for "give me accurate counts, at least" may be simpler to use = what users typically want. I'm not thrilled about inspecting some internal included_shards statistic on query return and then issuing extra queries based on it, with that logic polluting my code.

But of course, different people need different things :)

jmajaranta commented 10 years ago

I have been working on implementing the three-phase uniform threshold algorithm (TPUT) in ES for exact facet term counts. The algorithm gives exact results in 3 fixed phases. I had proof-of-concept-level code that let facets support multiple round trips and coordination as a generic approach for different kinds of multi-round-trip algorithms, but the solution relied on the search context being available on every round trip. This is true, for example, for the query_then_fetch method, where you can run multiple round trips with the same context (and facet collector results) before the fetch phase closes the context.

Because the implementation would change too many places in the core of the search module, I'm working on making this a plugin with a separate REST endpoint and with support for only the terms facet.

Is anybody else working on adding true exact terms facet count support?

kjeremy commented 10 years ago

This also affects range facets. The count returned is drastically off even when I specify a size well above the number of matches.

khong07 commented 10 years ago

Hi,

Does the new "accuracy control" in 0.90.6 fix this issue? Has anyone tested it? http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-facets-terms-facet.html#_accuracy_control

Thanks

Sesshomurai commented 10 years ago

This is still a problem, even with just one node. Nested facet counts do not equal the document query result totals as they should. Is there a fix forthcoming, or a workaround?

xzer commented 10 years ago

I hit a related bug. As a workaround for the wrong term counts, I set the size to a reasonably big value, and then computed a semi-accurate term count by adding the size of the terms array to the returned "other" count, which is semi-reliable thanks to the reasonably big size.

After doing this, I found that I needed to exclude some terms, so I wrote my facet as follows:

  "facets": {
    "collaParty": {
      "terms": {
        "field": "party",
        "size": 20000,
        "exclude": ["someword"]
      }
    }
  }

Then the returned "other" count becomes a completely wrong value, which is exactly the document count hit by this query. See the value 9343 in the following result: it is the total-hits value, wrongly copied into the facet's "other" field.

{
    "took": 47,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 9343,
        "max_score": 1,
        "hits": [
            {
                "_index": "indexdb_test",
                "_type": "xxx",
                "_id": 1406893,
                "_score": 1,
                "_source": {...}
            },
            ...
        ]
    },
    "facets": {
        "collaParty": {
            "_type": "terms",
            "missing": 0,
            "total": 20944,
            "other": 9343,
            "terms": [
                {
                    "term": "ffs",
                    "count": 465
                },
                {...},
                ...
            ]
        }
    }
}

Currently, I think the only workaround is setting the facet size to an impossibly big value, which is terrible...