blevesearch / bleve

A modern text/numeric/geo-spatial/vector indexing library for go
Apache License 2.0

Support numeric facets #638

Open rikvdh opened 7 years ago

rikvdh commented 7 years ago

Storing data as an int or uint and retrieving it later as facets gives odd results.

In this example I have 212 documents in my index, with Brand.ID of type uint. JSON-encoding the facet gives the following result:

{"field":"Brand.ID","total":3392,"missing":0,"other":1272,"terms":[{"term":" \u0001?x\u0000\u0000\u0000\u0000\u0000\u0000\u0000","count":212},{"term":"$\u000b@\u0000\u0000\u0000\u0000\u0000\u0000","count":212},{"term":"_|\u0000\u0000\u0000\u0000\u0000\u0000","count":212},{"term":",\u0005`\u0000\u0000\u0000\u0000\u0000","count":212},{"term":"0/~\u0000\u0000\u0000\u0000\u0000","count":212},{"term":"4\u0002p\u0000\u0000\u0000\u0000","count":212},{"term":"8\u0017\u0000\u0000\u0000\u0000","count":212},{"term":"\u003c\u0001?x\u0000\u0000\u0000","count":212},{"term":"@\u000b@\u0000\u0000","count":212},{"term":"D_|\u0000\u0000","count":212}

Looking at the code, the TermFacet struct doesn't account for the fact that Terms can have other types than string.

https://github.com/blevesearch/bleve/blob/75d75bf1bc04a1265d4e4e73994424d318955f83/search/facets_builder.go#L73

Is this intended behavior?

mschoch commented 7 years ago

If you index the field as a number, then you have to use numeric range facets and not term facets. Otherwise, you will get the internal numeric terms coming back, which isn't what you want.

There doesn't seem to be a good example to point you to, but you need to provide these NumericRanges:

https://github.com/blevesearch/bleve/blob/master/search.go#L109

Basically, you define buckets for the numbers to fall into.
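
To make the bucket idea concrete, here is a minimal plain-Go sketch of what a numeric range facet computes. It does not call bleve: the `numericRange` type and `countRanges` function are hypothetical names for illustration, and the half-open `min <= v < max` rule is an assumption about how the bounds are treated. In bleve itself, ranges are added to a facet request with `AddNumericRange(name, min, max)`, where a nil bound leaves that side open.

```go
package main

import "fmt"

// numericRange is a hypothetical stand-in for one named facet bucket.
// A nil bound means the range is open on that side.
type numericRange struct {
	name     string
	min, max *float64
}

// countRanges tallies how many values fall into each named range.
// Assumption: ranges are half-open, i.e. min <= v < max.
// A value that matches no range is simply not counted.
func countRanges(values []float64, ranges []numericRange) map[string]int {
	counts := make(map[string]int, len(ranges))
	for _, v := range values {
		for _, r := range ranges {
			if (r.min == nil || v >= *r.min) && (r.max == nil || v < *r.max) {
				counts[r.name]++
			}
		}
	}
	return counts
}

func main() {
	ten, hundred := 10.0, 100.0
	ranges := []numericRange{
		{name: "under 10", max: &ten},
		{name: "10 to 99", min: &ten, max: &hundred},
		{name: "100 and up", min: &hundred},
	}
	ids := []float64{3, 10, 42, 99, 100, 250}
	fmt.Println(countRanges(ids, ranges))
}
```

The buckets have to be chosen before the search runs, so the result is only as fine-grained as the ranges you define.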

rikvdh commented 7 years ago

I see, but this still doesn't cover every use case. Sometimes you just want to get the facets the way TermFacets returns them, without predefining any buckets.

For example, suppose you have a car search where users look for cars with 2, 3 or 4 wheels. If cars with 5 wheels, or cars without wheels, are added later, you need to modify the buckets every time the dataset changes.

An alternative solution is to store the value as a string, but this feels like a workaround and not a real solution.
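
A rough sketch of that string workaround, in plain Go without bleve: the wheel count is rendered as a decimal string before indexing, so a term facet sees one readable term per distinct value. `wheelTerm` and `termCounts` are hypothetical helpers; `termCounts` only mimics the per-term counting a term facet does.

```go
package main

import (
	"fmt"
	"strconv"
)

// wheelTerm renders a wheel count as the string that would be stored
// in a text field instead of a numeric field.
func wheelTerm(n uint) string {
	return strconv.FormatUint(uint64(n), 10)
}

// termCounts mimics a term facet: one count per distinct term.
func termCounts(terms []string) map[string]int {
	counts := make(map[string]int)
	for _, t := range terms {
		counts[t]++
	}
	return counts
}

func main() {
	wheels := []uint{2, 3, 4, 4, 5, 0}
	terms := make([]string, 0, len(wheels))
	for _, w := range wheels {
		terms = append(terms, wheelTerm(w))
	}
	// New values such as 5 or 0 show up without any bucket changes.
	fmt.Println(termCounts(terms))
}
```

The cost is that the field is no longer numeric: string terms compare lexicographically ("10" sorts before "2"), so numeric range queries and numeric sorting on that field are lost.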

In any case I do not expect Bleve to give me garbage results as facets with terms like: " \u0001?x\u0000\u0000\u0000\u0000\u0000\u0000\u0000"

mschoch commented 7 years ago

Numeric values are encoded into multiple terms in the index (your garbage values). Term facets operate on these raw terms. So it is working exactly as expected in that case.
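
The sketch below illustrates the kind of encoding involved. It is a simplified reconstruction, not bleve's code: the constant `shiftStart`, the function names, and the precision step of 4 are assumptions. Each number is mapped to a sortable 64-bit integer, then emitted once per shift as a marker byte (0x20 plus the shift) followed by 7-bit chunks. With these assumptions one value yields 16 terms, which is consistent with the facet output above (total 3392 = 212 × 16) and with the leading characters of the reported terms (space, `$`, `,`, `0`, `4`, ...).

```go
package main

import (
	"fmt"
	"math"
)

// shiftStart is the assumed marker offset: the first byte of each term
// is shiftStart plus the number of low bits dropped.
const shiftStart byte = 0x20

// sortableInt64 maps a float64 to an int64 whose integer ordering
// matches the numeric ordering of the floats.
func sortableInt64(f float64) int64 {
	bits := int64(math.Float64bits(f))
	if bits < 0 {
		bits ^= 0x7fffffffffffffff
	}
	return bits
}

// prefixCode drops the lowest `shift` bits of v and stores the rest,
// most significant chunk first, using 7 bits per byte.
// shift must be at most 63.
func prefixCode(v int64, shift uint) []byte {
	nChars := (63-shift)/7 + 1
	out := make([]byte, nChars+1)
	out[0] = shiftStart + byte(shift)
	// Flip the sign bit so negative values sort before positive ones.
	u := (uint64(v) ^ 0x8000000000000000) >> shift
	for i := nChars; i > 0; i-- {
		out[i] = byte(u & 0x7f)
		u >>= 7
	}
	return out
}

func main() {
	v := sortableInt64(4) // e.g. a wheel count of 4
	// Assumed precision step of 4: one term per shift 0, 4, ..., 60.
	for shift := uint(0); shift < 64; shift += 4 {
		fmt.Printf("shift %2d: %q\n", shift, prefixCode(v, shift))
	}
}
```

A term facet counts each of these byte strings as a separate term, which is why every document contributes several unreadable entries instead of one number.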

The only way to get the behavior you want is for us to "know" these are numerically encoded terms, but there is no guaranteed way to do that. It is an unfortunate side-effect of the implementation.