blevesearch / bleve

A modern text/numeric/geo-spatial/vector indexing library for go
Apache License 2.0

Index.Search() sometimes returns 0 results #217

Open piger opened 9 years ago

piger commented 9 years ago

As the title suggests, Search() sometimes returns 0 results:

$ ./corpus benzina
2015/07/20 17:37:13 No results
$ ./corpus benzina
0.2252  docs/benzina.txt

$ for i in $(seq 1 100); do ./corpus -highlights benzina 2>&1| wc -l; done
       1
       4
       1
       4
       1
       1
       1
       4
       1
       ...

The command in the for loop should always return 4 lines of text showing highlights from the results.

I'm using the latest bleve version with 3 custom Analyzers (for the libstemmer token filter) and 3 different DocumentMappings; I get the same behavior on OS X (go 1.4.2) and Linux (go 1.3.3), both amd64.

The code of corpus is on github at https://github.com/piger/corpus on the branch analyzers.

The code for the custom analyzers is copied from the bleve sources as you can see here: https://github.com/piger/corpus/blob/analyzers/analyzers.go

Index mappings, search and index functions reside in: https://github.com/piger/corpus/blob/analyzers/index.go

mschoch commented 9 years ago

So far I haven't been able to reproduce this. First, one (probably) unrelated thing:

cmd/corpus references "git.autistici.org/ale/corpus" and "git.autistici.org/ale/corpus/file", not "github.com/piger/corpus" and "github.com/piger/corpus/file". I made this change locally; I'm not sure the code is identical, but it's one possible source of different behavior.

Next, I don't know the contents of docs/benzina.txt -- so I changed the walker MinSize to 0, and gave docs/benzina.txt the contents "benzina". When I do this, the command:

$ ./corpus benzina
0.1941 docs/benzina.txt

Seems to match every time. Regardless of the content of the file, it's still not clear to me how the behavior could be non-deterministic. But I suppose the next step is still to try to reproduce what you see with the actual contents of this file.

Can you share the contents of this file?

piger commented 9 years ago

Sorry for the wrong import paths, I totally forgot about that; I've pushed a fix on the analyzers branch. The file benzina.txt contains a small excerpt from an Italian Wikipedia page and you can find it here

To reproduce step-by-step:

go get -d -tags "libstemmer" github.com/piger/corpus
cd $GOPATH/src/github.com/piger/corpus
git checkout -b analyzers origin/analyzers
make corpus
# next command will create a directory "db"
./corpus -index /path/to/benzina.txt
./corpus benzina

Expected output:

 0.0691  benzina.txt

Sometimes I get:

2015/07/20 21:32:06 No results

mschoch commented 9 years ago

Thanks for providing so many details. I have now been able to reproduce the behavior you're reporting.

Here is what I suspect is going on (I haven't actually traced through all the code to verify it, but there is some supporting evidence).

Summary: When you have multiple document types with the same field names, there can be ambiguities at query time.

Details:

Doc types doc_it, doc_en and doc all have the same fields. Bleve (like Elasticsearch) has some limitations when you have distinct types that contain the same fields. Specifically, underneath the hood they map to the same field at the lower level.

Often, this is what you want, because users may just want to search on "title" and not care what type it is. However, there is a problem when you go to execute a search. If the fields in the respective types have different analyzers, it's ambiguous how a search for "benzina" should be performed.

A quick check in the Analysis wizard shows that the "it" analyzer will stem benzina into benzin, whereas the "en" analyzer will produce benzina.

Since the query for "benzina" doesn't specify a field, the first thing Bleve does is look in the mapping for the DefaultField; this code base has it set to "content". Next, Bleve looks up the correct analyzer for the "content" field. But it doesn't know which document type applies, so it looks through each type, optimistically assuming that the field names will be unique.

The order in which we visit the types is not guaranteed, and I believe the runtime even intentionally randomizes it. What this means is that sometimes we get "doc_en" and sometimes we get "doc_it". So sometimes a search for "benzina" searches for the term "benzina" and sometimes it searches for "benzin".

In our case, the document was correctly indexed as "it", so we will only find the document when we search for "benzin".

You can confirm this, because if you explicitly search for "benzin" you will always find the document.
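The lookup described above can be illustrated with a minimal standalone sketch. It does not use bleve; typeMappings and analyzerForField are hypothetical names invented for this example, and only two of the three document types are modelled. It relies on the fact that Go leaves map iteration order unspecified, so "first type that defines the field" is not a stable answer.

```go
package main

import "fmt"

// typeMappings is a hypothetical stand-in for an index mapping:
// document type name -> field name -> analyzer name.
// It is NOT bleve's real data structure.
var typeMappings = map[string]map[string]string{
	"doc_it": {"content": "it"},
	"doc_en": {"content": "en"},
}

// analyzerForField returns the analyzer of the first type that defines the
// field. Go does not specify map iteration order, so when two types share a
// field name the result can differ from one call to the next.
func analyzerForField(mappings map[string]map[string]string, field string) string {
	for _, fields := range mappings {
		if analyzer, ok := fields[field]; ok {
			return analyzer
		}
	}
	return ""
}

func main() {
	counts := map[string]int{}
	for i := 0; i < 1000; i++ {
		counts[analyzerForField(typeMappings, "content")]++
	}
	// Typically both analyzers show up; the exact split varies per run.
	fmt.Println(counts)
}
```

Under this model, a query for "benzina" is stemmed to "benzin" only on the runs where the "it" analyzer happens to be picked, which matches the intermittent "No results" in the report.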

So, that's a rather long explanation of why it behaves the way it does. For better or worse, Elasticsearch can have this same issue, and that's how we inherited it. See https://www.elastic.co/blog/great-mapping-refactoring for more details.

At the moment there is no great solution for this. One idea we're considering is to greatly simplify mappings, remove types, and encourage users to put different types into different indexes, and then use IndexAlias to search across them. This still won't work correctly today, but it could be made to correctly run the appropriate search on each of the underlying indexes.

I'm going to run this by a few other people and try to come up with some more concrete plans.

piger commented 9 years ago

Hi Marty, thank you for taking the time to write such a detailed analysis. It has been really interesting, even if it means that I have to put my pet project on hold for a while :)

c4milo commented 7 years ago

I believe I might be running into this one too. It says there were 3 hits, but it does not return any documents:

{
  "status": {
    "total": 1,
    "failed": 0,
    "successful": 1
  },
  "request": {
    "query": {
      "match": "amd64 darwin auth",
      "prefix_length": 0,
      "fuzziness": 0
    },
    "size": 0,
    "from": 0,
    "highlight": null,
    "fields": null,
    "facets": null,
    "explain": false,
    "sort": [
      "-_score"
    ]
  },
  "hits": [],
  "total_hits": 3,
  "max_score": 0.19621430835857606,
  "took": 159628,
  "facets": {}
}

mschoch commented 7 years ago

No, in your case, you have requested a "size" of 0. At least that is what is echoed back in the request portion. When you request size 0, you get 0 hits. The size should default to 10 if you use NewSearchRequest().
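As a rough way to spot this situation, the sketch below decodes a response like the JSON above and checks whether documents matched but none were returned because the echoed request size is 0. It uses only the standard library; searchResponse and sizeZeroHidesHits are hypothetical names for this example, and only the JSON field names (request.size, hits, total_hits) come from the response shown in the thread.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// searchResponse mirrors only the fields of the JSON response that matter
// for this check. It is a hypothetical type, not bleve's SearchResult.
type searchResponse struct {
	Request struct {
		Size int `json:"size"`
	} `json:"request"`
	Hits      []json.RawMessage `json:"hits"`
	TotalHits uint64            `json:"total_hits"`
}

// sizeZeroHidesHits reports whether a response matched documents but
// returned none because the request asked for zero results.
func sizeZeroHidesHits(raw []byte) (bool, error) {
	var resp searchResponse
	if err := json.Unmarshal(raw, &resp); err != nil {
		return false, err
	}
	return resp.TotalHits > 0 && len(resp.Hits) == 0 && resp.Request.Size == 0, nil
}

func main() {
	// Trimmed-down version of the response posted above.
	raw := []byte(`{"request": {"size": 0, "from": 0}, "hits": [], "total_hits": 3}`)
	hidden, err := sizeZeroHidesHits(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(hidden) // prints "true"
}
```

If the check returns true, the query itself is matching and the fix is on the request side: build the request with NewSearchRequest() or set a non-zero size.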

c4milo commented 7 years ago

that did it! thank you! I'm a noob trying out Bleve as you already noticed. Thanks for this project, it's looking great!