blevesearch / bleve

A modern text/numeric/geo-spatial/vector indexing library for Go
Apache License 2.0

"Key too large" error in boltDB with a very long token #1576

Open · MichaelMure opened this issue 3 years ago

MichaelMure commented 3 years ago

My project is a distributed bug tracker embedded in git. It includes Bleve indexing so that bugs can be searched quickly.

When importing one real issue, Bleve indexed a 65536-character-long token and tried to write a value with that very large key into BoltDB, which returned the "key too large" error.

I tried using the length token filter as a workaround, but somehow that doesn't seem to work.

This looks like a bug to me. No matter how the text analysis is done, Bleve should not trigger such an error in BoltDB. Maybe Bleve should, by default, ignore all tokens longer than some reasonable length? Or at least those longer than what the underlying KV store can accept?

Thanks for this tool, it's great.

mschoch commented 3 years ago

This particular limitation affects the older upsidedown index format when it is used with the (default) BoltDB storage. That index format is now deprecated, and we may remove support for it in a future release. While there may be some things we can do to improve this, it's problematic because upsidedown supports a variety of stores with different limitations, and silently dropping data and returning an error are both undesirable, depending on what you're trying to do.

Instead, I'd recommend we focus on why the length filter isn't working for you. It is designed specifically for cases like this. Can you share more about what you tried?

MichaelMure commented 3 years ago

Oh, interesting. I don't think I did anything special to opt into that legacy format. Is there anything I need to do to use the new format?

Regarding the length filter, this is what I did:

err = mapping.AddCustomTokenFilter(length.Name, map[string]interface{}{
    "max":  100.0,
    "type": length.Name,
})

Attaching a debugger, I can see that the constructor of this token filter is called, but LengthFilter.Filter(...) is never called when I index something.

If you don't mind, I have an off-topic question: is it possible for the index to grow much larger than the indexed documents if they are indexed multiple times?

mschoch commented 3 years ago

@MichaelMure the code snippet you shared just defines a custom token filter; you must then define an analyzer that uses it, and then configure fields to use that analyzer (or make your custom analyzer the default).

Please open a new issue with your other question.

mschoch commented 3 years ago

The snippet you showed looks a little similar to the beer-search example (which uses truncate instead of length), and that example illustrates the other steps I was talking about. In addition to defining the token filter, you must also build an analyzer that uses it, like:

https://github.com/blevesearch/beer-search/blob/204db6c49802de609443fab595ce08cee5e58b97/mapping_example1.go#L76-L90
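
For readers without the link handy, the shape of that analyzer step is roughly the following. This is a sketch assuming bleve v2 import paths; the names "notTooLong" and "enNotTooLong" and the exact filter chain are illustrative, not the beer-search code verbatim:

package main

import (
    "github.com/blevesearch/bleve/v2"
    "github.com/blevesearch/bleve/v2/analysis/analyzer/custom"
    "github.com/blevesearch/bleve/v2/analysis/lang/en"
    "github.com/blevesearch/bleve/v2/analysis/token/length"
    "github.com/blevesearch/bleve/v2/analysis/token/lowercase"
    "github.com/blevesearch/bleve/v2/analysis/tokenizer/unicode"
    "github.com/blevesearch/bleve/v2/mapping"
)

// buildIndexMapping registers a length-capped token filter and a custom
// analyzer that uses it. All names here are illustrative.
func buildIndexMapping() (mapping.IndexMapping, error) {
    indexMapping := bleve.NewIndexMapping()

    // Register the built-in length filter under a custom name,
    // dropping tokens longer than 100 characters.
    err := indexMapping.AddCustomTokenFilter("notTooLong",
        map[string]interface{}{
            "type": length.Name,
            "max":  100.0,
        })
    if err != nil {
        return nil, err
    }

    // Build an analyzer whose token filter chain ends with that cap,
    // so every token it emits fits within the limit.
    err = indexMapping.AddCustomAnalyzer("enNotTooLong",
        map[string]interface{}{
            "type":      custom.Name,
            "tokenizer": unicode.Name,
            "token_filters": []string{
                lowercase.Name,
                en.StopName,
                "notTooLong",
            },
        })
    if err != nil {
        return nil, err
    }

    return indexMapping, nil
}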

Then, if you want to use this on a particular field, define a field that uses the analyzer:

https://github.com/blevesearch/beer-search/blob/204db6c49802de609443fab595ce08cee5e58b97/mapping_example1.go#L30-L32
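
Continuing the sketch above (the field name is hypothetical), that per-field step looks roughly like:

    // A text field mapping that analyzes its content with the custom
    // analyzer instead of the default one.
    descFieldMapping := bleve.NewTextFieldMapping()
    descFieldMapping.Analyzer = "enNotTooLong"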

Then use that field definition in the mapping of a field:

https://github.com/blevesearch/beer-search/blob/204db6c49802de609443fab595ce08cee5e58b97/mapping_example1.go#L48-L50
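
Still within the same sketch, the field mapping then gets attached to a document mapping at the field's path; the "bug" and "description" names below are taken from this issue's bug-tracker context, not from beer-search:

    // Attach the field mapping at the document's "description" property,
    // then register the document mapping on the index mapping.
    bugMapping := bleve.NewDocumentMapping()
    bugMapping.AddFieldMappingsAt("description", descFieldMapping)
    indexMapping.AddDocumentMapping("bug", bugMapping)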

That example indexes the same field two ways, one with the length filter and one without (to illustrate how it works; it's not something you'd normally do in this case).

An alternative (not shown in this example) is to change the mapping's default analyzer to the one you built, if you want the length limit to apply to all fields. That could be done with:

indexMapping.DefaultAnalyzer = "enNotTooLong"

If you still have questions connecting these things up, let me know.