blevesearch / bleve

A modern text/numeric/geo-spatial/vector indexing library for go
Apache License 2.0

Phonetics analyzer - Metaphone and Double Metaphone algorithm #511

Open vividvilla opened 7 years ago

vividvilla commented 7 years ago

It would be nice to have phonetic token filters. Metaphone and Double Metaphone are the most popular algorithms available now. Here is a list of Go libraries which implement them.

gleicon commented 7 years ago

I'd be up for tackling this if anyone has tips on where to start and best practices. I'm already familiar w/ metaphone and non-English languages.

snadrus commented 3 years ago

I'd like to pick this up with some guidance. To show I'm capable & serious, I've implemented Metaphone3 in Golang: Metaphone3 in Go

mschoch commented 3 years ago

@snadrus great! Unfortunately, the repo you linked to appears to be empty.

So this ticket is requesting support for a bleve Analyzer. Here is what a bleve Analyzer looks like:

https://github.com/blevesearch/bleve/blob/master/analysis/type.go#L74-L80

Essentially, a sequence of character filters is invoked first. Then a single tokenizer is invoked, turning the []byte into a TokenStream ([]*Token). Finally, a sequence of token filters is invoked, each of which can add, modify, or delete individual tokens.
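To make the order of the stages concrete, here is a rough, self-contained sketch of that pipeline. The types are simplified stand-ins that only mirror the shape of bleve's analysis types (the real ones carry more fields), and `stripCommas`, `whitespaceTokenizer`, and `lowercase` are hypothetical toy stages invented for this example, not part of bleve.

```go
package main

import (
	"fmt"
	"strings"
)

// Simplified stand-ins for the types in bleve's analysis package.
type Token struct {
	Term     []byte
	Position int
}

type TokenStream []*Token

type CharFilter interface{ Filter([]byte) []byte }
type Tokenizer interface{ Tokenize([]byte) TokenStream }
type TokenFilter interface{ Filter(TokenStream) TokenStream }

type Analyzer struct {
	CharFilters  []CharFilter
	Tokenizer    Tokenizer
	TokenFilters []TokenFilter
}

// Analyze runs character filters, then the tokenizer, then token filters.
func (a *Analyzer) Analyze(input []byte) TokenStream {
	for _, cf := range a.CharFilters {
		input = cf.Filter(input)
	}
	tokens := a.Tokenizer.Tokenize(input)
	for _, tf := range a.TokenFilters {
		tokens = tf.Filter(tokens)
	}
	return tokens
}

// Hypothetical toy stages, for illustration only.
type stripCommas struct{}

func (stripCommas) Filter(in []byte) []byte {
	return []byte(strings.ReplaceAll(string(in), ",", " "))
}

type whitespaceTokenizer struct{}

func (whitespaceTokenizer) Tokenize(in []byte) TokenStream {
	var ts TokenStream
	for i, w := range strings.Fields(string(in)) {
		// Positions are 1-based here.
		ts = append(ts, &Token{Term: []byte(w), Position: i + 1})
	}
	return ts
}

type lowercase struct{}

func (lowercase) Filter(in TokenStream) TokenStream {
	for _, t := range in {
		t.Term = []byte(strings.ToLower(string(t.Term)))
	}
	return in
}

func main() {
	a := &Analyzer{
		CharFilters:  []CharFilter{stripCommas{}},
		Tokenizer:    whitespaceTokenizer{},
		TokenFilters: []TokenFilter{lowercase{}},
	}
	for _, t := range a.Analyze([]byte("Hello,World")) {
		fmt.Printf("%d %s\n", t.Position, t.Term)
	}
}
```

A phonetic filter would slot into `TokenFilters`, most likely after any lowercasing filter so the encoder sees normalized terms.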

Since I can't see how metaphone3 works, my best guess is that you should start by creating a token filter. The token filter will range over each token in the stream. At that point you pass the token text to metaphone3, get something back, and turn this output into either new or modified tokens.
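As a minimal sketch of the "modified tokens" variant, the filter below ranges over the stream and overwrites each term with its code. `Token` and `TokenStream` are simplified stand-ins for bleve's types, and `toyEncode` is a hypothetical placeholder, not Metaphone: it only upper-cases the term and drops non-leading vowels, so that the example runs without the metaphone3 package.

```go
package main

import (
	"fmt"
	"strings"
)

// Simplified stand-ins for bleve's analysis.Token and analysis.TokenStream.
type Token struct {
	Term     []byte
	Position int
}

type TokenStream []*Token

// toyEncode is a hypothetical placeholder for a real phonetic encoder.
// It upper-cases the term and drops every vowel except a leading one.
func toyEncode(term string) string {
	var b strings.Builder
	for i, r := range strings.ToUpper(term) {
		if i > 0 && strings.ContainsRune("AEIOU", r) {
			continue
		}
		b.WriteRune(r)
	}
	return b.String()
}

// PhoneticFilter has the same method shape as a bleve token filter:
// it takes a TokenStream and returns a TokenStream.
type PhoneticFilter struct{}

func (PhoneticFilter) Filter(in TokenStream) TokenStream {
	for _, tok := range in {
		// Replace the term in place; the position is left untouched.
		tok.Term = []byte(toyEncode(string(tok.Term)))
	}
	return in
}

func main() {
	in := TokenStream{
		{Term: []byte("choch"), Position: 1},
		{Term: []byte("Smith"), Position: 2},
	}
	for _, tok := range (PhoneticFilter{}).Filter(in) {
		fmt.Printf("%d %s\n", tok.Position, tok.Term)
	}
}
```

Replacing the term keeps the index small but discards the exact spelling, which is one reason to consider emitting extra tokens instead.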

If you have more questions let me know.

snadrus commented 3 years ago

Oops, I pushed it now. Thanks for the guidance, but here's my next problem:

m := metaphone3.New()
primary, alternate := m.Encode("choch")

Given this, I now have 3 tokens in order of value: the original term, primary, and alternate.

How can I express that all 3 tokens apply, but apply greater significance to an exact match?

mschoch commented 3 years ago

So, there isn't a way to do that explicitly, but I suggest we ignore that for now. Let's get it working first, and then we can review the cases where it doesn't work well or could be improved.

To start, just have the token filter emit all 3 tokens, and use the same "position" as the original term. That should allow phrases to match correctly as well.
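A rough sketch of that suggestion follows. The `Token` type is a simplified stand-in for bleve's, and `toyEncode` is a hypothetical two-value encoder standing in for metaphone3's `Encode` (it is not a real phonetic algorithm). Skipping empty or duplicate codes is an assumption of this sketch, not something discussed above.

```go
package main

import (
	"fmt"
	"strings"
)

// Simplified stand-in for bleve's analysis.Token.
type Token struct {
	Start, End int // byte offsets of the original text
	Term       []byte
	Position   int
}

type TokenStream []*Token

// toyEncode is a hypothetical placeholder returning two codes.
// primary: upper-cased term without non-leading vowels.
// alternate: primary with "CH" rewritten to "K".
func toyEncode(term string) (primary, alternate string) {
	var b strings.Builder
	for i, r := range strings.ToUpper(term) {
		if i > 0 && strings.ContainsRune("AEIOU", r) {
			continue
		}
		b.WriteRune(r)
	}
	primary = b.String()
	alternate = strings.ReplaceAll(primary, "CH", "K")
	return primary, alternate
}

type MultiCodeFilter struct{}

// Filter keeps each original token and adds its codes at the same position.
func (MultiCodeFilter) Filter(in TokenStream) TokenStream {
	out := make(TokenStream, 0, len(in)*3)
	for _, tok := range in {
		out = append(out, tok)
		primary, alternate := toyEncode(string(tok.Term))
		seen := map[string]bool{string(tok.Term): true}
		for _, code := range []string{primary, alternate} {
			if code == "" || seen[code] {
				continue // assumption: drop empty and duplicate codes
			}
			seen[code] = true
			out = append(out, &Token{
				Start:    tok.Start,
				End:      tok.End,
				Term:     []byte(code),
				Position: tok.Position, // same position as the original term
			})
		}
	}
	return out
}

func main() {
	in := TokenStream{
		{Start: 0, End: 5, Term: []byte("choch"), Position: 1},
		{Start: 6, End: 8, Term: []byte("go"), Position: 2},
	}
	for _, tok := range (MultiCodeFilter{}).Filter(in) {
		fmt.Printf("%d %s\n", tok.Position, tok.Term)
	}
}
```

Because the added tokens share the original's position, a phrase query sees them as alternatives for the same slot rather than as extra words.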

jgschis commented 1 year ago

Was this ever completed?

snadrus commented 1 year ago

No, but you're welcome to do so. I no longer need this.