Yomguithereal / clj-fuzzy

A handy collection of algorithms dealing with fuzzy strings and phonetics.
http://yomguithereal.github.io/clj-fuzzy/
MIT License

Add option to dice to control n-gram size #42

Closed tkocmathla closed 7 years ago

tkocmathla commented 7 years ago

This PR adds a new option to dice/coefficient to allow user control of the n-gram size. The current implementation hard-codes the n-gram size to 2. This change keeps 2 as the default.

Yomguithereal commented 7 years ago

Seems like a nice addition. Did you find a use case where using the dice/sorensen coefficient is better than applying a Jaccard index on tri- or quadrigrams?

tkocmathla commented 7 years ago

I admit to ruling out Jaccard when I saw that jaccard/distance used unigrams (implicitly, for string inputs). I didn't think of partitioning the strings myself and using jaccard/index.

Here are some tests for uni-, bi-, and tri-grams on two inputs, "abcd" and "abcx". The results from dice/coefficient seem closer to what I'd intuitively expect. Is this a fair comparison?

=> (clj-fuzzy.dice/coefficient "abcd" "abcx" :n 1)
0.75
=> (clj-fuzzy.jaccard/index (set (partition 1 1 "abcd")) (set (partition 1 1 "abcx")))
3/5

=> (clj-fuzzy.dice/coefficient "abcd" "abcx" :n 2)
0.6666666666666666
=> (clj-fuzzy.jaccard/index (set (partition 2 1 "abcd")) (set (partition 2 1 "abcx")))
1/2

=> (clj-fuzzy.dice/coefficient "abcd" "abcx" :n 3)
0.5
=> (clj-fuzzy.jaccard/index (set (partition 3 1 "abcd")) (set (partition 3 1 "abcx")))
1/3
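
For readers comparing the two columns above: on plain sets, the Dice coefficient D and the Jaccard index J are tied by D = 2J / (1 + J), so each is a monotonic rescaling of the other and both rank a given list of pairs in the same order. The sketch below checks this on the same inputs. It is a minimal, set-based illustration and does not call clj-fuzzy; the helper names (`ngram-set`, `jaccard-index`, `dice-coefficient`, `jaccard->dice`) are hypothetical and chosen only for this example.

```clojure
(require '[clojure.set :as set])

(defn ngram-set
  "Set of contiguous n-grams of s (each n-gram is a seq of characters)."
  [n s]
  (set (partition n 1 s)))

(defn jaccard-index
  "|A ∩ B| / |A ∪ B| for two sets. Assumes at least one set is non-empty."
  [a b]
  (/ (count (set/intersection a b))
     (count (set/union a b))))

(defn dice-coefficient
  "2|A ∩ B| / (|A| + |B|) for two sets. Assumes at least one set is non-empty."
  [a b]
  (/ (* 2 (count (set/intersection a b)))
     (+ (count a) (count b))))

(defn jaccard->dice
  "Converts a Jaccard index into the matching Dice coefficient."
  [j]
  (/ (* 2 j) (+ 1 j)))

;; Compare both measures for uni-, bi-, and tri-grams.
(doseq [n [1 2 3]]
  (let [a (ngram-set n "abcd")
        b (ngram-set n "abcx")
        j (jaccard-index a b)
        d (dice-coefficient a b)]
    (println n j d (= d (jaccard->dice j)))))
;; prints:
;; 1 3/5 3/4 true
;; 2 1/2 2/3 true
;; 3 1/3 1/2 true
```

The exact ratios 3/4, 2/3, and 1/2 correspond to the 0.75, 0.6666666666666666, and 0.5 reported above. The sketch does not guard against two empty n-gram sets (for example when both strings are shorter than n), which would divide by zero.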

Yomguithereal commented 7 years ago

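Well, I guess it depends heavily on your use case and your data. One advantage of Jaccard, for instance, is that the Jaccard distance (1 minus the index) is a true metric, while the Dice distance is only a semimetric (it does not satisfy the triangle inequality), which makes it unfit for some indexing structures.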
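
To make the metric point concrete, here is a small sketch of a standard counterexample: three tiny sets for which the Dice distance (1 minus the coefficient) breaks the triangle inequality while the Jaccard distance (1 minus the index) does not. It is self-contained and does not use clj-fuzzy; the function names and the sets `x`, `y`, `z` are hypothetical, picked only for illustration.

```clojure
(require '[clojure.set :as set])

(defn jaccard-distance
  "1 - |A ∩ B| / |A ∪ B|. Assumes at least one set is non-empty."
  [a b]
  (- 1 (/ (count (set/intersection a b))
          (count (set/union a b)))))

(defn dice-distance
  "1 - 2|A ∩ B| / (|A| + |B|). Assumes at least one set is non-empty."
  [a b]
  (- 1 (/ (* 2 (count (set/intersection a b)))
          (+ (count a) (count b)))))

(defn triangle-holds?
  "True when dist(x, y) <= dist(x, z) + dist(z, y)."
  [dist x y z]
  (<= (dist x y) (+ (dist x z) (dist z y))))

;; x and y are disjoint; z overlaps both.
(def x #{:a})
(def y #{:b})
(def z #{:a :b})

(println (triangle-holds? jaccard-distance x y z)) ; true:  1 <= 1/2 + 1/2
(println (triangle-holds? dice-distance x y z))    ; false: 1 >  1/3 + 1/3
```

Index structures that prune candidates using the triangle inequality (metric trees, for example) rely on exactly this property, which is why the distinction can matter for some use cases and not for a one-off pairwise comparison.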

What are you trying to achieve here?

Yomguithereal commented 7 years ago

Anyway, I don't see any reason not to merge your PR, so I'll try to find some time to do it soon. Ping me if I forget.

tkocmathla commented 7 years ago

Ping

Yomguithereal commented 7 years ago

v0.4.0 is now published on clojars & npm.