tkocmathla closed 7 years ago
Seems like a nice addition. Did you find a use case where using the dice/sorensen coefficient is better than applying a Jaccard index on tri- or quadrigrams?
I admit to ruling out Jaccard when I saw that jaccard/distance used unigrams (implicitly, for string inputs). I didn't think of partitioning the strings myself and using jaccard/index.
Here are some tests for uni-, bi-, and tri-grams on two inputs, "abcd" and "abcx". The results from dice/coefficient seem closer to what I'd intuitively expect. Is this a fair comparison?
=> (clj-fuzzy.dice/coefficient "abcd" "abcx" :n 1)
0.75
=> (clj-fuzzy.jaccard/index (set (partition 1 1 "abcd")) (set (partition 1 1 "abcx")))
3/5
=> (clj-fuzzy.dice/coefficient "abcd" "abcx" :n 2)
0.6666666666666666
=> (clj-fuzzy.jaccard/index (set (partition 2 1 "abcd")) (set (partition 2 1 "abcx")))
1/2
=> (clj-fuzzy.dice/coefficient "abcd" "abcx" :n 3)
0.5
=> (clj-fuzzy.jaccard/index (set (partition 3 1 "abcd")) (set (partition 3 1 "abcx")))
1/3
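For reference, the two scores in each pair above are tied together by a fixed relation, which may help judge whether the comparison is fair. A short derivation, assuming both functions operate on plain sets of n-grams A and B (as the `set` calls above do for Jaccard; this is an assumption for dice/coefficient):

```latex
% J = Jaccard index, D = Dice coefficient, for n-gram sets A and B
J = \frac{|A \cap B|}{|A \cup B|}
  = \frac{|A \cap B|}{|A| + |B| - |A \cap B|},
\qquad
D = \frac{2\,|A \cap B|}{|A| + |B|}
\quad\Longrightarrow\quad
D = \frac{2J}{1 + J}

% Check against the REPL output above:
% J = 3/5 \Rightarrow D = (6/5)/(8/5) = 0.75
% J = 1/2 \Rightarrow D = 1/(3/2) = 0.666\ldots
% J = 1/3 \Rightarrow D = (2/3)/(4/3) = 0.5
```

Since D is an increasing function of J, both scores rank candidate pairs in the same order under that assumption; Dice just reports a higher number for partial overlaps.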
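Well, I guess it depends heavily on your use case and your data. One advantage of Jaccard, for instance, is that the Jaccard distance (1 - index) is a true metric, while the Dice distance is only a semimetric: it can violate the triangle inequality, which makes it unfit for some indexing structures.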
What are you trying to achieve here?
Anyway, I don't see any reason not to merge your PR, so I'll try to find some time to do it soon. Ping me if I forget.
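The metric vs. semimetric point above can be checked on a tiny example. This is a minimal sketch, not clj-fuzzy code: `dice-distance` and `jaccard-distance` are hypothetical helpers defined here on plain Clojure sets, each taken as 1 minus the corresponding similarity.

```clojure
(require '[clojure.set :as set])

;; Hypothetical helpers, not part of clj-fuzzy.
;; Both assume non-empty sets (empty inputs would divide by zero).
(defn dice-distance
  "1 minus the Sorensen-Dice coefficient of sets a and b."
  [a b]
  (- 1 (/ (* 2 (count (set/intersection a b)))
          (+ (count a) (count b)))))

(defn jaccard-distance
  "1 minus the Jaccard index of sets a and b."
  [a b]
  (- 1 (/ (count (set/intersection a b))
          (count (set/union a b)))))

(def x #{:a})
(def y #{:b})
(def z #{:a :b})

;; Triangle inequality: d(x, y) <= d(x, z) + d(z, y)
(def dice-holds?
  (<= (dice-distance x y)
      (+ (dice-distance x z) (dice-distance z y))))

(def jaccard-holds?
  (<= (jaccard-distance x y)
      (+ (jaccard-distance x z) (jaccard-distance z y))))

(println "Dice satisfies the triangle inequality here?" dice-holds?)
(println "Jaccard satisfies the triangle inequality here?" jaccard-holds?)
```

For these three sets the Dice distances are 1 for the direct path and 1/3 + 1/3 through z, so the inequality fails; the Jaccard distances are 1 and 1/2 + 1/2, so it holds. Index structures that prune using the triangle inequality depend on it holding.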
Ping
v0.4.0 is now published on clojars & npm.
This PR adds a new option to dice/coefficient to allow user control of the n-gram size. The current implementation hard-codes the n-gram size to 2. This change keeps 2 as the default.
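To illustrate the shape of the option described above, here is a minimal sketch of a Dice coefficient with a configurable n-gram size. It is not the code from this PR: `n-grams` and `dice-coefficient` are hypothetical names, the n-grams are treated as sets, and returning 0.0 when neither string yields any n-gram is an assumption of this sketch.

```clojure
(require '[clojure.set :as set])

;; Hypothetical sketch, not the clj-fuzzy implementation.
(defn n-grams
  "Set of contiguous n-character windows of string s."
  [n s]
  (set (partition n 1 s)))

(defn dice-coefficient
  "Sorensen-Dice coefficient over sets of n-grams.
   The :n option selects the n-gram size and defaults to 2."
  [a b & {:keys [n] :or {n 2}}]
  (let [ga    (n-grams n a)
        gb    (n-grams n b)
        total (+ (count ga) (count gb))]
    (if (zero? total)
      0.0 ; assumption: no n-grams on either side counts as no similarity
      (double (/ (* 2 (count (set/intersection ga gb))) total)))))

(println (dice-coefficient "abcd" "abcx"))       ; bigrams by default
(println (dice-coefficient "abcd" "abcx" :n 1))  ; unigrams
(println (dice-coefficient "abcd" "abcx" :n 3))  ; trigrams
```

Keeping the default at 2 means existing two-argument calls behave as before, and callers opt in to other sizes with `:n`.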