Open colasri opened 2 years ago
I am not entirely sure, but there seems to be an issue with the `group_all_strings` parameter combined with `link_min_similarity`. What most likely is happening is that (appl, apple) gets into the cluster `apples` and (happy, appl) gets into the same cluster because it shared `appl`. I'll have to dig a little deeper to figure this stuff out, but I'll make sure it gets released in the next version!
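To make the suspected mechanism concrete, here is a minimal, self-contained sketch. It is not PolyFuzz's actual code: the helper names `build_clusters` and `assign_groups` are hypothetical, and only the 0.24 score for (happy, appl) comes from the report; the other two scores are invented for illustration. It shows how a string can stay outside a cluster yet still inherit that cluster's label through the string it was matched to.

```python
LINK_MIN_SIMILARITY = 0.5

# Pairwise scores. Only ("happy", "appl", 0.24) is from the report;
# the other two values are made up for this illustration.
links = [
    ("appl", "apple", 0.75),
    ("apple", "apples", 0.80),
    ("happy", "appl", 0.24),
]

# Best match per string from the initial pass (min_similarity = 0.2).
best_match = {"happy": ("appl", 0.24)}


def build_clusters(links, threshold):
    """Single-linkage clustering: join strings whose score meets the threshold."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b, score in links:
        if score >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_a] = root_b

    clusters = {}
    for s in list(parent):
        clusters.setdefault(find(s), set()).add(s)
    return list(clusters.values())


def assign_groups(best_match, clusters):
    """Suspected flaw: the group is looked up via the matched string,
    without re-checking the score against the link threshold."""
    label = {}
    for cluster in clusters:
        name = max(cluster, key=len)  # arbitrary choice of cluster label
        for member in cluster:
            label[member] = name
    return {src: label.get(dst) for src, (dst, _score) in best_match.items()}


clusters = build_clusters(links, LINK_MIN_SIMILARITY)
groups = assign_groups(best_match, clusters)
print(clusters)
print(groups)
```

In this sketch `happy` never joins the cluster, because 0.24 is below the link threshold, but it is still labelled `apples` because its match `appl` is a cluster member. A fix along these lines would compare the original match score with `link_min_similarity` before assigning the group.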
In the code below (with output in the attached picture), I perform a simple TF-IDF matching of `["apple", "apples", "appl", "recal", "happy"]`. The initial `min_similarity` is set to 0.2. The similarity of `happy` and `appl` is 0.24. When grouping with a `link_min_similarity` of 0.5, `happy` should not belong in the `apples` group, yet in the output of `.get_matches()` it is in the `apples` group. It appears it is not in the cluster, though.
Plain text code: