Closed pa1007 closed 2 years ago
Hi @pa1007, a similarity score of 1 (or a distance of 0) means an exact match. Filtering those out prevents a value from being returned as a match for itself (basically a duplicate).
Yes, but in some applications (for example, finding the nearest match to a string or a list of strings) we need the exact match if it exists, or otherwise the closest correspondence. It would be a great idea to give the user the choice to remove exact matches or to let them appear in the results.
Sorry, I haven't been close to the code in a while. I'm pretty sure the command still returns exact matches, as I have seen it do that. But I think that code may belong to the multi-value compare, and I seem to remember that without it you get unwanted duplicates. Did you see duplicates in the results in the Splunk interface when you removed it?
Yep it belongs to the multi-value compare
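For example, in this case we can see that "brain" compared to "brain" does not show in the results: [image] https://user-images.githubusercontent.com/29738353/186729563-0505a393-f11a-49c6-88b3-d87919cb4003.png
But if we remove the line shown in the original post, we can see that the exact match is shown: [image] https://user-images.githubusercontent.com/29738353/186729443-987c7e78-3e1d-494f-a6db-32e969b3c411.png
So for my application of the command, I need the exact match when it exists, as I want to show the top similarity between two lists of strings.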
Very good. It is impossible to know every application others may have for the commands, so if you have a pull request that makes it optional, that works for me.
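To illustrate what "optional" could look like, here is a minimal sketch. It is not the code in `similarity.py`: the function names and the `keep_exact` flag are hypothetical, and `difflib.SequenceMatcher` from the Python standard library stands in for whatever similarity measure the command is configured with. It only shows the idea of comparing two lists of strings and letting the caller decide whether pairs with a similarity of 1 are kept.

```python
from difflib import SequenceMatcher
from itertools import product


def compare_lists(left, right, keep_exact=True):
    """Score every (left, right) pair; optionally drop exact matches.

    keep_exact is a hypothetical option, not a real parameter of the command.
    """
    rows = []
    for a, b in product(left, right):
        # ratio() is 1.0 only when the two strings are identical
        score = SequenceMatcher(None, a, b).ratio()
        if not keep_exact and score == 1.0:
            continue  # mimic the filter discussed in this thread
        rows.append((a, b, score))
    return rows


def top_match(left, right, keep_exact=True):
    """Return the best-scoring right-hand string for each left-hand string."""
    best = {}
    for a, b, score in compare_lists(left, right, keep_exact):
        if a not in best or score > best[a][1]:
            best[a] = (b, score)
    return best


if __name__ == "__main__":
    print(top_match(["brain"], ["brain", "brian"]))
    print(top_match(["brain"], ["brain", "brian"], keep_exact=False))
```

With `keep_exact=True` the pair "brain"/"brain" is the top match; with `keep_exact=False` it is dropped and the closest non-identical string wins instead, which is the behaviour described earlier in the thread.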
PR done #6
@geekusa any updates?
merged and thank you!
Thank you!
Hi @geekusa,
I would like to know the reasoning behind removing exact matches (a similarity score of 1, or a distance of 0) from the output of the similarity command.
The relevant line can be found at https://github.com/geekusa/nlp-text-analytics/blob/master/bin/similarity.py#L215
If this is not intended, I have a pull request ready to remove it :)