geekusa / nlp-text-analytics

Removal of identical data in similarity command #5

Closed pa1007 closed 2 years ago

pa1007 commented 2 years ago

Hi @geekusa,

I would like to know the reasoning behind removing identical data (a score of 1 in similarity, or 0 in distance) from the output of the similarity command:

    if t != c:
        result = self.algo_select(t, c, transposition, set_algo, algo)
        if 'edit' in algo:
            compare_dict[t+'>'+c] = (
                result,
                self.distance_to_ratio(result, len(t), len(c))
            )
        else:
            compare_dict[t+'>'+c] = result

It can be found at https://github.com/geekusa/nlp-text-analytics/blob/master/bin/similarity.py#L215

If it is not intended I have a pull request ready to remove it :)
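
To show the effect of that guard in isolation, here is a minimal sketch. It is not the app's code: `compare_multivalue` and `similarity` are hypothetical names, and `difflib.SequenceMatcher` from the standard library stands in for whatever `algo_select` actually computes.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Stand-in scorer (assumption): difflib's ratio, 1.0 for identical strings.
    return SequenceMatcher(None, a, b).ratio()


def compare_multivalue(text_values, compare_values):
    """Compare every pair, skipping pairs whose two strings are identical."""
    compare_dict = {}
    for t in text_values:
        for c in compare_values:
            if t != c:  # same guard as in the snippet above
                compare_dict[t + '>' + c] = similarity(t, c)
    return compare_dict


results = compare_multivalue(["brain", "train"], ["brain", "drain"])
# "brain>brain" is never added, even though it would be the best match.
print(sorted(results))
```

With this guard in place, a value that appears in both lists never shows up paired with itself.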

geekusa commented 2 years ago

Hi @pa1007, a similarity score of 1 (or a distance of 0) means an exact match. The check prevents the command from returning a comparison of a value with itself (basically a duplicate).

pa1007 commented 2 years ago

Yes, but in some applications (for example, finding the nearest match to a string or a list of strings) we need the exact match if it exists, or otherwise the closest correspondence. It would be a great idea to let the user choose whether to remove exact matches or let them appear in the results.

geekusa commented 2 years ago

Sorry, I haven't been close to the code in a while. I'm pretty sure the command still returns exact matches, as I have seen it do that. But I think that code belongs to the multi-value compare, and I seem to remember that without it you get unwanted duplicates. Did you see duplicates in the Splunk interface when you removed it?

pa1007 commented 2 years ago

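Yep, it belongs to the multi-value compare.

For example, here we can see that "brain" compared to "brain" does not show in the results: [image: image] https://user-images.githubusercontent.com/29738353/186729563-0505a393-f11a-49c6-88b3-d87919cb4003.png

But if we remove the line shown in the original post, we can see that it shows the exact match: [image: image] https://user-images.githubusercontent.com/29738353/186729443-987c7e78-3e1d-494f-a6db-32e969b3c411.png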

So for my use of the command, I need exact matches when they exist, as I want to show the top similarity between two lists of strings.
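
The use case can be sketched roughly as follows. This is only an illustration under stated assumptions: `best_matches` and its `include_exact` flag are hypothetical names, not options of the similarity command, and `difflib.SequenceMatcher` again stands in for the app's scoring algorithms.

```python
from difflib import SequenceMatcher


def best_matches(left, right, include_exact=True):
    """For each string in `left`, return its highest-scoring match in `right`.

    include_exact (hypothetical flag): when False, identical pairs are
    skipped, which mirrors the `t != c` guard discussed in this issue.
    """
    best = {}
    for t in left:
        candidates = [
            (SequenceMatcher(None, t, c).ratio(), c)
            for c in right
            if include_exact or t != c
        ]
        if candidates:
            # max() compares scores first; equal scores fall back to string order.
            score, match = max(candidates)
            best[t] = (match, score)
    return best


with_exact = best_matches(["brain", "train"], ["brain", "drain"])
without_exact = best_matches(["brain", "train"], ["brain", "drain"],
                             include_exact=False)
print(with_exact["brain"])     # the exact match wins when it is allowed
print(without_exact["brain"])  # otherwise the closest other string is returned
```

When exact matches are filtered out, the top result for "brain" silently becomes a weaker match, which is misleading for a nearest-match lookup.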

geekusa commented 2 years ago

Very good. It is impossible to anticipate every way others may use these commands, so if you have a pull request that makes the exclusion optional, that works for me.

pa1007 commented 2 years ago

PR done #6

pa1007 commented 2 years ago

@geekusa any updates?

geekusa commented 2 years ago

Merged, and thank you!

pa1007 commented 2 years ago

Thank you!