Compare strings stripping accents/case sensitive

davidggphy commented 3 years ago

First of all, thanks for the library @gandersen101. I'm starting to use it and it's really powerful.

Using SpaczzRuler with fuzzy patterns, by default it compares strings in a case-insensitive way. Is there a way of changing this behaviour?

Similarly, is there a way of comparing strings without taking accents into account? That is, making "test" equivalent to "tést". It could be hacked by swapping the string for an accent-stripped version of it (since that maintains the token structure), but maybe there is an easier way.

import sys
import spacy
import spaczz
from spaczz.pipeline import SpaczzRuler

print(f"{sys.version = }")
print(f"{spacy.__version__ = }")
print(f"{spaczz.__version__ = }")

nlp = spacy.blank("en")

fuzzy_ruler = SpaczzRuler(nlp, name="test_ruler")
fuzzy_ruler.add_patterns(
    [{"label": "TEST", "pattern": "test", "type": "fuzzy"}]
)

doc = fuzzy_ruler(nlp("this is a test, also THIS IS A TEST, and a tast, we have a TesT, tést, tëst"))
print(f"\nText:\n{doc}\n")
print("Fuzzy Matches:")
for ent in doc.ents:
    if ent._.spaczz_type == "fuzzy":
        print((ent.text, ent.start, ent.end, ent.label_, ent._.spaczz_ratio))

Output

sys.version = '3.9.0 (default, Nov 15 2020, 06:25:35) \n[Clang 10.0.0 ]'
spacy.__version__ = '3.0.6'
spaczz.__version__ = '0.5.2'

Text:
this is a test, also THIS IS A TEST, and a tast, we have a TesT, tést, tëst

Fuzzy Matches:
('test', 3, 4, 'TEST', 100)
('TEST', 9, 10, 'TEST', 100)
('tast', 13, 14, 'TEST', 75)
('TesT', 18, 19, 'TEST', 100)
('tést', 20, 21, 'TEST', 75)
('tëst', 22, 23, 'TEST', 75)

gandersen101 commented 3 years ago

Hi @davidggphy, thanks for the kind words!

Making the fuzzy matching in spaczz case-sensitive is pretty straightforward. If you're using the SpaczzRuler you can control this at either the ruler level or the pattern level, as shown below:

Pattern-Level

import spacy
from spaczz.pipeline import SpaczzRuler

nlp = spacy.blank("en")
text = "testing, TESTING"
doc = nlp(text)

patterns = [
    {
        "label": "TEST",
        "pattern": "testing",
        "type": "fuzzy",
        "kwargs": {"ignore_case": False},  # boolean; the string "False" would be truthy
    },
    {
        "label": "TEST",
        "pattern": "TESTING",
        "type": "fuzzy",
        "kwargs": {"ignore_case": False},  # boolean; the string "False" would be truthy
    },
]

ruler = SpaczzRuler(nlp)
ruler.add_patterns(patterns)
doc = ruler(doc)

for ent in doc.ents:
    print((ent.text, ent.start, ent.end, ent.label_, ent._.spaczz_ratio))

('testing', 0, 1, 'TEST', 100)
('TESTING', 2, 3, 'TEST', 100)

Ruler-Level

import spacy
from spaczz.pipeline import SpaczzRuler

nlp = spacy.blank("en")
text = "testing, TESTING"
doc = nlp(text)

patterns = [
    {
        "label": "TEST",
        "pattern": "testing",
        "type": "fuzzy",
    },
    {
        "label": "TEST",
        "pattern": "TESTING",
        "type": "fuzzy",
    },
]

ruler = SpaczzRuler(nlp, fuzzy_defaults={"ignore_case": False})
ruler.add_patterns(patterns)
doc = ruler(doc)

for ent in doc.ents:
    print((ent.text, ent.start, ent.end, ent.label_, ent._.spaczz_ratio))

('testing', 0, 1, 'TEST', 100)
('TESTING', 2, 3, 'TEST', 100)

For handling accents I would recommend two approaches. One is to preprocess your text with a library like Textacy to strip out accents before running it through spaCy/spaczz. This changes the text itself. This option probably provides the most flexibility but adds another step. The other option is to use one of RapidFuzz's fuzzy matchers that preprocess text before fuzzy matching without changing the text itself. You can control this in spaczz at the pattern and/or ruler level just like the examples above.
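As a rough illustration of the first approach, here is a minimal sketch that strips accents using only Python's standard library instead of Textacy. The helper name `strip_accents` is hypothetical and not part of spaczz, spaCy, or Textacy. It assumes that "stripping accents" means dropping Unicode combining marks after canonical decomposition, which covers cases like "tést" and "tëst" but may not suit every language.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Return `text` with combining accent marks removed (hypothetical helper)."""
    # NFD decomposition splits "é" into "e" followed by U+0301 (combining acute).
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category "Mn") and keep everything else.
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Recompose whatever is left so the result is back in NFC form.
    return unicodedata.normalize("NFC", stripped)


text = "we have a TesT, tést, tëst"
print(strip_accents(text))
```

The stripped string would then be passed to `nlp(...)` in place of the original text, which means any entities found will carry the stripped text rather than the original.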

A word of warning though: the default RapidFuzz preprocessor "remov[es] all non alphanumeric characters - trim[s] whitespaces - convert[s] all characters to lower case" according to its docs. RapidFuzz supports customizing the preprocessing with a custom callable; however, spaczz does not currently support passing a custom callable to RapidFuzz. I can add this but it'll probably be later next week before I can get to that.

The following RapidFuzz matchers do preprocessing:

"quick" (essentially the same as the default matcher but does preprocessing)
"token_set"
"token_sort"
"partial_token_set"
"partial_token_sort"
"token"
"partial_token"
"weighted"

Here's an example of changing the fuzzy matcher on the ruler level:

import spacy
from spaczz.pipeline import SpaczzRuler

nlp = spacy.blank("en")
text = "testing, TESTING"
doc = nlp(text)

patterns = [
    {
        "label": "TEST",
        "pattern": "testing",
        "type": "fuzzy",
    },
    {
        "label": "TEST",
        "pattern": "TESTING",
        "type": "fuzzy",
    },
]

ruler = SpaczzRuler(nlp, fuzzy_defaults={"fuzzy_func": "quick"})
ruler.add_patterns(patterns)
doc = ruler(doc)

for ent in doc.ents:
    print((ent.text, ent.start, ent.end, ent.label_, ent._.spaczz_ratio))

('testing', 0, 1, 'TEST', 100)
('TESTING', 2, 3, 'TEST', 100)

Hopefully that helps!

gandersen101 commented 3 years ago

Hi @davidggphy, did the above adequately answer your question? If you still need/want a feature implemented please let me know and I can track that in this issue; otherwise I will close this issue in the next couple of days. Thanks!

davidggphy commented 3 years ago

Dear @gandersen101,

Sorry for my late reply. I tested what you suggested. Sadly, although RapidFuzz preprocesses the strings as you said, that preprocessing does not remove accents. It would be really useful to be able to pass a custom callable for preprocessing when computing the fuzzy scores.

As you said, the other possibility is to preprocess the text before sending it to the matcher, but then the entities found will contain the preprocessed text, which is something I would like to avoid. I could hack around this by later locating the same tokens in the original text, but that would be more cumbersome.
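For reference, the "find the same tokens in the original text" workaround might look something like the sketch below. It is standard-library only; both function names are hypothetical and none of this is spaczz API. It assumes accents are removed character by character, so every character of the stripped text can remember which original index it came from, and it assumes non-empty spans.

```python
import unicodedata


def strip_accents_with_offsets(text):
    """Return (stripped, offsets), where offsets[i] is the index in `text`
    of the character that produced stripped[i] (hypothetical helper)."""
    stripped_chars = []
    offsets = []
    for index, char in enumerate(text):
        # Decompose one character at a time so its pieces share one index.
        for piece in unicodedata.normalize("NFD", char):
            if unicodedata.category(piece) != "Mn":  # skip combining marks
                stripped_chars.append(piece)
                offsets.append(index)
    return "".join(stripped_chars), offsets


def original_span(text, offsets, start, end):
    """Map a non-empty [start, end) span of the stripped text back to `text`."""
    orig_start = offsets[start]
    orig_end = offsets[end - 1] + 1
    # Include trailing combining marks that were dropped from the stripped text.
    while orig_end < len(text) and unicodedata.category(text[orig_end]) == "Mn":
        orig_end += 1
    return text[orig_start:orig_end]


original = "a TesT, tést, tëst"
stripped, offsets = strip_accents_with_offsets(original)

# Pretend a matcher reported this character span on the stripped text.
start = stripped.index("test")
end = start + len("test")
print(original_span(original, offsets, start, end))
```

With spaCy the same idea could presumably be applied at the token level instead, since stripping accents should leave the tokenization unchanged in simple cases like this one, but I have not verified that in general.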

gandersen101 commented 3 years ago

Hey @davidggphy, thanks for the additional info. I am planning on doing a spaczz feature upgrade/overhaul in the near future and will keep the ability to add custom preprocessing without modifying the doc in mind. Thanks!
