explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Multiple INFIX inside span cannot be recognized #13498

Closed · nsch0e closed 4 months ago

nsch0e commented 4 months ago

A span containing more than one INFIX token will not be recognized.

[Screenshot from 2024-05-15 15-21-34]

Expected behaviour

All four date strings should be recognized as DATE spans.

How to reproduce the behaviour

This was run in a Jupyter notebook for display purposes.

import re

import spacy
from spacy.tokenizer import Tokenizer

# Four date strings; each full string should be labelled as a single DATE span.
records = [
    ("10/02/2015", {"spans": {"sc": [(0, 10, "DATE")]}}),
    ("10/02.2015", {"spans": {"sc": [(0, 10, "DATE")]}}),
    ("10/2015", {"spans": {"sc": [(0, 7, "DATE")]}}),
    ("10.2015", {"spans": {"sc": [(0, 7, "DATE")]}}),
]

# Treat "/" and "." as infixes, so "10/02/2015" is split into five tokens.
infix_re = re.compile(r"""[/\.]""")

def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer)

model = spacy.blank("en")  # language codes are lowercase
pipe = model.add_pipe("spancat")
pipe.add_label("DATE")

model.tokenizer = custom_tokenizer(model)

trainData = [spacy.training.Example.from_dict(model.make_doc(t), a) for t, a in records]

# initialize() returns the optimizer; begin_training() is a deprecated alias in spaCy v3.
optimizer = model.initialize(get_examples=lambda: trainData)
for i in range(1000):
    model.update(trainData, sgd=optimizer, drop=0.2)

for t, _ in records:
    spacy.displacy.render(model(t), style="span", jupyter=True)
    print(model.tokenizer.explain(t))


nsch0e commented 4 months ago

Sorry, this was a user error: the default suggester for the spancat component uses ngrams of sizes 1, 2 and 3. The bigger spans contain 5 tokens, so the suggester never suggested the complete span.
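To see why, count the tokens produced by the custom infix tokenizer. A minimal check, assuming the `model` from the reproduction above:

# With "/" and "." as infixes, "10/02/2015" splits into five tokens:
doc = model.make_doc("10/02/2015")
print([t.text for t in doc])  # ['10', '/', '02', '/', '2015']
print(len(doc))               # 5 -- larger than the default ngram sizes (1, 2, 3)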

Fixed by using this config:

config = {
    "suggester": {"@misc": "spacy.ngram_suggester.v1", "sizes": [1, 2, 3, 4, 5]}
}
pipe = model.add_pipe("spancat", config=config)
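An equivalent option is the range-based suggester, which takes a minimum and maximum span size instead of an explicit list. A small sketch with the same effect as the config above:

config = {
    "suggester": {
        "@misc": "spacy.ngram_range_suggester.v1",  # suggests all ngram sizes from min_size to max_size
        "min_size": 1,
        "max_size": 5,
    }
}
pipe = model.add_pipe("spancat", config=config)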
svlandeg commented 4 months ago

No problem, thanks for reporting back! 🙏

github-actions[bot] commented 3 months ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.