Extra tokens in spans created by coref pipeline

itssimon commented 1 year ago

Since the v0.6.1 release of spacy-experimental, the spans created by the coref pipeline incorrectly contain an extra token at the end:

>>> import spacy
>>> nlp = spacy.load("en_coreference_web_trf")
>>> doc = nlp("John Smith called from New York, he says it's raining in the city.")
>>> doc.spans
{'coref_clusters_1': [John Smith called, he says], 'coref_clusters_2': [New York,, the city.]}

The verbs "called" and "says", as well as the punctuation at the end of "New York," and "the city.", shouldn't be included in the spans.
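To make the expected result concrete, here is a minimal sketch that needs neither spaCy nor the model. It assumes the sentence tokenizes as listed in TOKENS and that each span is a (start, end) pair of token offsets with an exclusive end. The names TOKENS, REPORTED, span_texts and trim_last_token are made up for illustration and are not part of any library. Dropping the last token of every reported span gives the mentions I would expect:

```python
# Assumed tokenization of the example sentence (no model is loaded here).
TOKENS = ["John", "Smith", "called", "from", "New", "York", ",", "he", "says",
          "it", "'s", "raining", "in", "the", "city", "."]

# Hypothetical (start, end) token offsets, end exclusive, chosen to
# reproduce the spans shown in the output above.
REPORTED = {
    "coref_clusters_1": [(0, 3), (7, 9)],
    "coref_clusters_2": [(4, 7), (13, 16)],
}


def span_texts(spans, tokens=TOKENS):
    """Join the tokens covered by each (start, end) pair into readable text."""
    return {
        key: [" ".join(tokens[start:end]) for start, end in group]
        for key, group in spans.items()
    }


def trim_last_token(spans):
    """Drop one trailing token from every span, never emptying a span."""
    return {
        key: [(start, max(start + 1, end - 1)) for start, end in group]
        for key, group in spans.items()
    }


print(span_texts(REPORTED))                   # spans as reported
print(span_texts(trim_last_token(REPORTED)))  # spans with the extra token removed
```

This only illustrates the symptom. It is not meant as a workaround and says nothing about where the extra token actually comes from.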

I suspect this bug was introduced here: https://github.com/explosion/spacy-experimental/pull/27

Your Environment

Operating System: Linux
Python Version Used: 3.10.6
spaCy Version Used: 3.4.2
spacy-experimental Version Used: 0.6.1

polm commented 1 year ago

Thanks for reporting this. I think the actual problem was in the pretrained model: it didn't properly include the fix you linked, so it was closer to the 0.6.0 code than the 0.6.1 code.

I've updated the model in the spacy-experimental release (now a2 instead of a1) and verified the behavior locally. Could you confirm that this fixes the issue for you?
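If it helps when confirming, one way to see which versions are actually installed is to ask the standard library's importlib.metadata. This is only a sketch: the helper name installed_version is made up, and the distribution name "en-coreference-web-trf" is an assumption based on the model name used in the report, so adjust it if the package is registered under a different name.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string for dist_name, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# The last name is an assumed distribution name for the coref model.
for name in ("spacy", "spacy-experimental", "en-coreference-web-trf"):
    print(name, installed_version(name) or "not installed")
```

If the model still reports the old pre-release, reinstalling it from the updated release and re-running the example sentence should show whether the spans are now correct.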

itssimon commented 1 year ago

Yup, looks like that's done the trick. Thanks heaps for fixing it so quickly!

github-actions[bot] commented 1 year ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

explosion / spaCy

Extra tokens in spans created by coref pipeline #11759
