explosion / sense2vec

🦆 Contextually-keyed word vectors
https://explosion.ai/blog/sense2vec-reloaded
MIT License

Is there any way to use "doc.spans" in 01_parse.py? #142

Open nonstoprunning opened 3 years ago

nonstoprunning commented 3 years ago

Hi, I am trying to build a sense2vec model with new data. I have made a few changes in 01_parse.py. First, I removed the default ner pipe that comes with "en_core_web_lg". Then I added a new Language.component in which I identify Spans associated with new entities (new labels) in a doc. Sometimes I would like to assign a Span[x, y] to more than one entity, but I cannot. My question: I have read about the new changes in spaCy v3.1. Is there a way to use "doc.spans" (or something similar) in 01_parse.py so that spaCy's internal algorithms take overlapping Spans into account?

    from spacy.language import Language
    from spacy.tokens import Span

    # `matcher` is a spaCy Matcher or PhraseMatcher defined elsewhere (not shown)

    @Language.component("name_comp")
    def my_component(doc):
        matches = matcher(doc)
        seen_tokens = set()
        new_entities = []
        entities = doc.ents
        for match_id, start, end in matches:
            # check for end - 1 here because boundaries are inclusive
            if start not in seen_tokens and end - 1 not in seen_tokens:
                new_entities.append(Span(doc, start, end, label=match_id))
                # drop existing entities that overlap the new match
                entities = [
                    e for e in entities if not (e.start < end and e.end > start)
                ]
                seen_tokens.update(range(start, end))
        doc.ents = tuple(entities) + tuple(new_entities)
        return doc
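To illustrate the overlap case being asked about, here is a minimal sketch, assuming spaCy v3.1 or later, of how "doc.spans" can hold two labels over the same token range. The example text, the labels "ORG" and "BRAND", and the group key "sc" are all hypothetical. It only shows the container itself; whether 01_parse.py or the later sense2vec scripts can consume these span groups is exactly the open question.

```python
import spacy
from spacy.tokens import Span

# Blank English pipeline: tokenizer only, no trained model required
nlp = spacy.blank("en")
doc = nlp("Acme Corp released a new product")

# Two hypothetical labels over the same token range [0, 2)
span_a = Span(doc, 0, 2, label="ORG")
span_b = Span(doc, 0, 2, label="BRAND")

# doc.ents expects non-overlapping spans, so this assignment is
# expected to fail; the try/except records what actually happens
try:
    doc.ents = [span_a, span_b]
    ents_accepted = True
except ValueError:
    ents_accepted = False

# doc.spans stores named span groups, and overlapping spans are allowed.
# The key "sc" is an arbitrary name chosen for this example.
doc.spans["sc"] = [span_a, span_b]
labels = [(span.text, span.label_) for span in doc.spans["sc"]]
print(ents_accepted, labels)
```

Both spans stay available under doc.spans["sc"], each with its own label, while doc.ents stays limited to one label per token.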

Thanks in advance, Paula