explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

E018 (Can't retrieve string for hash) error when fetching span IDs from SpanRuler #12407

Closed by kdutia 1 year ago

kdutia commented 1 year ago

I got an E018 error when trying to retrieve Span.id_ for spans created by SpanRuler. It only happened for certain IDs. I worked around it after finding a similar issue related to ent_id_, but I didn't want to submit a PR because I wasn't sure whether my changes would have side effects.

The code to reproduce the error and my workaround are below. Happy to open a PR with guidance.

How to reproduce the behaviour

import spacy

# Create pipeline and add patterns
patterns = [{'label': 'EMPOWERMENT', 'id': 'empowerment of women', 'pattern': [{'LOWER': 'empowerment'}, {'LOWER': 'of'}, {'LOWER': 'women'}]}, {'label': 'EMPOWERMENT', 'id': 'female empowerment', 'pattern': [{'LOWER': 'female'}, {'LOWER': 'empowerment'}]}, {'label': 'VIOLENCE', 'id': 'gbv', 'pattern': [{'TEXT': 'GBV'}]}, {'label': 'VIOLENCE', 'id': 'gbvie', 'pattern': [{'TEXT': 'GBViE'}]}, {'label': 'EMPOWERMENT', 'id': 'GEM', 'pattern': [{'TEXT': 'GEM'}]}, {'label': 'GENDER', 'id': 'gender', 'pattern': [{'LEMMA': 'gender'}]}, {'label': 'WOMEN', 'id': 'female', 'pattern': [{'LEMMA': 'female'}]}, {'label': 'BIAS', 'id': 'gender bias', 'pattern': [{'LOWER': 'gender'}, {'LEMMA': 'bias'}]}, {'label': 'DEVELOPMENT', 'id': 'gender development index', 'pattern': [{'LOWER': 'gender'}, {'LOWER': 'development'}, {'LEMMA': 'index'}]}, {'label': 'DISCRIMINATION', 'id': 'gender discrimination', 'pattern': [{'LOWER': 'gender'}, {'LOWER': 'discrimination'}]}, {'label': 'EMPOWERMENT', 'id': 'gender empowerment measure', 'pattern': [{'LOWER': 'gender'}, {'LOWER': 'empowerment'}, {'LEMMA': 'measure'}]}, {'label': 'EQUALITY_EQUITY', 'id': 'gender equality', 'pattern': [{'LOWER': 'gender'}, {'LEMMA': 'equality'}]}, {'label': 'EQUALITY_EQUITY', 'id': 'gender equity', 'pattern': [{'LOWER': 'gender'}, {'LEMMA': 'equity'}]}, {'label': 'EQUALITY_EQUITY', 'id': 'gender gap', 'pattern': [{'LOWER': 'gender'}, {'IS_ASCII': True, 'OP': '?'}, {'LEMMA': 'gap'}]}, {'label': 'EQUALITY_EQUITY', 'id': 'gender inequality', 'pattern': [{'LOWER': 'gender'}, {'LEMMA': {'IN': ['inequality', 'inequity']}}]}, {'label': 'NORMS', 'id': 'gender norms', 'pattern': [{'LOWER': 'gender'}, {'LEMMA': 'norm'}]}, {'label': 'EQUALITY_EQUITY', 'id': 'gender parity', 'pattern': [{'LOWER': 'gender'}, {'LOWER': 'parity'}]}, {'label': 'ROLES', 'id': 'gender roles', 'pattern': [{'LOWER': 'gender'}, {'LEMMA': 'role'}]}, {'label': 'CONSTRAINTS', 'id': 'gender-based constraints', 'pattern': [{'LOWER': 'gender'}, {'IS_PUNCT': 
True, 'OP': '?'}, {'LOWER': 'based'}, {'LEMMA': 'constraint'}]}, {'label': 'VIOLENCE', 'id': 'gender based violence', 'pattern': [{'LOWER': 'gender'}, {'IS_PUNCT': True, 'OP': '?'}, {'LOWER': 'based'}, {'LEMMA': 'violence'}]}, {'label': 'IDENTITY', 'id': 'GID', 'pattern': [{'TEXT': 'GID'}]}, {'label': 'IDENTITY', 'id': 'gender identity', 'pattern': [{'LOWER': 'gender'}, {'LEMMA': 'identity'}]}, {'label': 'WOMEN', 'id': 'girls', 'pattern': [{'LEMMA': 'girl'}]}, {'label': 'MEN_PATRIARCHY', 'id': 'men', 'pattern': [{'LEMMA': {'IN': ['man', 'male']}}]}, {'label': 'MEN_PATRIARCHY', 'id': 'patriarchy', 'pattern': [{'LOWER': 'patriarchy'}]}, {'label': 'MEN_PATRIARCHY', 'id': 'patriarchal', 'pattern': [{'LOWER': 'patriarchal'}]}, {'label': 'REPRODUCTIVE_RIGHTS', 'id': 'reproductive rights', 'pattern': [{'LOWER': 'reproductive'}, {'LEMMA': 'right'}]}, {'label': 'REPRODUCTIVE_RIGHTS', 'id': 'reproductive health', 'pattern': [{'LOWER': 'reproductive'}, {'LOWER': 'health'}]}, {'label': 'REPRODUCTIVE_RIGHTS', 'id': 'sexual health', 'pattern': [{'LOWER': 'sexual'}, {'LOWER': 'health'}]}, {'label': 'REPRODUCTIVE_RIGHTS', 'id': 'srhr', 'pattern': [{'TEXT': 'SRHR'}]}, {'label': 'TRANSGENDER', 'id': 'transgender', 'pattern': [{'LOWER': 'transgender'}]}, {'label': 'DEVELOPMENT', 'id': 'WID', 'pattern': [{'TEXT': 'WID'}]}, {'label': 'WOMEN', 'id': 'women', 'pattern': [{'LEMMA': 'woman'}]}, {'label': 'DEVELOPMENT', 'id': 'women in development', 'pattern': [{'LOWER': 'women'}, {'LOWER': 'in'}, {'LOWER': 'development'}]}, {'label': 'EMPOWERMENT', 'id': "women's empowerment", 'pattern': [{'LOWER': 'women'}, {'ORTH': 's', 'OP': '?'}, {'LOWER': 'empowerment'}]}, {'label': 'EMPOWERMENT', 'id': "women's rights", 'pattern': [{'LOWER': 'women'}, {'ORTH': 's', 'OP': '?'}, {'LOWER': 'rights'}]}, {'label': 'EMPOWERMENT', 'id': 'rights of women', 'pattern': [{'LEMMA': 'right'}, {'LOWER': 'of'}, {'LEMMA': 'woman'}]}]

nlp = spacy.load("en_core_web_sm")
nlp.select_pipes(disable=["tok2vec","ner",])
ruler = nlp.add_pipe("span_ruler", config={"validate": True})
ruler.add_patterns(patterns) 

# Run text through pipeline and fail to get ID
text = """Women face additional health problems arising mainly from their reproductive
role. Inadequate access to reproductive health facilities and malnutrition are
the major factors for high maternal mortality, a rate currently estimated at 18
per 1000, three times higher than the average of 6 per 1000 for Sub-Saharan
Africa. Family planning is not widespread as reflected in the contraceptive
prevalence rate of only 6%."""

doc = nlp(text)
for span in doc.spans:
    print(span.id_)

The code above fails on the span 'reproductive health' with the error KeyError: "[E018] Can't retrieve string for hash '6675234125774895842'. This usually refers to an issue with the `Vocab` or `StringStore`."

Fix for behaviour

To fix this in my code, I added a step to add all span IDs to nlp.vocab.strings when adding the ruler to the pipeline. This was based on what I found in this issue and this part of doc.pyx.

I didn't want to open a PR in doc.pyx as I wasn't sure what the side effects of this would be.

nlp = spacy.load("en_core_web_sm")
nlp.select_pipes(disable=["tok2vec","ner",])
ruler = nlp.add_pipe("span_ruler", config={"validate": True})
ruler.add_patterns(patterns) 

# Added this part as suggested by the linked issue above
for span_id in ruler.ids:
    nlp.vocab.strings.add(span_id)


adrianeboyd commented 1 year ago

Thanks for the report, that does sound like a bug.

I couldn't replicate this error with your example, so I looked through most of the relevant places in the code where span IDs are added. I found one bug in Doc.char_span where the span ID isn't added correctly to the string store, but I couldn't find anywhere in the span ruler to fix this particular kind of bug.

Can you double-check that it was this exact example that led to this error? (In particular, for span in doc.spans is going to return the doc.spans keys, not individual spans, so maybe some part of the example didn't get copied correctly above?)

Does this simplified version of your example lead to the same error on your end?

import spacy

# Create pipeline and add patterns
patterns = [{'label': 'REPRODUCTIVE_RIGHTS', 'id': 'reproductive health', 'pattern': [{'LOWER': 'reproductive'}, {'LOWER': 'health'}]}]

nlp = spacy.blank("en")
ruler = nlp.add_pipe("span_ruler", config={"validate": True})
ruler.add_patterns(patterns)

# Run text through pipeline and fail to get ID
text = """Women face additional health problems arising mainly from their reproductive
role. Inadequate access to reproductive health facilities and malnutrition are
the major factors for high maternal mortality, a rate currently estimated at 18
per 1000, three times higher than the average of 6 per 1000 for Sub-Saharan
Africa. Family planning is not widespread as reflected in the contraceptive
prevalence rate of only 6%."""

doc = nlp(text)
for span in doc.spans["ruler"]:
    print(span.id_)

adrianeboyd commented 1 year ago

As a note, the bug I thought I found in Doc.char_span turned out not to be a bug (see #12429).

kdutia commented 1 year ago

Thanks @adrianeboyd, the example you posted works fine, but my code still doesn't. I'm trying to work out which of the simplifications I made when reducing my real code to the example above is making the issue disappear.

Could it be anything to do with the fact that I'm using nlp.pipe across multiple processes, or using the as_tuples kwarg?

nlp.pipe(
    text_tuples, as_tuples=True, n_process=6, batch_size=1024
)

I'll have a more detailed look again tomorrow.

adrianeboyd commented 1 year ago

Ah, multiprocessing is the right clue. Now I see where the bug is, and it's not related to the span ruler.

kdutia commented 1 year ago

Thanks! What is it about spaCy's behaviour that makes this issue show up only when using multiprocessing? (I looked in the docs but couldn't find any details.) Is there anything to consider when developing custom pipeline components?

adrianeboyd commented 1 year ago

It's not only in multiprocessing, but multiprocessing uses Doc.to_bytes/from_bytes to send the doc data between processes. Usually this isn't something you need to be aware of.
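
To make the mechanism easier to picture, here is a toy model in plain Python. It is not spaCy's implementation: `ToyStringStore`, `toy_hash`, `worker_store` and `parent_store` are hypothetical names, and the sketch assumes that the serialized payload carries only the hash of the span ID, so the receiving side has to resolve it against its own string table.

```python
import hashlib
import pickle


def toy_hash(text: str) -> int:
    """Deterministic 64-bit hash, standing in for a real string hash."""
    digest = hashlib.sha256(text.encode("utf8")).digest()
    return int.from_bytes(digest[:8], "little")


class ToyStringStore:
    """Maps hashes back to strings; it only knows strings that were added."""

    def __init__(self):
        self._by_hash = {}

    def add(self, text: str) -> int:
        key = toy_hash(text)
        self._by_hash[key] = text
        return key

    def __getitem__(self, key: int) -> str:
        if key not in self._by_hash:
            raise KeyError(f"Can't retrieve string for hash '{key}'")
        return self._by_hash[key]


# "Worker" side: the span ID string is added when the match is created.
worker_store = ToyStringStore()
span_id_hash = worker_store.add("reproductive health")
payload = pickle.dumps({"span_id": span_id_hash})  # assumption: only the hash travels

# "Parent" side: a separate store that has never seen the string.
parent_store = ToyStringStore()
received = pickle.loads(payload)

try:
    parent_store[received["span_id"]]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# Registering the ID up front makes the same hash resolvable.
parent_store.add("reproductive health")
resolved = parent_store[received["span_id"]]
```

In this sketch the first lookup fails because the parent store never saw the string, and it succeeds once the string has been added, which is the same shape as the E018 symptom and the string store workaround discussed in this thread.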

If you need a workaround with v3.5.1, you can explicitly add the IDs to the string store with nlp.vocab.strings.add(string). If you add them once, they'll be saved with nlp.to_disk when you save the pipeline and always be available in the future for this particular pipeline with span ruler.

The other main thing to be aware of with multiprocessing is custom extensions: you should re-add any custom extensions in __call__ and pipe (if pipe is implemented), because they won't be part of the global context when multiprocessing uses spawn.
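
A minimal sketch of that pattern, assuming a hypothetical component class `WordCounter` and a hypothetical extension attribute `n_words` (neither comes from spaCy or from this thread). The component checks for the extension each time it runs, so it is also registered in a freshly spawned process. For brevity the component is called directly rather than registered as a pipeline factory.

```python
import spacy
from spacy.tokens import Doc


class WordCounter:
    """Hypothetical component that stores the token count on doc._.n_words."""

    def _ensure_extension(self):
        # Register the extension if this process has not seen it yet.
        if not Doc.has_extension("n_words"):
            Doc.set_extension("n_words", default=0)

    def __call__(self, doc):
        self._ensure_extension()
        doc._.n_words = len(doc)
        return doc

    def pipe(self, stream, batch_size=128):
        # batch_size is accepted for API compatibility but unused here.
        self._ensure_extension()
        for doc in stream:
            yield self(doc)


nlp = spacy.blank("en")
counter = WordCounter()
doc = counter(nlp("Family planning is not widespread."))
docs = list(counter.pipe(nlp.pipe(["Women face additional health problems."])))
```

The `Doc.has_extension` check keeps the registration idempotent, so calling the component repeatedly in the same process does not raise an error about the extension already existing.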
