Closed kdutia closed 1 year ago
Thanks for the report, that does sound like a bug.
I couldn't replicate this error with your example, so I looked through most of the relevant places in the code where span IDs are added. I found one bug in Doc.char_span where the span ID isn't added correctly to the string store, but I couldn't find anything in the span ruler that would cause this particular kind of bug.
Can you double-check that it was this exact example that led to this error? (In particular, for span in doc.spans is going to return the doc.spans keys, not individual spans, so maybe some part of the example didn't get copied correctly above?)
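To illustrate the point about iteration: doc.spans is a dict-like container, so iterating over it yields the group names, while iterating over a specific group yields the spans. A minimal sketch (the "ruler" key and example text are just placeholders):

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("reproductive health matters")
# Assign a span group manually for demonstration
doc.spans["ruler"] = [doc[0:2]]

# Iterating over doc.spans yields the group names (keys), not Span objects
keys = list(doc.spans)
# Iterating over a specific group yields the Span objects themselves
spans = list(doc.spans["ruler"])
```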
Does this simplified version of your example lead to the same error on your end?
```python
import spacy

# Create pipeline and add patterns
patterns = [{'label': 'REPRODUCTIVE_RIGHTS', 'id': 'reproductive health', 'pattern': [{'LOWER': 'reproductive'}, {'LOWER': 'health'}]}]
nlp = spacy.blank("en")
ruler = nlp.add_pipe("span_ruler", config={"validate": True})
ruler.add_patterns(patterns)

# Run text through pipeline and fail to get ID
text = """Women face additional health problems arising mainly from their reproductive
role. Inadequate access to reproductive health facilities and malnutrition are
the major factors for high maternal mortality, a rate currently estimated at 18
per 1000, three times higher than the average of 6 per 1000 for Sub-Saharan
Africa. Family planning is not widespread as reflected in the contraceptive
prevalence rate of only 6%."""
doc = nlp(text)
for span in doc.spans["ruler"]:
    print(span.id_)
```
As a note, the bug I thought I found in Doc.char_span turned out not to be a bug (see #12429).
Thanks @adrianeboyd – the example you posted works fine, but my code still doesn't. I'm just trying to figure out what simplifications I've made to go from this code to my code example that are causing the issue to not appear.
Could it be anything to do with the fact that I'm using nlp.pipe across multiple processes, or using the as_tuples kwarg?

```python
nlp.pipe(
    text_tuples, as_tuples=True, n_process=6, batch_size=1024
)
```
I'll have a more detailed look again tomorrow.
Ah, multiprocessing is the right clue, now I see where the bug is. It's not related to span ruler.
Thanks! What's the behaviour of spaCy which means this issue is only faced when using multiprocessing? (I looked in the docs but couldn't find any details). Is there anything to consider when developing custom pipeline components?
It's not only in multiprocessing, but multiprocessing uses Doc.to_bytes/from_bytes to send the doc data between processes. Usually this isn't something you need to be aware of.
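For readers unfamiliar with the mechanism mentioned here, this is a minimal sketch of the serialization roundtrip that multiprocessing performs under the hood (the example text is arbitrary):

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
doc = nlp("reproductive health")

# With n_process > 1, docs travel between processes as bytes; the receiving
# side reconstructs each Doc from the payload plus the vocab it has locally,
# so any string not interned on that side cannot be resolved from its hash
data = doc.to_bytes()
doc2 = Doc(nlp.vocab).from_bytes(data)
```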
If you need a workaround with v3.5.1, you can explicitly add the IDs to the string store with nlp.vocab.strings.add(string). If you add them once, they'll be saved with nlp.to_disk when you save the pipeline and will always be available in the future for this particular pipeline with span ruler.
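A minimal sketch of this workaround, reusing the pattern from the example above (whether the interning is still needed depends on your spaCy version; this assumes v3.5.1 as stated):

```python
import spacy

patterns = [{'label': 'REPRODUCTIVE_RIGHTS', 'id': 'reproductive health',
             'pattern': [{'LOWER': 'reproductive'}, {'LOWER': 'health'}]}]

nlp = spacy.blank("en")
ruler = nlp.add_pipe("span_ruler")
ruler.add_patterns(patterns)

# Workaround: intern every pattern ID explicitly so Span.id_ can always
# resolve the hash back to a string, even after to_bytes/from_bytes
for pattern in patterns:
    nlp.vocab.strings.add(pattern["id"])
```

After this, saving with nlp.to_disk persists the interned strings along with the rest of the pipeline.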
The other main thing to be aware of related to multiprocessing is that if you are using custom extensions, you should re-add any custom extensions in __call__ and pipe (if pipe is implemented), because they won't be part of the global context with multiprocessing with spawn.
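A sketch of that pattern for a custom component (the component name "my_component" and the attribute "my_attr" are hypothetical, not from spaCy):

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc


class MyComponent:
    def __init__(self):
        self._set_extensions()

    def _set_extensions(self):
        # Guarded so it is safe to call repeatedly
        if not Doc.has_extension("my_attr"):
            Doc.set_extension("my_attr", default=None)

    def __call__(self, doc):
        # Re-add extensions here too: with the "spawn" start method, a child
        # process won't inherit the parent's global extension registrations
        self._set_extensions()
        doc._.my_attr = "value"
        return doc


@Language.factory("my_component")  # hypothetical factory name
def create_my_component(nlp, name):
    return MyComponent()


nlp = spacy.blank("en")
nlp.add_pipe("my_component")
doc = nlp("some text")
```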
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
I got an E018 error when trying to retrieve Span.id_ when using SpanRuler. This only happened for certain IDs. I managed to fix it by finding a similar issue related to ent_id_, but I didn't want to submit a PR as I wasn't sure whether there would be side-effects of my changes. Code to reproduce and what I did to fix it is below. Happy to open a PR with guidance.
How to reproduce the behaviour
The code above will fail on span 'reproductive health' with the error:

```
*** KeyError: "[E018] Can't retrieve string for hash '6675234125774895842'. This usually refers to an issue with the `Vocab` or `StringStore`."
```
Fix for behaviour
To fix this in my code, I added a step to add all span IDs to nlp.vocab.strings when adding the ruler to the pipeline. This was based on what I found in this issue and this part of doc.pyx. I didn't want to open a PR in doc.pyx as I wasn't sure what the side effects of this would be.

Your Environment