Closed ANenashev closed 3 years ago
The workaround is to annotate your training corpus with sentencizer in advance:
import spacy
from spacy.tokens import DocBin
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
docs = DocBin().from_disk("train.spacy").get_docs(nlp.vocab)
docs = nlp.get_pipe("sentencizer").pipe(docs)
new_db = DocBin(docs=docs)
new_db.to_disk("train_with_sents.spacy")
It is a known issue that pipelines where one component depends on the annotation from an earlier component aren't supported at all in the current training setup. We're planning to add a [training] option to support this in v3.1.
In the provided pipelines we use strided_spans instead of sent_spans, so sentence annotation isn't required during training. That could also be an alternative.
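To illustrate why strided spans avoid the dependency on sentence boundaries, here is a minimal sketch of the windowing idea. The helper `strided_spans` below is hypothetical and written only for this example; it is not the spacy-transformers implementation, and the `window` and `stride` values are arbitrary. It only slices the doc by token offsets, so no `SENT_START` annotation is needed.

```python
import spacy


def strided_spans(doc, window, stride):
    """Illustrative helper: cut a Doc into overlapping token windows.

    Not the spacy-transformers implementation, just the general idea.
    """
    spans = []
    start = 0
    while start < len(doc):
        spans.append(doc[start:start + window])
        # Stop once the current window reaches the end of the doc.
        if start + window >= len(doc):
            break
        start += stride
    return spans


nlp = spacy.blank("en")  # tokenizer only, no sentence boundaries are set
doc = nlp("one two three four five six seven eight nine ten")
spans = strided_spans(doc, window=4, stride=3)
for span in spans:
    print(span.text)
```

In a training config, the equivalent choice would be to use the spacy-transformers.strided_spans.v1 span getter in place of spacy-transformers.sent_spans.v1.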
Thank you for the quick reply!
Unfortunately, this method is not working for me. I see that sentence boundaries are set in the reference doc of the example instance but are missing in the predicted doc. I'll try strided spans for now.
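For anyone who wants to check this on their own data, the following is a small diagnostic sketch, assuming spaCy v3. It writes a tiny sentencizer-annotated corpus to a temporary directory (the text and file name are made up for the example), reads it back through `Corpus`, and compares the two sides of the resulting `Example`. The expectation is that the reference doc keeps the stored boundaries while the predicted doc is freshly tokenized and has none.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import DocBin
from spacy.training import Corpus

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

with tempfile.TemporaryDirectory() as tmp_dir:
    # Made-up corpus file, annotated with sentence boundaries in advance.
    path = Path(tmp_dir) / "train_with_sents.spacy"
    DocBin(docs=[nlp("First sentence. Second sentence.")]).to_disk(path)
    # Corpus builds the predicted doc with nlp.make_doc, i.e. tokenization only.
    examples = list(Corpus(path)(nlp))

example = examples[0]
reference_has_sents = example.reference.has_annotation("SENT_START")
predicted_has_sents = example.predicted.has_annotation("SENT_START")
print("reference:", reference_has_sents, "predicted:", predicted_has_sents)
```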
Ah, sorry, you're right. You'd have to customize the corpus reader instead so that it adds the sentence boundaries to example.predicted in the examples.
(Edited: I should say more clearly I hope that would work, but I haven't tested it.)
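A customized reader along these lines might look like the sketch below. This is an assumption about one way to do it, not the code used in this thread: the registry name `corpus_with_sents.v1` and the function name are hypothetical, and the reader simply wraps the built-in `Corpus` and runs a default `Sentencizer` over each `example.predicted` before yielding it.

```python
from pathlib import Path
from typing import Callable, Iterator

import spacy
from spacy.language import Language
from spacy.pipeline import Sentencizer
from spacy.training import Corpus, Example


# "corpus_with_sents.v1" is a hypothetical name chosen for this sketch.
@spacy.registry.readers("corpus_with_sents.v1")
def create_corpus_with_sents(path: Path) -> Callable[[Language], Iterator[Example]]:
    corpus = Corpus(path)
    sentencizer = Sentencizer()  # default punctuation-based rules

    def read_examples(nlp: Language) -> Iterator[Example]:
        for example in corpus(nlp):
            # The sentencizer annotates the predicted Doc in place.
            sentencizer(example.predicted)
            yield example

    return read_examples
```

To use it, the [corpora.train] block of the config would reference the registered name through @readers with a path setting, and the file containing the function would be passed to spacy train with -c so the registration runs before the config is resolved.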
The customized corpus reader works for me. Thank you, @adrianeboyd!
How to reproduce the behaviour
I'm trying to train a custom text classifier on top of BERT embeddings. I use spacy-transformers.sent_spans.v1, which requires sentence boundaries to be set. I added sentencizer to the beginning of the pipeline. I ran
python -m spacy train training/config.cfg --output en_clf -c ./bert_clauses_classifier/clf_pipe.py -V
with the following config:
I'm getting the following error:
I noticed that sentencizer is not called here on the example's predicted instance because it doesn't have an update method.
Please advise a workaround for this issue.
Your Environment
Info about spaCy