Closed F0rge1cE closed 5 years ago
This looks like a bug. I'll take a look.
For a temporary workaround that uses -c ner, see: https://stackoverflow.com/q/57551479/461847
(The OCR errors are probably going to cause a lot of problems and I suspect that the lack of sentence/document boundaries is going to cause problems during training. Even if automatic sentence boundaries aren't perfect, it would be better to divide the data into sentence-ish segments for shuffling/batching during training.)
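A rough sketch of what dividing the data into sentence-ish segments could look like before conversion. It assumes each input line is "token tag" separated by whitespace and that the tags follow a B-/I-/O scheme. The helper name segment_iob and the max_len cutoff are made up for illustration and are not part of spaCy.

```python
import re

# Assumption: a token consisting of a single ".", "!" or "?" ends a sentence.
SENT_END = re.compile(r"^[.!?]$")


def segment_iob(lines, max_len=50):
    """Group "token tag" lines into sentence-ish segments.

    A segment is closed after sentence-final punctuation, at an existing
    blank line, or once it reaches max_len tokens and the current tag is
    "O" (so a length-based cut is never made inside an entity span).
    """
    segments, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            # Keep boundaries that are already present in the data.
            if current:
                segments.append(current)
                current = []
            continue
        parts = line.split()
        token, tag = parts[0], parts[-1]
        current.append(line)
        too_long = len(current) >= max_len and tag == "O"
        if SENT_END.match(token) or too_long:
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments


example = ["Das O", "ist O", "Wien B-LOC", ". O", "Hallo O", "Welt O"]
segments = segment_iob(example)
short_segments = segment_iob(example, max_len=2)
```

Writing the segments back out with a blank line between them gives the converter something to treat as a boundary. With OCR'd text the punctuation itself may be wrong, so the length cutoff is only a fallback and not real sentence detection.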
How to reproduce the behaviour
I tried to run
python -m spacy convert <path> ./ -t json -c iob
with this labeled NER training corpus: https://github.com/EuropeanaNewspapers/ner-corpora/blob/master/enp_DE.onb.bio/enp_DE.onb.bio
After the conversion, you get a JSON file with content:
This behavior is really weird. It seems that every single token is treated as a separate sentence and a separate paragraph.
Another issue has already been reported in #4111: the conversion raises an exception when a token contains any non-word character (anything matched by "[^\w-]" in the regex).
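To see which tokens would be affected, here is a small check using the same character class. It only shows what "[^\w-]" matches in Python's re module. How the converter applies the pattern internally is not shown here, and the sample tokens are made up.

```python
import re

# Same character class as quoted above: anything that is neither a word
# character (letters, digits, underscore) nor a hyphen.
NON_WORD = re.compile(r"[^\w-]")


def has_non_word_char(token):
    """Return True if the token contains a character matched by [^\\w-]."""
    return NON_WORD.search(token) is not None


# Made-up sample tokens of the kind found in OCR'd newspaper text.
tokens = ["Wien", "k.k.", "Oester-reich", "1875", "(sic)"]
flagged = [t for t in tokens if has_non_word_char(t)]
print(flagged)
```

Plain words, numbers and hyphenated tokens pass, while anything with a period, bracket or other punctuation is flagged, which covers a lot of tokens in a corpus like this one.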
Your Environment