erip closed this issue 4 years ago
Strangely, in this example offsets_from_biluo_tags(doc, tags[0:len(doc)]) == [], so there goes my theory. 😄
Oh, duh. These are BIO and not BILOU, so it's not surprising that it isn't working. 😄 Any chance you'd support an offsets_from_bio_tags API? I'd be happy to submit a PR.
spacy.gold.iob_to_biluo(tags) will convert IOB to BILUO tags:
print(spacy.gold.iob_to_biluo(tags))
# ['O', 'O', 'O', 'O', 'O', 'U-PER', 'O', 'O', 'O', 'B-NOH', 'L-NOH', 'O', 'B-NOH', 'L-NOH', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
(You probably still have some tokenization alignment issues to handle, though!)
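For anyone curious what that conversion amounts to, here is a minimal pure-Python sketch. It is not spaCy's implementation: `iob_to_biluo_sketch` is a hypothetical name, and it assumes well-formed IOB input where every entity starts with a `B-` tag.

```python
def iob_to_biluo_sketch(tags):
    """Convert IOB (B-/I-/O) tags to BILUO tags. Illustrative only."""
    biluo = []
    for i, tag in enumerate(tags):
        if tag == "O":
            biluo.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is I- with the same label.
        continues = next_tag == "I-" + label
        if prefix == "B":
            # A B- tag with no continuation is a single-token entity (U-).
            biluo.append(("B-" if continues else "U-") + label)
        elif prefix == "I":
            # An I- tag with no continuation is the last token of the entity (L-).
            biluo.append(("I-" if continues else "L-") + label)
        else:
            raise ValueError(f"Unexpected tag: {tag!r}")
    return biluo


print(iob_to_biluo_sketch(["O", "B-PER", "O", "B-NOH", "I-NOH", "O"]))
# ['O', 'U-PER', 'O', 'B-NOH', 'L-NOH', 'O']
```

The only real work is looking one tag ahead: that is what tells a single-token entity (U-) apart from the start of a longer one (B-), and a middle token (I-) apart from a final one (L-).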
@adrianeboyd thanks for this - somehow missed that one...
wrt tokenization alignment, I suspect the way to do this is via implementing tokenization exceptions?
I don't know much about Korean tokenization. I think the default tokenizer is using mecab with a Korean dictionary, so it could depend a bit on how many cases don't align between your data and the default tokenization. If there are a lot of differences, then you might want to customize the dictionary. If it's only a few cases where you need to adjust the tokenization, then tokenizer exceptions are a good option, too.
If you're training a new model, then spacy has some ability to align mismatched tokens between the tokenizer and the training data and learn as much as it can from the good alignments. The parser can also learn to merge subtokens if the tokenizer oversplits, but it's still a bit experimental.
I've been working on training a model for Chinese, where there are a lot of misalignments between the default tokenizer (jieba) and the training data (OntoNotes). The results are not great so far, though, either with jieba (lots of misalignments) or trying to learn how to merge subtokens after splitting each text into individual characters, where the subtok learning/merging isn't performing as well as intended yet.
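To get a rough idea of how many tokens misalign between a tokenizer and the training data, something like the following pure-Python sketch may help. It is not how spaCy aligns tokens internally. The function names are hypothetical, and it assumes both tokenizations cover exactly the same text once whitespace is dropped.

```python
def char_spans(tokens):
    """Return the (start, end) span of each token in the whitespace-free text."""
    spans, pos = [], 0
    for tok in tokens:
        spans.append((pos, pos + len(tok)))
        pos += len(tok)
    return spans


def one_to_one_alignment(tokens_a, tokens_b):
    """For each token in tokens_a, give the index of the token in tokens_b
    with the identical character span, or None if there is no exact match."""
    if "".join(tokens_a) != "".join(tokens_b):
        raise ValueError("Tokenizations do not cover the same text")
    b_index = {span: j for j, span in enumerate(char_spans(tokens_b))}
    return [b_index.get(span) for span in char_spans(tokens_a)]


gold = ["New", "York", "-based", "firm"]
predicted = ["New", "York", "-", "based", "firm"]
print(one_to_one_alignment(gold, predicted))
# [0, 1, None, 4]
```

Counting the `None` entries over a corpus gives a quick estimate of whether a handful of tokenizer exceptions would be enough or whether the dictionary itself needs customizing.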
It seems like if len(doc) != len(tags), offsets_from_biluo_tags(doc, tags) will not return the appropriate entities list. The data in the example come from sentence 1 here.
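One way to surface this kind of mismatch early is to check the lengths before converting tags to offsets. The sketch below is a standalone illustration, not spaCy's implementation: `biluo_to_offsets_sketch` is a hypothetical name, tokens are represented as plain `(start_char, end_char)` pairs rather than a `Doc`, and the input is assumed to be well-formed BILUO.

```python
def biluo_to_offsets_sketch(token_spans, tags):
    """Convert BILUO tags to (start_char, end_char, label) entity offsets.

    token_spans: list of (start_char, end_char) pairs, one per token.
    tags: list of BILUO tags, one per token.
    """
    if len(token_spans) != len(tags):
        raise ValueError(
            f"Got {len(token_spans)} tokens but {len(tags)} tags; "
            "the tags must match the tokenization one-to-one"
        )
    entities, start = [], None
    for (tok_start, tok_end), tag in zip(token_spans, tags):
        if tag == "O":
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "U":
            # Single-token entity.
            entities.append((tok_start, tok_end, label))
        elif prefix == "B":
            # Remember where a multi-token entity begins.
            start = tok_start
        elif prefix == "L" and start is not None:
            # Close the entity opened by the most recent B- tag.
            entities.append((start, tok_end, label))
            start = None
        # I- tags need no action: the entity simply continues.
    return entities


# "Anna visited New York"
spans = [(0, 4), (5, 12), (13, 16), (17, 21)]
print(biluo_to_offsets_sketch(spans, ["U-PER", "O", "B-LOC", "L-LOC"]))
# [(0, 4, 'PER'), (13, 21, 'LOC')]
```

With an explicit check, a tokenization mismatch fails loudly instead of quietly producing an empty or wrong entity list.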
How to reproduce the behaviour
Your Environment
Info about spaCy