explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

offsets_from_biluo_tags ignores tags in Korean #4578

Closed · erip closed this 4 years ago

erip commented 4 years ago

It seems that if len(doc) != len(tags), offsets_from_biluo_tags(doc, tags) will not return the expected entities list.

The data in the example come from sentence 1 here.

How to reproduce the behaviour

>>> from spacy.lang.ko import Korean
>>> from spacy.gold import offsets_from_biluo_tags
>>> nlp = Korean()
>>> text = "이어 옆으로 움직여 김일성의 오른쪽에서 한 차례씩 두 번  상체를 굽혀 조문했으며 이윽고 안경을 벗고 손수건으로 눈주위를 닦기도 했다."
>>> doc = nlp(text)
>>> tags = ['O', 'O', 'O', 'O', 'O', 'B-PER', 'O', 'O', 'O', 'B-NOH', 'I-NOH', 'O', 'B-NOH', 'I-NOH', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
>>> entities = offsets_from_biluo_tags(doc, tags)
>>> len(doc), len(tags)
(35, 36)
>>> entities
[]
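
For reference, a quick way to see where the tag sequence and spaCy's tokenization diverge (a diagnostic sketch, reusing the doc and tags objects from the snippet above):

# Print each token next to its tag; any tags left over after the doc runs out
# are the ones that never line up with a token.
for i, (token, tag) in enumerate(zip(doc, tags)):
    print(i, repr(token.text), tag)
print("unmatched tags:", tags[len(doc):])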


erip commented 4 years ago

Strangely, in this example offsets_from_biluo_tags(doc, tags[0:len(doc)]) == [], so there goes my theory. 😄

erip commented 4 years ago

Oh, duh. These are BIO and not BILUO, so it's not surprising that it isn't working. 😄 Any chance you'd support an offsets_from_bio_tags API? I'd be happy to submit a PR.
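
A quick check of the tag prefixes confirms the scheme (a small sketch using the tags list from the snippet above):

# Only B, I, and O prefixes appear; no L or U, so this is BIO/IOB rather than BILUO.
print({tag.split("-")[0] for tag in tags})  # {'B', 'I', 'O'}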

adrianeboyd commented 4 years ago

spacy.gold.iob_to_biluo(tags) will convert IOB to BILUO tags:

import spacy  # needed for the spacy.gold namespace

print(spacy.gold.iob_to_biluo(tags))
# ['O', 'O', 'O', 'O', 'O', 'U-PER', 'O', 'O', 'O', 'B-NOH', 'L-NOH', 'O', 'B-NOH', 'L-NOH', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']

(You probably still have some tokenization alignment issues to handle, though!)
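
Putting the two steps together would look roughly like this (a minimal sketch against the spaCy v2.x API, reusing doc and tags from the report above; it only yields correct offsets once the tags and the doc's tokenization actually line up):

from spacy.gold import iob_to_biluo, offsets_from_biluo_tags

# Convert the BIO/IOB tags to BILUO, then recover (start_char, end_char, label)
# offsets; offsets_from_biluo_tags expects exactly one tag per token in doc.
biluo_tags = iob_to_biluo(tags)
entities = offsets_from_biluo_tags(doc, biluo_tags)
print(entities)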

erip commented 4 years ago

@adrianeboyd thanks for this - somehow missed that one...

Regarding tokenization alignment, I suspect the way to handle this is by implementing tokenizer exceptions?

adrianeboyd commented 4 years ago

I don't know much about Korean tokenization. I think the default tokenizer uses mecab with a Korean dictionary, so it could depend a bit on how many cases don't align between your data and the default tokenization. If there are a lot of differences, then you might want to customize the dictionary. If it's only a few cases where you need to adjust the tokenization, then tokenizer exceptions are a good option, too.
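
For the "few cases" route, tokenizer exceptions on spaCy's default rule-based tokenizer look roughly like the sketch below (an English Tokenizer.add_special_case example; whether the mecab-backed Korean tokenizer exposes the same method is an assumption to verify, and if it doesn't, customizing the mecab dictionary is the alternative):

from spacy.lang.en import English
from spacy.symbols import ORTH

nlp = English()
# Register a special case so the rule-based tokenizer splits "gimme" into two tokens.
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])
print([t.text for t in nlp("gimme that")])  # ['gim', 'me', 'that']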

If you're training a new model, then spacy has some ability to align mismatched tokens between the tokenizer and the training data and learn as much as it can from the good alignments. The parser can also learn to merge subtokens if the tokenizer oversplits, but it's still a bit experimental.
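
The alignment behaviour can be inspected with GoldParse (a sketch against the v2.x training API; the character offsets below are only meant to cover 김일성 in the example sentence and are otherwise illustrative):

from spacy.gold import GoldParse

# GoldParse aligns the gold annotations to the doc's tokenization; tokens whose
# annotations can't be aligned are flagged and skipped during training.
gold = GoldParse(doc, entities=[(11, 14, "PER")])
print(gold.ner)  # BILUO tags projected onto the doc's own tokens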

I've been working on training a model for Chinese, where there are a lot of misalignments between the default tokenizer (jieba) and the training data (OntoNotes). The results are not great so far, either with jieba (lots of misalignments) or with learning to merge subtokens after splitting each text into individual characters; the subtok learning/merging isn't performing as well as intended yet.

lock[bot] commented 4 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.