ai-forever / ner-bert

BERT-NER (ner-bert) with Google BERT https://github.com/google-research.
MIT License

BIO vs IO #22

Closed Ulitochka closed 5 years ago

Ulitochka commented 5 years ago

Hello.

In your example (https://github.com/sberbank-ai/ner-bert/blob/master/examples/factrueval-nmt.ipynb) you use BIO markup. But in the code (bert_data.py, line 187):

    # prev_label = ""
    for idx_, (orig_token, label) in enumerate(zip(orig_tokens, labels)):
        # Fix BIO to IO as BERT proposed https://arxiv.org/pdf/1810.04805.pdf
        try:

you use IO. How do you get the original markup back after training?

king-menin commented 5 years ago

We can recover the original labels by replacing the label of each entity's first token with its B- form. For example: I-O I-O I-PER I-PER I-O -> I-O I-O B-PER I-PER I-O
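The replacement described above could be sketched roughly as follows. This is not code from the repository: the function name `io_to_bio` and the `outside` parameter are hypothetical, and the sketch assumes that `I-O` is the outside tag, as in the example.

```python
def io_to_bio(labels, outside="I-O"):
    """Turn IO labels into BIO by marking the first token of each entity run.

    Hypothetical helper, not part of ner-bert. Assumes every label has the
    form "I-<TYPE>" and that `outside` marks tokens outside any entity.
    """
    restored = []
    prev = outside
    for label in labels:
        if label != outside and label != prev:
            # First token of a new run of this entity type: I-X becomes B-X.
            restored.append("B-" + label[2:])
        else:
            restored.append(label)
        prev = label
    return restored


print(io_to_bio(["I-O", "I-O", "I-PER", "I-PER", "I-O"]))
# ['I-O', 'I-O', 'B-PER', 'I-PER', 'I-O']
```

A run is detected only when the label changes, so two directly adjacent entities of the same type come back as a single entity.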

Ulitochka commented 5 years ago

What if we have 2 adjacent entities? I-O I-O I-PER I-PER I-PER I-PER I-O -> I-O I-O B-PER I-PER I-PER B-PER I-O
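To make the concern concrete, here is a small sketch showing that the BIO to IO conversion loses information. The helper `bio_to_io` is hypothetical (not taken from bert_data.py) and assumes labels of the form `B-<TYPE>` or `I-<TYPE>`.

```python
def bio_to_io(labels):
    """Collapse BIO labels to IO by rewriting every B-X as I-X.

    Hypothetical helper for illustration only.
    """
    return ["I-" + label[2:] if label.startswith("B-") else label
            for label in labels]


# One four-token entity versus two adjacent entities of the same type.
one_entity = ["I-O", "I-O", "B-PER", "I-PER", "I-PER", "I-PER", "I-O"]
two_entities = ["I-O", "I-O", "B-PER", "I-PER", "I-PER", "B-PER", "I-O"]

# Both collapse to the same IO sequence, so the boundary between
# the two entities cannot be restored from the IO labels alone.
print(bio_to_io(one_entity) == bio_to_io(two_entities))  # True
```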

king-menin commented 5 years ago

It fails in that case for now, but this situation is very rare. You can switch back to BIO markup.

Ulitochka commented 5 years ago

Does this change ("Fix BIO to IO as BERT proposed", https://arxiv.org/pdf/1810.04805.pdf) improve quality?

king-menin commented 5 years ago

For our internal tasks, yes. But I took second place in the AGGR-2019 competition with BIO markup, so you can pick the code from here (with BIO).

Ulitochka commented 5 years ago

Ok, thanks!