dsindex / ntagger

reference pytorch code for named entity tagging

(Maybe) wrong implementation of crf layer #1

Closed hasansalimkanmaz closed 3 years ago

hasansalimkanmaz commented 3 years ago

I am working on adding a CRF layer on top of a BERT-like model, and I am currently stuck on how to handle subtokens.

Let me explain my situation:

I use pad_token_label_id=-100 by default, which makes the loss ignore subtokens, as expected. However, when I add a CRF layer on top of BERT, this pad_token_label_id causes an IndexError inside the CRF layer, because the CRF layer tries to look up the label with index -100.

Possible problem with your implementation:

What should be implemented?

In the CRF layer, we shouldn't take `pad_token_label_id`s into account, because subtokens don't have any label in a tagging task. We need to eliminate these subtokens before the CRF layer, as we do with attention_mask.
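
A minimal sketch of the failure described above, written with plain Python lists instead of PyTorch tensors so it runs anywhere. Everything except `pad_token_label_id = -100` is an assumption made for illustration: `NUM_LABELS`, `transitions`, `gold_path_score`, and `first_subtoken_mask` are hypothetical names, not ntagger code.

```python
PAD_TOKEN_LABEL_ID = -100   # label given to non-first subtokens, special tokens, padding
NUM_LABELS = 9              # assumed tag-set size, for illustration only

# Toy CRF transition table: transitions[i][j] scores moving from tag i to tag j.
transitions = [[0.0] * NUM_LABELS for _ in range(NUM_LABELS)]

# Toy label ids for the tokens "[CLS] johan ##son was [SEP]".
label_ids = [PAD_TOKEN_LABEL_ID, 3, PAD_TOKEN_LABEL_ID, 0, PAD_TOKEN_LABEL_ID]


def gold_path_score(labels):
    """Sum transition scores along a label path, as a CRF does for the gold sequence."""
    return sum(transitions[prev][cur] for prev, cur in zip(labels, labels[1:]))


def first_subtoken_mask(labels, pad_id=PAD_TOKEN_LABEL_ID):
    """True where a real label exists, False where the label is the pad id."""
    return [label != pad_id for label in labels]


try:
    gold_path_score(label_ids)
    failed = False
except IndexError:
    # -100 is not a valid row of a 9 x 9 transition table.
    failed = True

# Dropping the masked positions leaves a label path the CRF can score.
mask = first_subtoken_mask(label_ids)
kept = [label for label, keep in zip(label_ids, mask) if keep]
```

The mask plays the same role for the CRF that attention_mask plays for the encoder: positions without a real label never reach the transition lookup.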

If you need me to elaborate on this issue, let me know what is missing above.

Thanks in advance.

dsindex commented 3 years ago

@hasansalimkanmaz

i agree with you. the inconsistent results with bert+crf may support your point.

for that reason, i had implemented bert subtoken alignment in tensorflow code.

https://github.com/dsindex/etagger/blob/master/feed.py#L79

    """Align bert_embeddings via bert_wordidx2tokenidx
    ex) word : 'johanson was a guy to' [0 ~ 4]
        token : 'johan ##son was a gu ##y t ##o' [0 ~ 7]
        wordidx2tokenidx : [1 3 4 5 7 9 0 0 ...] (bert embedding begins with [CLS] token)
        bert embedding : [em('CLS'), em('johan'), em('##son'), em('was'), em('a'), em('gu'), em('##y'), em('t'), em('##o'), 0, ...]
    """
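
The alignment idea in that docstring can be sketched roughly as follows. This is not the code from `feed.py`: `build_wordidx2tokenidx` and `align` are hypothetical helpers, the per-word token lists are hard-coded instead of coming from a tokenizer, and the sketch records only the index of each word's first token, so it does not try to reproduce the exact array shown in the docstring.

```python
def build_wordidx2tokenidx(word_pieces, max_len, offset=1):
    """Return, for each word, the index of its first token in the BERT input.

    word_pieces: per-word token lists, e.g. [['johan', '##son'], ['was']]
    offset:      1, because position 0 is occupied by [CLS]
    The result is zero-padded on the right up to max_len.
    """
    mapping = []
    token_idx = offset
    for pieces in word_pieces:
        mapping.append(token_idx)      # first token of this word
        token_idx += len(pieces)       # skip over its remaining subtokens
    return mapping + [0] * (max_len - len(mapping))


def align(token_embeddings, wordidx2tokenidx, num_words):
    """Pick one embedding per word: the one at the word's first-token index."""
    return [token_embeddings[wordidx2tokenidx[i]] for i in range(num_words)]


word_pieces = [['johan', '##son'], ['was'], ['a'], ['gu', '##y'], ['t', '##o']]
mapping = build_wordidx2tokenidx(word_pieces, max_len=8)

# Stand-in "embeddings": the token strings themselves, so the result is easy to read.
tokens = ['[CLS]'] + [p for pieces in word_pieces for p in pieces] + ['[SEP]']
aligned = align(tokens, mapping, num_words=len(word_pieces))
```

With real embeddings the same index lookup would be a gather over the sequence dimension, giving one vector per word and no vectors for the trailing subtokens.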

however, that was a feature-based approach using bert embeddings (no fine-tuning of bert), and i have not yet tried to implement it in pytorch.

dsindex commented 3 years ago

@hasansalimkanmaz

i ran an experiment as follows:

case 1) set subword label to pad_token_label_id=0

case 2) set subword label to original label except the default label 'O'

train

$ python train.py --config=configs/config-bert.json --data_dir=data/conll2003 --save_path=pytorch-model-bert.pt --bert_model_name_or_path=./embeddings/bert-base-cased --bert_output_dir=bert-checkpoint --batch_size=32 --lr=1e-5 --epoch=10 --bert_freezing_epoch=3 --bert_lr_during_freezing=1e-3 --use_crf

evaluate

$ python evaluate.py --config=configs/config-bert.json --data_dir=data/conll2003 --model_path=pytorch-model-bert.pt --bert_output_dir=bert-checkpoint --use_crf
$ cd data/conll2003; perl ../../etc/conlleval.pl < test.txt.pred ; cd ../..

  - util_bert.py
for word, pos, label in zip(example.words, example.poss, example.labels):
    # word extension
    word_tokens = tokenizer.tokenize(word)
    tokens.extend(word_tokens)
    # pos extension: set same pos_id
    pos_id = pos_map[pos]
    pos_ids.extend([pos_id] + [pos_id] * (len(word_tokens) - 1))
    # label extension: set pad_token_label_id
    label_id = label_map[label]
    if opt.bert_use_sub_label:
        if label == config['default_label']:
            # ex) 'round', '##er' -> 1/'O', 1/'O'
            sub_token_label = label
            sub_token_label_id = label_map[sub_token_label]
            label_ids.extend([label_id] + [sub_token_label_id] * (len(word_tokens) - 1))
        else:
            # ex) 'BR', '##US', '##SE', '##LS' -> 6/'B-LOC', 9/'I-LOC', 9/'I-LOC', 9/'I-LOC'
            sub_token_label = label
            prefix, suffix = label.split('-', maxsplit=1)
            if prefix == 'B': sub_token_label = 'I-' + suffix
            sub_token_label_id = label_map[sub_token_label]
            label_ids.extend([label_id] + [sub_token_label_id] * (len(word_tokens) - 1))
    else:
        label_ids.extend([label_id] + [pad_token_label_id] * (len(word_tokens) - 1))
  - since i change the subword labels to 'I-' label sequences, the CRF layer on top should be sound.
  - `seqeval` unnecessarily takes subword labels into consideration.
  - therefore, its F1 is not the final score we want. we should write the prediction results to a file and then use the `conlleval.pl` script to evaluate.
  - evaluate.py
# write prediction
try:
    pred_path = opt.test_path + '.pred'
    with open(pred_path, 'w', encoding='utf-8') as f:
        for i, bucket in enumerate(data):      # foreach sentence
            if i >= ys.shape[0]:
                logger.info("[Stop to write predictions] : %s" % (i))
                break
            use_subtoken = False
            ys_idx = 0
            if config['emb_class'] not in ['glove', 'elmo']:
                use_subtoken = True
                ys_idx = 1 # account '[CLS]'
            for j, entry in enumerate(bucket): # foreach token
                entry = bucket[j]
                pred_label = default_label
                if ys_idx < ys.shape[1]:
                    pred_label = labels[preds[i][ys_idx]]
                entry.append(pred_label)
                f.write(' '.join(entry) + '\n')
                if use_subtoken:
                    word = entry[0]
                    word_tokens = model.bert_tokenizer.tokenize(word)
                    ys_idx += len(word_tokens)
                else:
                    ys_idx += 1
            f.write('\n')
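
The index stepping in the excerpt above (start after `[CLS]`, read one prediction per word, then advance by the number of subtokens) can be isolated in a small, self-contained sketch. `word_level_predictions` is a hypothetical helper, and the per-word token lists are hard-coded rather than produced by `model.bert_tokenizer`.

```python
def word_level_predictions(word_pieces, token_preds, default_label='O', has_cls=True):
    """Collapse token-level predictions to one label per word.

    word_pieces: per-word token lists, e.g. [['BR', '##US'], ['round', '##er']]
    token_preds: predicted labels for the full token sequence (including [CLS])
    Words whose first token falls beyond the predictions get default_label.
    """
    preds = []
    idx = 1 if has_cls else 0          # skip [CLS] when present
    for pieces in word_pieces:
        if idx < len(token_preds):
            preds.append(token_preds[idx])
        else:
            preds.append(default_label)
        idx += len(pieces)             # jump to the next word's first token
    return preds


word_pieces = [['BR', '##US', '##SE', '##LS'], ['round', '##er']]
token_preds = ['O', 'B-LOC', 'I-LOC', 'I-LOC', 'I-LOC', 'O', 'O', 'O']
word_preds = word_level_predictions(word_pieces, token_preds)

# If the prediction sequence was truncated, later words fall back to the default.
truncated_preds = word_level_predictions(word_pieces, token_preds[:4])
```

Only the first subtoken's prediction reaches the output file, which is why `conlleval.pl` scores words rather than subtokens.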


as a result, i got a slightly better F1 score.

<img width="757" alt="Screenshot 2021-02-19 9:45:15 PM" src="https://user-images.githubusercontent.com/8259057/108506288-e2102a00-72fb-11eb-97df-d1b1e760f740.png">

hasansalimkanmaz commented 3 years ago

Thanks for your fast response. I think your approach is not what the current trend expects: according to this, we shouldn't give a label to subwords. Anyway, it is interesting to see that it returns slightly better results.

Currently, I am busy with something else and can't continue my CRF work. If I do, I will let you know via this thread.

dsindex commented 3 years ago

@hasansalimkanmaz

i just ran another experiment.

case 3) slice the outputs of the BERT layer (i.e., the logits) so that only the first token of each word remains

as you pointed out, i tried removing all subword embeddings of a word except the first one. doing so, we get a sound usage of the crf layer and also eliminate needless computation cost.
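
A rough sketch of what such slicing can look like for a padded batch, using nested lists in place of tensors. The thread does not show how `--bert_use_crf_slice` is implemented, so `slice_first_subtokens`, `first_mask`, and `pad_row` are hypothetical names and the whole block is an assumption, not ntagger's method.

```python
def slice_first_subtokens(logits, first_mask, pad_row):
    """Keep only first-subtoken rows, left-aligned and re-padded per batch.

    logits:     [batch][seq_len][num_labels] scores from the encoder
    first_mask: [batch][seq_len] booleans, True at the first subtoken of each word
    pad_row:    [num_labels] filler used to pad shorter sequences
    Returns (sliced_logits, word_mask), both of length max_words per sequence.
    """
    max_words = max(sum(row) for row in first_mask)
    sliced, word_mask = [], []
    for seq_logits, seq_mask in zip(logits, first_mask):
        kept = [vec for vec, keep in zip(seq_logits, seq_mask) if keep]
        n_pad = max_words - len(kept)
        sliced.append(kept + [pad_row] * n_pad)
        word_mask.append([True] * len(kept) + [False] * n_pad)
    return sliced, word_mask


# Two toy sequences with 2 labels; the first score encodes the token position.
logits = [
    [[0, 0], [1, 0], [2, 0], [3, 0], [4, 0]],        # [CLS] johan ##son was [SEP]
    [[10, 0], [11, 0], [12, 0], [13, 0], [14, 0]],   # [CLS] a [SEP] [PAD] [PAD]
]
first_mask = [
    [False, True, False, True, False],
    [False, True, False, False, False],
]
sliced, word_mask = slice_first_subtokens(logits, first_mask, pad_row=[0.0, 0.0])
```

The CRF then receives one row per word plus a word-level mask, so it never sees a position whose label is the pad id, and the sequence it decodes is shorter than the subtoken sequence.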

  1. modification summarized
  1. usage:
    # slice logits to keep only the first token of each word before applying crf: --bert_use_crf_slice
    # preprocessing
    $ python preprocess.py --config=configs/config-bert.json --data_dir=data/conll2003 --bert_model_name_or_path=./embeddings/bert-base-cased
    # train
    $ python train.py --config=configs/config-bert.json --data_dir=data/conll2003 --save_path=pytorch-model-bert.pt --bert_model_name_or_path=./embeddings/bert-base-cased --bert_output_dir=bert-checkpoint --batch_size=32 --lr=1e-5 --epoch=10 --bert_freezing_epoch=3 --bert_lr_during_freezing=1e-3 --use_crf --bert_use_crf_slice
    # evaluate
    $ python evaluate.py --config=configs/config-bert.json --data_dir=data/conll2003 --model_path=pytorch-model-bert.pt --bert_output_dir=bert-checkpoint --use_crf --bert_use_crf_slice
    $ cd data/conll2003; perl ../../etc/conlleval.pl < test.txt.pred ; cd ../..
    INFO:__main__:[F1] : 0.913277459197177, 3684
    INFO:__main__:[Elapsed Time] : 3684 examples, 151587.14032173157ms, 41.12043155459907ms on average
    accuracy:  98.26%; precision:  91.01%; recall:  91.64%; FB1:  91.33

however, although i had expected a better result, the F1 improvement from this approach may not be statistically significant.

[screenshot: 2021-02-20 10:18:45 PM]

hasansalimkanmaz commented 3 years ago

Thank you very much for the info, @dsindex. I will let you know when I have run a similar experiment in my own setting.

dsindex commented 3 years ago

A comparison of --bert_use_sub_label vs --bert_use_crf_slice

  1. CoNLL 2003 (English)
[screenshots: 2021-03-01 2:07:18 PM, 2021-03-01 2:07:56 PM]
  1. Naver NER (Korean)
[screenshots: 2021-03-01 2:11:55 PM, 2021-03-01 2:12:20 PM]

the results are interesting. the Korean dataset generally has more subword tokens than CoNLL 2003, so i guess slicing the subword logits works better for it.

hasansalimkanmaz commented 3 years ago

I ran my experiment by training a LayoutLM model on scanned documents. Unfortunately, I can't say that the results are better; they are very close to each other. Maybe these experiments explain why the community doesn't tend to use it.

image

hasansalimkanmaz commented 3 years ago

Feel free to close the issue, @dsindex. Thanks for your efforts.

dsindex commented 3 years ago

@hasansalimkanmaz much appreciated :)

dsindex commented 3 years ago

[screenshots: 2021-03-25 8:33:06 PM, 2021-03-25 8:33:15 PM, 2021-03-25 8:33:41 PM, 2021-04-02 1:32:32 PM, 2021-04-02 1:32:44 PM]

dsindex commented 3 years ago

i am going to change the --bert_use_crf_slice option to --bert_use_subword_pooling. so, i released a backup of the code (https://github.com/dsindex/ntagger/releases/tag/v1.0) before the modification.