@hasansalimkanmaz
I agree with you. The inconsistent results with BERT+CRF may support your thought.
Because of that, I had implemented TensorFlow code for BERT subtoken alignment:
https://github.com/dsindex/etagger/blob/master/feed.py#L79

```
"""Align bert_embeddings via bert_wordidx2tokenidx
ex) word             : 'johanson was a guy to'           [0 ~ 4]
    token            : 'johan ##son was a gu ##y t ##o'  [0 ~ 7]
    wordidx2tokenidx : [1 3 4 5 7 9 0 0 ...]  (bert embedding begins with [CLS] token)
    bert embedding   : [em('CLS'), em('johan'), em('##son'), em('was'), em('a'),
                        em('gu'), em('##y'), em('t'), em('##o'), 0, ...]
"""
```
However, that was a feature-based approach using BERT embeddings (no fine-tuning of BERT), and I have not yet tried to implement it in PyTorch.
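For illustration, here is a rough PyTorch sketch (not the etagger TensorFlow code) of how such an alignment can be used to pick each word's first-subtoken embedding out of the BERT output; the indices follow the docstring example above, everything else is a toy assumption.

```python
import torch

# BERT output for '[CLS] johan ##son was a gu ##y t ##o' -> shape (num_tokens, hidden)
hidden = 8
bert_embeddings = torch.randn(9, hidden)  # toy values

# first-subtoken index per word: johanson->1, was->3, a->4, guy->5, to->7
wordidx2tokenidx = torch.tensor([1, 3, 4, 5, 7])

# one embedding per word: the embedding of its first subtoken
word_embeddings = bert_embeddings[wordidx2tokenidx]
print(word_embeddings.shape)  # torch.Size([5, 8])
```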
@hasansalimkanmaz
I ran an experiment like the one below:
case 1) set subword label to pad_token_label_id=0
- ex) 'BR', '##US', '##SE', '##LS' -> 6/'B-LOC', 0/'<pad>', 0/'<pad>', 0/'<pad>'
case 2) set the subword labels to the original word label with 'B-' converted to 'I-' (the default label 'O' stays 'O')
```
# using sub token label
# preprocessing
$ python preprocess.py --config=configs/config-bert.json --data_dir=data/conll2003 --bert_model_name_or_path=./embeddings/bert-base-cased --bert_use_sub_label
# train
$ python train.py --config=configs/config-bert.json --data_dir=data/conll2003 --save_path=pytorch-model-bert.pt --bert_model_name_or_path=./embeddings/bert-base-cased --bert_output_dir=bert-checkpoint --batch_size=32 --lr=1e-5 --epoch=10 --bert_freezing_epoch=3 --bert_lr_during_freezing=1e-3 --use_crf
# evaluate
$ python evaluate.py --config=configs/config-bert.json --data_dir=data/conll2003 --model_path=pytorch-model-bert.pt --bert_output_dir=bert-checkpoint --use_crf
$ cd data/conll2003; perl ../../etc/conlleval.pl < test.txt.pred ; cd ../..
```
- util_bert.py

```python
for word, pos, label in zip(example.words, example.poss, example.labels):
    # word extension
    word_tokens = tokenizer.tokenize(word)
    tokens.extend(word_tokens)
    # pos extension: assign the same pos_id to all subtokens
    pos_id = pos_map[pos]
    pos_ids.extend([pos_id] + [pos_id] * (len(word_tokens) - 1))
    # label extension: sub-token label or pad_token_label_id
    label_id = label_map[label]
    if opt.bert_use_sub_label:
        if label == config['default_label']:
            # ex) 'round', '##er' -> 1/'O', 1/'O'
            sub_token_label = label
            sub_token_label_id = label_map[sub_token_label]
            label_ids.extend([label_id] + [sub_token_label_id] * (len(word_tokens) - 1))
        else:
            # ex) 'BR', '##US', '##SE', '##LS' -> 6/'B-LOC', 9/'I-LOC', 9/'I-LOC', 9/'I-LOC'
            sub_token_label = label
            prefix, suffix = label.split('-', maxsplit=1)
            if prefix == 'B': sub_token_label = 'I-' + suffix
            sub_token_label_id = label_map[sub_token_label]
            label_ids.extend([label_id] + [sub_token_label_id] * (len(word_tokens) - 1))
    else:
        label_ids.extend([label_id] + [pad_token_label_id] * (len(word_tokens) - 1))
```
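As a quick sanity check of the 'B-' to 'I-' conversion above, here is a toy run with an assumed label_map (the ids 6/'B-LOC' and 9/'I-LOC' are taken from the example comment; the rest is hypothetical):

```python
label_map = {'<pad>': 0, 'O': 1, 'B-LOC': 6, 'I-LOC': 9}  # assumed subset of the real map
word_tokens = ['BR', '##US', '##SE', '##LS']
label = 'B-LOC'

label_id = label_map[label]
prefix, suffix = label.split('-', maxsplit=1)
sub_token_label = 'I-' + suffix if prefix == 'B' else label
sub_token_label_id = label_map[sub_token_label]

label_ids = [label_id] + [sub_token_label_id] * (len(word_tokens) - 1)
print(label_ids)  # [6, 9, 9, 9]
```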
- Since I change the subword labels to 'I-' label sequences, the CRF layer on top should still be sound.
- However, `seqeval` unnecessarily takes these subword labels into consideration.
- Therefore, its F1 is not the final score we want. We should write the prediction results to a file and then evaluate with the `conlleval.pl` script.
- evaluate.py
```python
# write prediction
try:
    pred_path = opt.test_path + '.pred'
    with open(pred_path, 'w', encoding='utf-8') as f:
        for i, bucket in enumerate(data):  # for each sentence
            if i >= ys.shape[0]:
                logger.info("[Stop to write predictions] : %s" % (i))
                break
            use_subtoken = False
            ys_idx = 0
            if config['emb_class'] not in ['glove', 'elmo']:
                use_subtoken = True
                ys_idx = 1  # account for '[CLS]'
            for j, entry in enumerate(bucket):  # for each token
                entry = bucket[j]
                pred_label = default_label
                if ys_idx < ys.shape[1]:
                    pred_label = labels[preds[i][ys_idx]]
                entry.append(pred_label)
                f.write(' '.join(entry) + '\n')
                if use_subtoken:
                    word = entry[0]
                    word_tokens = model.bert_tokenizer.tokenize(word)
                    ys_idx += len(word_tokens)  # skip this word's subtokens
                else:
                    ys_idx += 1
            f.write('\n')
```
As a result, I got a slightly better F1 score.
<img width="757" alt="스크린샷 2021-02-19 오후 9 45 15" src="https://user-images.githubusercontent.com/8259057/108506288-e2102a00-72fb-11eb-97df-d1b1e760f740.png">
Thanks for your fast response. I think your approach is not what the current trend expects; according to it, we shouldn't assign labels to subwords. Anyway, it is interesting to see that it returns slightly better results.
Currently, I am busy with something else and can't continue my CRF work. If I do, I will let you know via this thread.
@hasansalimkanmaz
I just did another experiment.
case 3) slice the output of the BERT layer (i.e., the logits) so that only each word's first token remains
As you pointed out, I tried to remove all subword embeddings of a word except the first one. Doing so gives a sound input to the CRF layer and also eliminates needless computation.
https://github.com/dsindex/ntagger/commit/b3c7a0b1a55e349b826911c30f8d6d7adbba3b44
First, build word2token_idx in util_bert.py:
```
word     : the dog is hairy
word_idx : 0   1   2  3
------------------------------------------------------------------
tokens      : [CLS] the dog is ha ##iry . [SEP] <pad> <pad> <pad> ...
token_idx   : 0     1   2   3  4  5     6 7     8     9     10    ...
input_ids   : x     x   x   x  x  x     x x     0     0     0     ...
segment_ids : 0     0   0   0  0  0     0 0     0     0     0     ...
input_mask  : 1     1   1   1  1  1     1 1     0     0     0     ...
label_ids   : 0     1   1   1  1  0     1 0     0     0     0     ...
------------------------------------------------------------------
idx            : 0 1 2 3
word2token_idx : 1 2 3 4 0 0 0 ...
word2token_idx[idx] = token_idx
```
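A minimal sketch, assuming a HuggingFace BertTokenizer, of how such a word2token_idx row could be built during preprocessing (`build_word2token_idx` is an illustrative name, not the actual ntagger function):

```python
from transformers import BertTokenizer

def build_word2token_idx(words, tokenizer, max_seq_length):
    tokens = ['[CLS]']
    word2token_idx = []
    for word in words:
        word2token_idx.append(len(tokens))      # token_idx of the word's first subtoken
        tokens.extend(tokenizer.tokenize(word))
    tokens.append('[SEP]')
    # pad with 0 so padded word slots can be masked out later (0 points at [CLS])
    word2token_idx += [0] * (max_seq_length - len(word2token_idx))
    return tokens, word2token_idx

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
tokens, word2token_idx = build_word2token_idx('the dog is hairy'.split(), tokenizer, max_seq_length=10)
print(word2token_idx)  # [1, 2, 3, 4, 0, 0, 0, 0, 0, 0] if 'hairy' splits as 'ha', '##iry' (as in the diagram)
```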
Second, slice the logits before applying the CRF in model.py:
```python
if not self.use_crf: return logits
if self.use_crf and self.use_crf_slice:
    word2token_idx = x[4]
    mask_word2token_idx = torch.sign(torch.abs(word2token_idx)).to(torch.uint8).unsqueeze(-1).to(self.device)
    # slice logits to keep only each word's first token before applying crf.
    # solution from https://stackoverflow.com/questions/55628014/indexing-a-3d-tensor-using-a-2d-tensor
    offset = torch.arange(0, logits.size(0) * logits.size(1), logits.size(1)).to(self.device)
    index = word2token_idx + offset.unsqueeze(1)
    logits = logits.reshape(-1, logits.shape[-1])[index]
    logits *= mask_word2token_idx
prediction = self.crf.decode(logits)
```
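To see what the offset/reshape indexing does, here is a self-contained toy check (shapes and values are assumptions, not the ntagger configuration):

```python
import torch

batch_size, seq_len, num_labels = 2, 5, 3
logits = torch.randn(batch_size, seq_len, num_labels)
# word2token_idx[b][w] = token position of word w's first subtoken (0 = padded word slot)
word2token_idx = torch.tensor([[1, 2, 4, 0, 0],
                               [1, 3, 0, 0, 0]])

offset = torch.arange(0, batch_size * seq_len, seq_len)   # tensor([0, 5])
index = word2token_idx + offset.unsqueeze(1)              # flat indices into (batch*seq, labels)
sliced = logits.reshape(-1, num_labels)[index]            # (batch, n_word_slots, num_labels)

mask = torch.sign(torch.abs(word2token_idx)).unsqueeze(-1)
sliced = sliced * mask                                    # zero out padded word slots
print(sliced.shape)  # torch.Size([2, 5, 3])
```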
wasi-ahmad (author of the above stackoverflow answer)

Third, slice y (the gold labels) before computing the loss in train.py and evaluate.py:
```python
x = to_device(x, opt.device)
y = to_device(y, opt.device)
if opt.use_crf:
    with autocast(enabled=opt.use_amp):
        mask = x[1].to(torch.uint8)
        if opt.bert_use_crf_slice:
            # slice y to keep only each word's first token.
            word2token_idx = x[4]
            mask = torch.sign(torch.abs(word2token_idx)).to(torch.uint8).to(opt.device)
            y = y.gather(1, word2token_idx)
            y *= mask
```
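And a matching toy check for the y.gather step (the label ids 6/'B-LOC' and 1/'O' follow the earlier example; all values are toy assumptions):

```python
import torch

# gold label ids per subtoken, position 0 is [CLS], subtoken positions carry 0
y = torch.tensor([[0, 6, 0, 0, 1, 0],
                  [0, 1, 1, 2, 0, 0]])
word2token_idx = torch.tensor([[1, 4, 0, 0, 0, 0],
                               [1, 2, 3, 0, 0, 0]])

mask = torch.sign(torch.abs(word2token_idx))
y_words = y.gather(1, word2token_idx) * mask   # keep only each word's first-subtoken label
print(y_words)  # tensor([[6, 1, 0, 0, 0, 0],
                #         [1, 1, 2, 0, 0, 0]])
```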
```
# slicing logits to keep only each word's first token before applying crf: --bert_use_crf_slice
# preprocessing
$ python preprocess.py --config=configs/config-bert.json --data_dir=data/conll2003 --bert_model_name_or_path=./embeddings/bert-base-cased
# train
$ python train.py --config=configs/config-bert.json --data_dir=data/conll2003 --save_path=pytorch-model-bert.pt --bert_model_name_or_path=./embeddings/bert-base-cased --bert_output_dir=bert-checkpoint --batch_size=32 --lr=1e-5 --epoch=10 --bert_freezing_epoch=3 --bert_lr_during_freezing=1e-3 --use_crf --bert_use_crf_slice
# evaluate
$ python evaluate.py --config=configs/config-bert.json --data_dir=data/conll2003 --model_path=pytorch-model-bert.pt --bert_output_dir=bert-checkpoint --use_crf --bert_use_crf_slice
$ cd data/conll2003; perl ../../etc/conlleval.pl < test.txt.pred ; cd ../..

INFO:__main__:[F1] : 0.913277459197177, 3684
INFO:__main__:[Elapsed Time] : 3684 examples, 151587.14032173157ms, 41.12043155459907ms on average
accuracy: 98.26%; precision: 91.01%; recall: 91.64%; FB1: 91.33
```
However, although I had expected a better result, the improvement in F1 from this approach may not be statistically significant.
Thank you very much @dsindex for the info. I will let you know when I have done a similar experiment with my own setting.
A comparison of `--bert_use_sub_label` vs `--bert_use_crf_slice`:
- `--bert_use_sub_label`: assign 'I-' labels to all subword tokens.
- `--bert_use_crf_slice`: remove the subword slices of the logits before applying the CRF layer.

It shows interesting results. Generally, there are more subword tokens in the Korean dataset than in CoNLL 2003, so I guess slicing the subword logits works better for it.
I have conducted my experiment by training a LayoutLM model on scanned documents. Unfortunately, I can't say that the results are better; they are very close to each other. Maybe these experiments explain why the community has no tendency toward using it.
Feel free to close the issue, @dsindex. Thanks for your efforts.
@hasansalimkanmaz much appreciated :)
good reference
I am trying to change the `--bert_use_crf_slice` option to `--bert_use_subword_pooling`.
So, I have released the backup code (https://github.com/dsindex/ntagger/releases/tag/v1.0) before the modification.
I am working on adding a CRF layer on top of a BERT-like model. I am stuck with subtokens for now.
Let me explain my situation:
I am using `pad_token_label_id=-100` by default, and this leads to ignoring subtokens while calculating the loss, as expected. However, when I try to add a CRF layer on top of BERT, this `pad_token_label_id` results in an IndexError in the CRF layer, because the CRF layer tries to look up the label with index -100.

Possible problem with your implementation: you use `pad_token_label_id=0`, which is weird because in this case subtokens are also included in the loss, which is not what we expect for a tagging task.

What should be implemented? In the CRF layer, we shouldn't take `pad_token_label_id`s into account, because subtokens don't have any label for the tagging task. We need to eliminate these subtokens before the CRF layer, like we do with the attention_mask.
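For reference, a minimal sketch of that idea, assuming the pytorch-crf package (`torchcrf.CRF`) and subtoken labels set to `pad_token_label_id=-100`: drop the subtoken positions first, then feed only word-level emissions, labels, and a word-level mask to the CRF. This is essentially what the `--bert_use_crf_slice` experiment above does; all values below are toy examples, not code from this repository.

```python
import torch
from torchcrf import CRF  # pip install pytorch-crf

pad_token_label_id = -100
num_labels = 10
crf = CRF(num_labels, batch_first=True)

emissions = torch.randn(2, 6, num_labels)               # (batch, seq_len, num_labels), toy values
labels = torch.tensor([[-100, 6, -100, 1, -100, -100],  # [CLS], first subtoken, subtoken, word, [SEP], <pad>
                       [-100, 1, 1, -100, -100, -100]])

keep = labels != pad_token_label_id                     # positions that carry a real (first-subtoken) label
max_words = int(keep.sum(dim=1).max())

word_emissions = torch.zeros(2, max_words, num_labels)
word_labels = torch.zeros(2, max_words, dtype=torch.long)
word_mask = torch.zeros(2, max_words, dtype=torch.bool)
for b in range(labels.size(0)):
    idx = keep[b].nonzero(as_tuple=True)[0]
    word_emissions[b, :len(idx)] = emissions[b, idx]
    word_labels[b, :len(idx)] = labels[b, idx]
    word_mask[b, :len(idx)] = True

loss = -crf(word_emissions, word_labels, mask=word_mask, reduction='mean')  # CRF negative log-likelihood
best_paths = crf.decode(word_emissions, mask=word_mask)                     # word-level label id sequences
```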
If you need me to elaborate on this issue, let me know what is missing above.
Thanks in advance.