@hasansalimkanmaz
I agree with you. The inconsistent results with BERT+CRF may support your thought.
Because of that, I had implemented TensorFlow code for BERT subtoken alignment:
https://github.com/dsindex/etagger/blob/master/feed.py#L79

```
"""Align bert_embeddings via bert_wordidx2tokenidx
ex) word             : 'johanson was a guy to'           [0 ~ 4]
    token            : 'johan ##son was a gu ##y t ##o'  [0 ~ 7]
    wordidx2tokenidx : [1 3 4 5 7 9 0 0 ...]  (bert embedding begins with [CLS] token)
    bert embedding   : [em('CLS'), em('johan'), em('##son'), em('was'), em('a'),
                        em('gu'), em('##y'), em('t'), em('##o'), 0, ...]
"""
```
However, that was a feature-based approach using BERT embeddings (no fine-tuning of BERT), and I have not yet tried to implement it in PyTorch.
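For illustration, here is a rough PyTorch sketch (not the etagger TensorFlow code) of how such an alignment can be used to pick each word's first-subtoken embedding out of the BERT output; the indices follow the docstring example above, everything else is a toy assumption.

```python
import torch

# BERT output for '[CLS] johan ##son was a gu ##y t ##o' -> shape (num_tokens, hidden)
hidden = 8
bert_embeddings = torch.randn(9, hidden)  # toy values

# first-subtoken index per word: johanson->1, was->3, a->4, guy->5, to->7
wordidx2tokenidx = torch.tensor([1, 3, 4, 5, 7])

# one embedding per word: the embedding of its first subtoken
word_embeddings = bert_embeddings[wordidx2tokenidx]
print(word_embeddings.shape)  # torch.Size([5, 8])
```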
@hasansalimkanmaz
I ran an experiment like the one below:
case 1) set subword label to pad_token_label_id=0
- ex) 'BR', '##US', '##SE', '##LS' -> 6/'B-LOC', 0/'<pad>', 0/'<pad>', 0/'<pad>'
case 2) set the subword labels to the original word label with 'B-' converted to 'I-' (the default label 'O' stays 'O')
```
# using sub token label
# preprocessing
$ python preprocess.py --config=configs/config-bert.json --data_dir=data/conll2003 --bert_model_name_or_path=./embeddings/bert-base-cased --bert_use_sub_label
# train
$ python train.py --config=configs/config-bert.json --data_dir=data/conll2003 --save_path=pytorch-model-bert.pt --bert_model_name_or_path=./embeddings/bert-base-cased --bert_output_dir=bert-checkpoint --batch_size=32 --lr=1e-5 --epoch=10 --bert_freezing_epoch=3 --bert_lr_during_freezing=1e-3 --use_crf
# evaluate
$ python evaluate.py --config=configs/config-bert.json --data_dir=data/conll2003 --model_path=pytorch-model-bert.pt --bert_output_dir=bert-checkpoint --use_crf
$ cd data/conll2003; perl ../../etc/conlleval.pl < test.txt.pred ; cd ../..
```
- util_bert.py

```python
for word, pos, label in zip(example.words, example.poss, example.labels):
    # word extension
    word_tokens = tokenizer.tokenize(word)
    tokens.extend(word_tokens)
    # pos extension: assign the same pos_id to all subtokens
    pos_id = pos_map[pos]
    pos_ids.extend([pos_id] + [pos_id] * (len(word_tokens) - 1))
    # label extension: sub-token label or pad_token_label_id
    label_id = label_map[label]
    if opt.bert_use_sub_label:
        if label == config['default_label']:
            # ex) 'round', '##er' -> 1/'O', 1/'O'
            sub_token_label = label
            sub_token_label_id = label_map[sub_token_label]
            label_ids.extend([label_id] + [sub_token_label_id] * (len(word_tokens) - 1))
        else:
            # ex) 'BR', '##US', '##SE', '##LS' -> 6/'B-LOC', 9/'I-LOC', 9/'I-LOC', 9/'I-LOC'
            sub_token_label = label
            prefix, suffix = label.split('-', maxsplit=1)
            if prefix == 'B': sub_token_label = 'I-' + suffix
            sub_token_label_id = label_map[sub_token_label]
            label_ids.extend([label_id] + [sub_token_label_id] * (len(word_tokens) - 1))
    else:
        label_ids.extend([label_id] + [pad_token_label_id] * (len(word_tokens) - 1))
```
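As a quick sanity check of the 'B-' to 'I-' conversion above, here is a toy run with an assumed label_map (the ids 6/'B-LOC' and 9/'I-LOC' are taken from the example comment; the rest is hypothetical):

```python
label_map = {'<pad>': 0, 'O': 1, 'B-LOC': 6, 'I-LOC': 9}  # assumed subset of the real map
word_tokens = ['BR', '##US', '##SE', '##LS']
label = 'B-LOC'

label_id = label_map[label]
prefix, suffix = label.split('-', maxsplit=1)
sub_token_label = 'I-' + suffix if prefix == 'B' else label
sub_token_label_id = label_map[sub_token_label]

label_ids = [label_id] + [sub_token_label_id] * (len(word_tokens) - 1)
print(label_ids)  # [6, 9, 9, 9]
```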
- Since I change the subword labels to 'I-' label sequences, the CRF layer on top should still be sound.
- However, `seqeval` unnecessarily takes these subword labels into consideration.
- Therefore, its F1 is not the final score we want. We should write the prediction results to a file and then evaluate with the `conlleval.pl` script.
- evaluate.py
```python
# write prediction
try:
    pred_path = opt.test_path + '.pred'
    with open(pred_path, 'w', encoding='utf-8') as f:
        for i, bucket in enumerate(data):  # for each sentence
            if i >= ys.shape[0]:
                logger.info("[Stop to write predictions] : %s" % (i))
                break
            use_subtoken = False
            ys_idx = 0
            if config['emb_class'] not in ['glove', 'elmo']:
                use_subtoken = True
                ys_idx = 1  # account for '[CLS]'
            for j, entry in enumerate(bucket):  # for each token
                entry = bucket[j]
                pred_label = default_label
                if ys_idx < ys.shape[1]:
                    pred_label = labels[preds[i][ys_idx]]
                entry.append(pred_label)
                f.write(' '.join(entry) + '\n')
                if use_subtoken:
                    word = entry[0]
                    word_tokens = model.bert_tokenizer.tokenize(word)
                    ys_idx += len(word_tokens)  # skip this word's subtokens
                else:
                    ys_idx += 1
            f.write('\n')
```
As a result, I got a slightly better F1 score.
<img width="757" alt="스크린샷 2021-02-19 오후 9 45 15" src="https://user-images.githubusercontent.com/8259057/108506288-e2102a00-72fb-11eb-97df-d1b1e760f740.png">
Thanks for your fast response. I think your approach is not what the current trend expects; according to it, we shouldn't assign labels to subwords. Anyway, it is interesting to see that it returns slightly better results.
Currently, I am busy with something else and can't continue my CRF work. If I do, I will let you know via this thread.
@hasansalimkanmaz
I just did another experiment.
case 3) slice the output of the BERT layer (i.e., the logits) so that only each word's first token remains
As you pointed out, I tried to remove all subword embeddings of a word except the first one. Doing so gives a sound input to the CRF layer and also eliminates needless computation.
https://github.com/dsindex/ntagger/commit/b3c7a0b1a55e349b826911c30f8d6d7adbba3b44
First, build word2token_idx in util_bert.py:
```
word     : the dog is hairy
word_idx : 0   1   2  3
------------------------------------------------------------------
tokens      : [CLS] the dog is ha ##iry . [SEP] <pad> <pad> <pad> ...
token_idx   : 0     1   2   3  4  5     6 7     8     9     10    ...
input_ids   : x     x   x   x  x  x     x x     0     0     0     ...
segment_ids : 0     0   0   0  0  0     0 0     0     0     0     ...
input_mask  : 1     1   1   1  1  1     1 1     0     0     0     ...
label_ids   : 0     1   1   1  1  0     1 0     0     0     0     ...
------------------------------------------------------------------
idx            : 0 1 2 3
word2token_idx : 1 2 3 4 0 0 0 ...
word2token_idx[idx] = token_idx
```
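A minimal sketch, assuming a HuggingFace BertTokenizer, of how such a word2token_idx row could be built during preprocessing (`build_word2token_idx` is an illustrative name, not the actual ntagger function):

```python
from transformers import BertTokenizer

def build_word2token_idx(words, tokenizer, max_seq_length):
    tokens = ['[CLS]']
    word2token_idx = []
    for word in words:
        word2token_idx.append(len(tokens))      # token_idx of the word's first subtoken
        tokens.extend(tokenizer.tokenize(word))
    tokens.append('[SEP]')
    # pad with 0 so padded word slots can be masked out later (0 points at [CLS])
    word2token_idx += [0] * (max_seq_length - len(word2token_idx))
    return tokens, word2token_idx

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
tokens, word2token_idx = build_word2token_idx('the dog is hairy'.split(), tokenizer, max_seq_length=10)
print(word2token_idx)  # [1, 2, 3, 4, 0, 0, 0, 0, 0, 0] if 'hairy' splits as 'ha', '##iry' (as in the diagram)
```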
Second, slice the logits before applying the CRF in model.py:
```python
if not self.use_crf: return logits
if self.use_crf and self.use_crf_slice:
    word2token_idx = x[4]
    mask_word2token_idx = torch.sign(torch.abs(word2token_idx)).to(torch.uint8).unsqueeze(-1).to(self.device)
    # slice logits to keep only each word's first token before applying crf.
    # solution from https://stackoverflow.com/questions/55628014/indexing-a-3d-tensor-using-a-2d-tensor
    offset = torch.arange(0, logits.size(0) * logits.size(1), logits.size(1)).to(self.device)
    index = word2token_idx + offset.unsqueeze(1)
    logits = logits.reshape(-1, logits.shape[-1])[index]
    logits *= mask_word2token_idx
prediction = self.crf.decode(logits)
```
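To see what the offset/reshape indexing does, here is a self-contained toy check (shapes and values are assumptions, not the ntagger configuration):

```python
import torch

batch_size, seq_len, num_labels = 2, 5, 3
logits = torch.randn(batch_size, seq_len, num_labels)
# word2token_idx[b][w] = token position of word w's first subtoken (0 = padded word slot)
word2token_idx = torch.tensor([[1, 2, 4, 0, 0],
                               [1, 3, 0, 0, 0]])

offset = torch.arange(0, batch_size * seq_len, seq_len)   # tensor([0, 5])
index = word2token_idx + offset.unsqueeze(1)              # flat indices into (batch*seq, labels)
sliced = logits.reshape(-1, num_labels)[index]            # (batch, n_word_slots, num_labels)

mask = torch.sign(torch.abs(word2token_idx)).unsqueeze(-1)
sliced = sliced * mask                                    # zero out padded word slots
print(sliced.shape)  # torch.Size([2, 5, 3])
```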
wasi-ahmad (author of the above stackoverflow answer)

Third, slice y (the gold labels) before computing the loss in train.py and evaluate.py:
```python
x = to_device(x, opt.device)
y = to_device(y, opt.device)
if opt.use_crf:
    with autocast(enabled=opt.use_amp):
        mask = x[1].to(torch.uint8)
        if opt.bert_use_crf_slice:
            # slice y to keep only each word's first token.
            word2token_idx = x[4]
            mask = torch.sign(torch.abs(word2token_idx)).to(torch.uint8).to(opt.device)
            y = y.gather(1, word2token_idx)
            y *= mask
```
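And a matching toy check for the y.gather step (the label ids 6/'B-LOC' and 1/'O' follow the earlier example; all values are toy assumptions):

```python
import torch

# gold label ids per subtoken, position 0 is [CLS], subtoken positions carry 0
y = torch.tensor([[0, 6, 0, 0, 1, 0],
                  [0, 1, 1, 2, 0, 0]])
word2token_idx = torch.tensor([[1, 4, 0, 0, 0, 0],
                               [1, 2, 3, 0, 0, 0]])

mask = torch.sign(torch.abs(word2token_idx))
y_words = y.gather(1, word2token_idx) * mask   # keep only each word's first-subtoken label
print(y_words)  # tensor([[6, 1, 0, 0, 0, 0],
                #         [1, 1, 2, 0, 0, 0]])
```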
```
# slicing logits to keep only each word's first token before applying crf: --bert_use_crf_slice
# preprocessing
$ python preprocess.py --config=configs/config-bert.json --data_dir=data/conll2003 --bert_model_name_or_path=./embeddings/bert-base-cased
# train
$ python train.py --config=configs/config-bert.json --data_dir=data/conll2003 --save_path=pytorch-model-bert.pt --bert_model_name_or_path=./embeddings/bert-base-cased --bert_output_dir=bert-checkpoint --batch_size=32 --lr=1e-5 --epoch=10 --bert_freezing_epoch=3 --bert_lr_during_freezing=1e-3 --use_crf --bert_use_crf_slice
# evaluate
$ python evaluate.py --config=configs/config-bert.json --data_dir=data/conll2003 --model_path=pytorch-model-bert.pt --bert_output_dir=bert-checkpoint --use_crf --bert_use_crf_slice
$ cd data/conll2003; perl ../../etc/conlleval.pl < test.txt.pred ; cd ../..

INFO:__main__:[F1] : 0.913277459197177, 3684
INFO:__main__:[Elapsed Time] : 3684 examples, 151587.14032173157ms, 41.12043155459907ms on average
accuracy: 98.26%; precision: 91.01%; recall: 91.64%; FB1: 91.33
```
However, although I had expected a better result, the improvement in F1 from this approach may not be statistically significant.
Thank you very much @dsindex for the info. I will let you know when I have done a similar experiment with my own setting.
A comparison of `--bert_use_sub_label` vs `--bert_use_crf_slice`:
- `--bert_use_sub_label`: assign 'I-' labels to all subword tokens.
- `--bert_use_crf_slice`: remove the subword slices of the logits before applying the CRF layer.

It shows interesting results. Generally, there are more subword tokens in the Korean dataset than in CoNLL 2003, so I guess slicing the subword logits works better for it.
I have conducted my experiment by training a LayoutLM model on scanned documents. Unfortunately, I can't say that the results are better; they are very close to each other. Maybe these experiments explain why the community has no tendency toward using it.
Feel free to close the issue, @dsindex. Thanks for your efforts.
@hasansalimkanmaz much appreciated :)
good reference
I am trying to change the `--bert_use_crf_slice` option to `--bert_use_subword_pooling`.
So, I have released the backup code (https://github.com/dsindex/ntagger/releases/tag/v1.0) before the modification.
I am working on adding a CRF layer on top of a BERT-like model. I am stuck with subtokens for now.
Let me explain my situation:
I am using `pad_token_label_id=-100` by default, and this leads to ignoring subtokens while calculating the loss, as expected. However, when I try to add a CRF layer on top of BERT, this `pad_token_label_id` results in an IndexError in the CRF layer, because the CRF layer tries to look up the label with index -100.

Possible problem with your implementation: you use `pad_token_label_id=0`, which is weird because in this case subtokens are also included in the loss, which is not what we expect for a tagging task.

What should be implemented? In the CRF layer, we shouldn't take `pad_token_label_id`s into account, because subtokens don't have any label for the tagging task. We need to eliminate these subtokens before the CRF layer, like we do with the attention_mask.
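For reference, a minimal sketch of that idea, assuming the pytorch-crf package (`torchcrf.CRF`) and subtoken labels set to `pad_token_label_id=-100`: drop the subtoken positions first, then feed only word-level emissions, labels, and a word-level mask to the CRF. This is essentially what the `--bert_use_crf_slice` experiment above does; all values below are toy examples, not code from this repository.

```python
import torch
from torchcrf import CRF  # pip install pytorch-crf

pad_token_label_id = -100
num_labels = 10
crf = CRF(num_labels, batch_first=True)

emissions = torch.randn(2, 6, num_labels)               # (batch, seq_len, num_labels), toy values
labels = torch.tensor([[-100, 6, -100, 1, -100, -100],  # [CLS], first subtoken, subtoken, word, [SEP], <pad>
                       [-100, 1, 1, -100, -100, -100]])

keep = labels != pad_token_label_id                     # positions that carry a real (first-subtoken) label
max_words = int(keep.sum(dim=1).max())

word_emissions = torch.zeros(2, max_words, num_labels)
word_labels = torch.zeros(2, max_words, dtype=torch.long)
word_mask = torch.zeros(2, max_words, dtype=torch.bool)
for b in range(labels.size(0)):
    idx = keep[b].nonzero(as_tuple=True)[0]
    word_emissions[b, :len(idx)] = emissions[b, idx]
    word_labels[b, :len(idx)] = labels[b, idx]
    word_mask[b, :len(idx)] = True

loss = -crf(word_emissions, word_labels, mask=word_mask, reduction='mean')  # CRF negative log-likelihood
best_paths = crf.decode(word_emissions, mask=word_mask)                     # word-level label id sequences
```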
If you need me to elaborate on this issue, let me know what is missing above.
Thanks in advance.