I ran an experiment with the downloaded BERT-Finetuned model, but I only got an F1 score of 90.35. How can I reproduce your result?
$ python preprocess.py --config=configs/config-bert.json --data_dir=data/conll2003 --bert_model_name_or_path=./embeddings/BERT-Finetuned
$ python train.py --config=configs/config-bert.json --data_dir=data/conll2003 --save_path=pytorch-model-bert.pt --bert_model_name_or_path=./embeddings/BERT-Finetuned --bert_output_dir=bert-checkpoint --batch_size=32 --lr=1e-5 --epoch=30 --bert_disable_lstm
$ python evaluate.py --config=configs/config-bert.json --data_dir=data/conll2003 --model_path=pytorch-model-bert.pt --bert_output_dir=bert-checkpoint --bert_disable_lstm
INFO:__main__:[token classification F1] : 0.9034525169111833, 3684
INFO:__main__:[Elapsed Time] : 3684 examples, 73171.78511619568ms, 19.837638783513157ms on average
$ cd data/conll2003; perl ../../etc/conlleval.pl < test.txt.pred ; cd ../..
accuracy: 98.14%; precision: 89.66%; recall: 91.04%; FB1: 90.35
The result I shared was the train set score on the Bi-LSTM-MHA-CRF network. The evaluation score was F1: 0.8977
$ python preprocess.py --data_dir=data/conll2003 --bert_model_name_or_path model/BERT-Finetuned/cased_L-12_H-768_A-12 --bert_use_sub_label
$ python train.py --batch_size 64 --eval_batch_size 128 --config=configs/config-bert.json --data_dir=data/conll2003 --bert_model_name_or_path model/BERT-Finetuned/cased_L-12_H-768_A-12 --save_path=pytorch-model-bert-en.pt --bert_output_dir=bert-checkpoint-en --epoch=30 --bert_use_pos --use_char_cnn --use_mha --bert_use_feature_based --use_crf
$ python evaluate.py --config=configs/config-bert.json --data_dir=data/conll2003 --model_path=pytorch-model-bert-en.pt --bert_output_dir=bert-checkpoint-en --bert_use_pos --use_char_cnn --use_mha --bert_use_feature_based --use_crf
Also, based on your results, what would be the best parameters to optimize the model?
last_four_hidden_states = all_hidden_states[-4:]
Thanks.
Hi @dsindex,
I have a question:
What is the advantage of freezing BERT for a few epochs during fine-tuning?
As far as I know, in transfer learning we freeze a few layers of the model while fine-tuning, rather than freezing the whole model for a few epochs. Could you please elaborate on this?
Thanks :-)
@geo47
I think it is a kind of heuristic.
$ python train.py ... --bert_freezing_epoch=4 --bert_lr_during_freezing=1e-3
https://github.com/dsindex/ntagger/blob/master/train.py#L56
if args.bert_freezing_epoch > 0:
    # apply second optimizer/scheduler during freezing epochs
    if epoch_i < args.bert_freezing_epoch and optimizer_2nd != None and scheduler_2nd != None:
        optimizer = optimizer_2nd
        scheduler = scheduler_2nd
        freeze_bert = True
You may notice that the learning rate for GloVe+LSTM+CRF is generally larger than for BERT.
e.g., --lr=1e-4 vs --lr=1e-5
Then, how can we accelerate the learning curve for BERT+LSTM+CRF? Based on my experiments, freezing the BERT layers for some epochs while still updating the other layers (LSTM, CRF) yields much better results.
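For illustration, here is a minimal sketch of how such a two-optimizer setup might look. This is an assumption, not the exact code in train.py, and it assumes the model exposes its BERT encoder as model.bert_model:

import torch

# Sketch only: one optimizer over all parameters with the small BERT learning rate,
# and a second optimizer over the non-BERT parameters (LSTM, CRF, ...) with a larger
# learning rate that is used while BERT is frozen.
# `model` is assumed to be the tagger network with a `bert_model` submodule.
bert_param_ids = {id(p) for p in model.bert_model.parameters()}
non_bert_params = [p for p in model.parameters() if id(p) not in bert_param_ids]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)   # used after unfreezing
optimizer_2nd = torch.optim.AdamW(non_bert_params, lr=1e-3)  # used during freezing epochs

scheduler = torch.optim.lr_scheduler.LinearLR(optimizer)
scheduler_2nd = torch.optim.lr_scheduler.LinearLR(optimizer_2nd)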
But what about fine-tuning the BERT model itself, without LSTM/CRF, and freezing a few layers while fine-tuning?
Something like that: link
Also, I cannot reproduce the results reported in the BERT paper (92.4) for fine-tuning BERT-base on the NER task on the CoNLL-2003 dataset using this code. Have you reproduced those results?
Finally, my last query is regarding CharCNN. How do we train the CharCNN model to get the char embeddings? Do you have any reference blog or documentation to help understand this?
Thanks...
@geo47
1. I'm not sure whether it would be good or bad to freeze some layers inside the BERT model (e.g., embeddings, encoder.layer[:5]) instead of freezing all BERT layers during fine-tuning.
You could try it by modifying:
# 1. Embedding
if freeze_bert:
    with torch.no_grad():
        bert_embed_out, bert_outputs = self._compute_bert_embedding(x, head_mask=head_mask)
else:
    bert_embed_out, bert_outputs = self._compute_bert_embedding(x, head_mask=head_mask)
# bert_embed_out : [batch_size, seq_size, *]

=>

if freeze_bert:
    modules = [self.bert_model.embeddings, self.bert_model.encoder.layer[:5]]  # replace 5 by what you want
    for module in modules:
        for param in module.parameters():
            param.requires_grad = False
    ...
else:
    # reset requires_grad here
    ...
bert_embed_out, bert_outputs = self._compute_bert_embedding(x, head_mask=head_mask)
# bert_embed_out : [batch_size, seq_size, *]
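For the else branch, one possible (hypothetical) way to reset the gradients is to re-enable requires_grad for the same modules that were frozen, for example:

if not freeze_bert:
    # Hypothetical sketch: re-enable gradients for the modules frozen above.
    modules = [self.bert_model.embeddings, self.bert_model.encoder.layer[:5]]
    for module in modules:
        for param in module.parameters():
            param.requires_grad = True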
2. Without document context, we are not able to reproduce the results of the original BERT paper. https://github.com/dsindex/ntagger/issues/4#issuecomment-810304253 (I think it is impossible.)
3.
I'm sorry, but regarding your question about CharCNN, could you explain it in more detail?
The CharCNN layer is implemented with TextCNN by reshaping [batch_size, seq_length, char_length, char_dim]
to [batch_size * seq_length, char_length, char_dim] (a sort of trick).
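As a rough, illustrative sketch of that reshape trick (shapes and kernel sizes below are placeholders, not the repo's exact values):

import torch
import torch.nn as nn

batch_size, seq_length, char_length, char_dim = 2, 10, 50, 25
char_embed = torch.randn(batch_size, seq_length, char_length, char_dim)

# [batch_size, seq_length, char_length, char_dim] -> [batch_size * seq_length, char_length, char_dim]
x = char_embed.view(batch_size * seq_length, char_length, char_dim)

# Conv1d expects [N, channels, length], so use char_dim as the channel dimension.
conv = nn.Conv1d(in_channels=char_dim, out_channels=30, kernel_size=3, padding=1)
x = torch.relu(conv(x.transpose(1, 2)))   # [batch_size * seq_length, 30, char_length]
x = torch.max(x, dim=2).values            # max-pool over characters -> [batch_size * seq_length, 30]

# reshape back to per-token character features
char_feature = x.view(batch_size, seq_length, -1)   # [batch_size, seq_length, 30]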
Hi @dsindex
I got the first point.
About the second point: what could be the possible reason we can't reproduce the same results? Is the implementation different in a way that affects the original BERT fine-tuning?
For the CharCNN implemented by the TextCNN layer, my question was: how do we get the character embeddings? Does it train the character convolution during preprocess.py
to get the character embeddings, or do we not fine-tune it at all? This is confusing for me. As for the POS embedding, I could see that it gets an embedding for each unique tag from the dictionary and builds the embedding in the form of a look-up table (one-hot).
Thanks..
@geo47
As for the second point: in the first version of the paper, the authors did not mention using document context to get their results, which confused many people. The latest version, however, points out that they used full document context, so we should use it for reproduction. As far as I know, no one has achieved that F1 score without the full document context; even the LUKE paper says they also use full document context. You can check it through https://github.com/studio-ousia/luke
And for the last point: the character ids are generated using the ELMo library.
https://github.com/dsindex/ntagger/blob/master/util/util_bert.py#L390
# char extension
if args.bert_use_subword_pooling:
    c_ids = batch_to_ids([word])[0].detach().cpu().numpy().tolist()
    char_ids.extend(c_ids)
else:
    c_ids = batch_to_ids([word_tokens])[0].detach().cpu().numpy().tolist()
    char_ids.extend(c_ids)
Since the vocabulary size inside the ELMo library is 256 (ASCII codes), we can convert every word into a character-id sequence without worrying about the language: [batch_size, sequence_length, character_length]. https://github.com/dsindex/ntagger/blob/master/dataset/dataset.py#L79
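For reference, a minimal usage sketch of allennlp's batch_to_ids (the example sentence is arbitrary):

from allennlp.modules.elmo import batch_to_ids

# batch_to_ids takes a batch of tokenized sentences (a list of lists of words)
# and returns a tensor of ELMo character ids.
sentences = [["EU", "rejects", "German", "call"]]
char_ids = batch_to_ids(sentences)
print(char_ids.shape)   # [batch_size, sequence_length, character_length], e.g. torch.Size([1, 4, 50])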
And then we use a randomly initialized lookup embedding, just like the POS embeddings:
[character_vocab_size, character_dim]. https://github.com/dsindex/ntagger/blob/master/model/model.py#L30 (create_embedding_layer() without weights_matrix)
The character embedding will be trained via back-propagation during the fine-tuning task.
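A minimal sketch of such a randomly initialized, trainable lookup table (the sizes here are placeholders, and create_embedding_layer() in the repo may differ in detail):

import torch.nn as nn

# Random lookup table of shape [character_vocab_size, character_dim];
# there is no pretrained weights_matrix, so it is learned via back-propagation.
character_vocab_size, character_dim = 256, 25   # placeholder sizes
char_embedding = nn.Embedding(character_vocab_size, character_dim, padding_idx=0)

# char_ids: [batch_size, sequence_length, character_length]
# char_embedding(char_ids): [batch_size, sequence_length, character_length, character_dim]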
Oh, thanks.
I got my answer that we fine-tune the character embedding via TextCNN: "the character embedding will be trained via back-propagation during the fine-tuning task."
Hi @dsindex
The new BERT-Finetuned model produces better results than the LUKE benchmark: F1 score 94.60 on the CoNLL-2003 dataset.