dsindex / ntagger

reference pytorch code for named entity tagging

Pre-trained BERT model #7

Closed: geo47 closed this issue 3 years ago

geo47 commented 3 years ago

Hi @dsindex

The new BERT-Finetuned model produces better results than the LUKE benchmark: F1 score 94.60 on the CoNLL-2003 dataset.

dsindex commented 3 years ago

I ran an experiment with the downloaded BERT-Finetuned model, but I got only an F1 score of 90.35. How can I reproduce your result?

$ python preprocess.py --config=configs/config-bert.json --data_dir=data/conll2003 --bert_model_name_or_path=./embeddings/BERT-Finetuned

$ python train.py --config=configs/config-bert.json --data_dir=data/conll2003 --save_path=pytorch-model-bert.pt --bert_model_name_or_path=./embeddings/BERT-Finetuned --bert_output_dir=bert-checkpoint --batch_size=32 --lr=1e-5 --epoch=30 --bert_disable_lstm

$ python evaluate.py --config=configs/config-bert.json --data_dir=data/conll2003 --model_path=pytorch-model-bert.pt --bert_output_dir=bert-checkpoint --bert_disable_lstm
INFO:__main__:[token classification F1] : 0.9034525169111833, 3684
INFO:__main__:[Elapsed Time] : 3684 examples, 73171.78511619568ms, 19.837638783513157ms on average

$ cd data/conll2003; perl ../../etc/conlleval.pl < test.txt.pred ; cd ../..
accuracy:  98.14%; precision:  89.66%; recall:  91.04%; FB1:  90.35
geo47 commented 3 years ago

The result I shared was the training-set score of the Bi-LSTM-MHA-CRF network. The evaluation score was F1: 0.8977.

$ python preprocess.py --data_dir=data/conll2003 --bert_model_name_or_path model/BERT-Finetuned/cased_L-12_H-768_A-12 --bert_use_sub_label

$ python train.py --batch_size 64 --eval_batch_size 128 --config=configs/config-bert.json --data_dir=data/conll2003 --bert_model_name_or_path model/BERT-Finetuned/cased_L-12_H-768_A-12 --save_path=pytorch-model-bert-en.pt --bert_output_dir=bert-checkpoint-en --epoch=30 --bert_use_pos --use_char_cnn --use_mha --bert_use_feature_based --use_crf

$ python evaluate.py --config=configs/config-bert.json --data_dir=data/conll2003 --model_path=pytorch-model-bert-en.pt --bert_output_dir=bert-checkpoint-en --bert_use_pos --use_char_cnn --use_mha --bert_use_feature_based --use_crf

Also, based on your results, what would be the best parameters to optimize the model?

last_four_hidden_states = all_hidden_states[-4:]
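
For context on the line above: a minimal sketch (assuming a Hugging Face transformers BERT loaded with output_hidden_states=True; not ntagger's exact code) of gathering the last four hidden layers and concatenating them as token features:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased", output_hidden_states=True)

inputs = tokenizer("EU rejects German call", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

all_hidden_states = outputs.hidden_states                     # tuple: embedding output + one entry per encoder layer
last_four_hidden_states = all_hidden_states[-4:]              # the last four encoder layers
token_features = torch.cat(last_four_hidden_states, dim=-1)   # [batch_size, seq_size, 4 * hidden_size]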

Thanks.

geo47 commented 3 years ago

Hi @dsindex,

I have a question:

What is the advantage of freezing BERT for a few epochs during fine-tuning?

As far as I know, with transfer learning we freeze a few layers of the model while fine-tuning, rather than freezing the whole model for a few epochs. Could you please elaborate on this?

Thanks :-)

dsindex commented 3 years ago

@geo47

I think it is a kind of heuristic.

$ python train.py ... --bert_freezing_epoch=4 --bert_lr_during_freezing=1e-3

https://github.com/dsindex/ntagger/blob/master/train.py#L56

if args.bert_freezing_epoch > 0:
    # apply second optimizer/scheduler during freezing epochs
    if epoch_i < args.bert_freezing_epoch and optimizer_2nd != None and scheduler_2nd != None:
        optimizer = optimizer_2nd
        scheduler = scheduler_2nd
        freeze_bert = True

You may notice that the learning rate for GloVe+LSTM+CRF is generally larger than for BERT.

e.g., --lr=1e-4  vs --lr=1e-5

Then, how can we accelerate the learning curve for BERT+LSTM+CRF?

Based on my experiments, freezing the BERT layers for some epochs while still updating the other layers (LSTM, CRF) yields much better results.
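
For illustration, here is a minimal sketch of that idea (the toy model, the parameter split, and the optimizer choices are my assumptions, not ntagger's exact code): a second optimizer with a larger learning rate drives only the non-BERT layers during the freezing epochs, then the normal optimizer takes over.

import torch
import torch.nn as nn

# Toy stand-in for a BERT+LSTM+CRF tagger (placeholder modules, for illustration only).
class ToyTagger(nn.Module):
    def __init__(self):
        super().__init__()
        self.bert_model = nn.Linear(768, 768)               # placeholder for the BERT encoder
        self.lstm = nn.LSTM(768, 256, batch_first=True)
        self.classifier = nn.Linear(256, 9)

model = ToyTagger()
other_params = [p for n, p in model.named_parameters() if not n.startswith("bert_model")]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)    # small lr for everything, e.g. --lr=1e-5
optimizer_2nd = torch.optim.AdamW(other_params, lr=1e-3)      # larger lr, non-BERT layers only, e.g. --bert_lr_during_freezing=1e-3

bert_freezing_epoch = 4                                       # e.g. --bert_freezing_epoch=4
for epoch_i in range(30):
    freeze_bert = epoch_i < bert_freezing_epoch
    current_optimizer = optimizer_2nd if freeze_bert else optimizer
    # ... run one training epoch, calling current_optimizer.step() after each backward pass ...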

geo47 commented 3 years ago

But what about fine-tuning the BERT model itself, without the LSTM and CRF, and freezing a few layers while fine-tuning?

Something like that: link

Also, regarding the results reported in the BERT paper (92.4 F1 for fine-tuning BERT-base on the CoNLL-2003 NER task): I cannot reproduce them using this code. Have you been able to reproduce them?

[image: screenshot of the CoNLL-2003 NER results table from the BERT paper]

Finally, my last query is regarding CharCNN. How do we train the CharCNN model to get the char embeddings? Do you have any reference blog or documentation to understand this?

Thanks...

dsindex commented 3 years ago

@geo47

I'm not sure whether it would be better or worse to freeze some layers inside the BERT model (e.g. embeddings, encoder.layer[:5]) instead of freezing all BERT layers during fine-tuning.

You could try it by modifying:

# 1. Embedding        
if freeze_bert:
    with torch.no_grad():
        bert_embed_out, bert_outputs = self._compute_bert_embedding(x, head_mask=head_mask)
else:
    bert_embed_out, bert_outputs = self._compute_bert_embedding(x, head_mask=head_mask)
    # bert_embed_out : [batch_size, seq_size, *]

=>

if freeze_bert:
    modules = [self.bert_model.embeddings, self.bert_model.encoder.layer[:5]]  # replace 5 with however many layers you want to freeze
    for module in modules:
        for param in module.parameters():
            param.requires_grad = False
    ...
else:
    # reset requires_grad here
    ...
    bert_embed_out, bert_outputs = self._compute_bert_embedding(x, head_mask=head_mask)
    # bert_embed_out : [batch_size, seq_size, *]
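
For reference, a standalone sketch of the same idea (the helper name set_bottom_layers_trainable is hypothetical, not part of ntagger): toggle requires_grad on the embeddings and the first N encoder layers, then flip it back after the freezing epochs.

from transformers import AutoModel

def set_bottom_layers_trainable(bert_model, num_layers, trainable):
    # Hypothetical helper: (un)freeze the embeddings and the first `num_layers` encoder layers.
    modules = [bert_model.embeddings, *bert_model.encoder.layer[:num_layers]]
    for module in modules:
        for param in module.parameters():
            param.requires_grad = trainable

bert_model = AutoModel.from_pretrained("bert-base-cased")
set_bottom_layers_trainable(bert_model, num_layers=5, trainable=False)   # freeze before the early epochs
# ... train for a few epochs ...
set_bottom_layers_trainable(bert_model, num_layers=5, trainable=True)    # unfreeze afterwards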

Without document context, we are not able to reproduce the results of the original BERT paper (I think it is impossible): https://github.com/dsindex/ntagger/issues/4#issuecomment-810304253

Sorry, as for your question about CharCNN, could you explain it in more detail? The CharCNN layer is implemented via TextCNN by reshaping [batch_size, seq_length, char_length, char_dim] to [batch_size * seq_length, char_length, char_dim] (a sort of trick).
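
A minimal sketch of that reshape trick (the layer sizes are illustrative, not ntagger's exact configuration): each word's character sequence is treated as one short "sentence" for a Conv1d with max-over-time pooling, then the result is reshaped back to per-word features.

import torch
import torch.nn as nn

batch_size, seq_length, char_length, char_dim = 2, 10, 50, 25
num_filters, kernel_size = 30, 3

char_embed = torch.randn(batch_size, seq_length, char_length, char_dim)

x = char_embed.view(batch_size * seq_length, char_length, char_dim)   # [B*S, C, D]
x = x.transpose(1, 2)                                                 # [B*S, D, C] for Conv1d
conv = nn.Conv1d(char_dim, num_filters, kernel_size, padding=1)
x = torch.relu(conv(x))                                               # [B*S, F, C]
x = torch.max_pool1d(x, kernel_size=x.size(-1)).squeeze(-1)           # [B*S, F] max-over-time pooling
char_features = x.view(batch_size, seq_length, num_filters)           # [B, S, F] per-word char features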

geo47 commented 3 years ago

Hi @dsindex

I got the first point.

About the second point: what could be the possible reason we can't reproduce the same results? Is the implementation different in a way that affects the original BERT fine-tuning?

For the CharCNN implemented via a TextCNN layer, my question was: how do we get the character embedding? Does it train the character convolution during preprocess.py to get the character embedding, or do we not fine-tune it at all? This is confusing for me. As for the POS embedding, I could see that it gets an embedding for each unique tag from the dictionary and builds the embedding as a look-up table (one-hot).

Thanks..

dsindex commented 3 years ago

@geo47

As for the second point: in the first version of the paper, the authors did not mention the use of document context to get those results, which confused many people. The latest version, however, points out that they used the full document context, so we should use it for reproduction. As far as I know, no one has achieved that F1 score without the full document context. Even the LUKE paper says they use the full document context. You can check it through https://github.com/studio-ousia/luke

And for the last point: the character ids are generated using the ELMo library.

https://github.com/dsindex/ntagger/blob/master/util/util_bert.py#L390

# char extension       
if args.bert_use_subword_pooling:  
    c_ids = batch_to_ids([word])[0].detach().cpu().numpy().tolist()        
    char_ids.extend(c_ids)       
else:          
    c_ids = batch_to_ids([word_tokens])[0].detach().cpu().numpy().tolist()            
    char_ids.extend(c_ids)

Since the vocab size inside the ELMo lib is 256 (ASCII codes), we can convert every word into a character-id sequence without worrying about the language: [batch_size, sequence_length, character_length]. https://github.com/dsindex/ntagger/blob/master/dataset/dataset.py#L79
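
For a quick picture, here is a small usage sketch of allennlp's batch_to_ids (the exact shape depends on the allennlp version; the trailing 50 is its default max word length):

from allennlp.modules.elmo import batch_to_ids

sentences = [["EU", "rejects", "German", "call"]]     # list of tokenized sentences
char_ids = batch_to_ids(sentences)                    # LongTensor of ELMo character ids
print(char_ids.shape)                                 # e.g. torch.Size([1, 4, 50]) = [batch_size, sequence_length, character_length]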

And then we use a randomly initialized lookup embedding, just like the POS embeddings.

It has shape [character_vocab_size, character_dim]; see create_embedding_layer() without weights_matrix: https://github.com/dsindex/ntagger/blob/master/model/model.py#L30

The character embedding is then trained via back-propagation during the fine-tuning task.
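
To make that concrete, a minimal sketch (the sizes are hypothetical, not ntagger's config) of such a randomly initialized lookup embedding, which receives gradients like any other layer:

import torch
import torch.nn as nn

character_vocab_size, character_dim = 256, 25          # illustrative sizes
char_embedding = nn.Embedding(character_vocab_size, character_dim, padding_idx=0)

char_ids = torch.randint(0, character_vocab_size, (2, 10, 50))   # [batch_size, sequence_length, character_length]
char_embed_out = char_embedding(char_ids)                        # [batch_size, sequence_length, character_length, character_dim]
# char_embedding.weight.requires_grad is True, so it is updated by back-propagation during fine-tuning.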

geo47 commented 3 years ago

Oh thanks,

I got my answer: we fine-tune the character embedding via the TextCNN. "The character embedding is then trained via back-propagation during the fine-tuning task."