dsindex / ntagger

reference pytorch code for named entity tagging

subword_pooling option doesn't work for KoBERT model #10

Closed geo47 closed 2 years ago

geo47 commented 3 years ago

Hi @dsindex

I am using the KoBERT model as mentioned here, with data from this repo. Here is my data. The model gives the following error while saving the pretrained model:

              precision    recall  f1-score   support

          DT     0.6934    0.6419    0.6667       148
          LC     0.7104    0.7065    0.7084       184
          OG     0.5283    0.4211    0.4686       133
          PS     0.8152    0.8869    0.8496       398
          TI     0.0000    0.0000    0.0000        29
           _     0.5823    0.4577    0.5125       201
       <pad>     0.0000    0.0000    0.0000         0

   micro avg     0.7035    0.6642    0.6833      1093
   macro avg     0.4757    0.4449    0.4580      1093
weighted avg     0.6817    0.6642    0.6702      1093

INFO:__main__:[Best model saved] : 4.553981304168701, 0.6832941176470588
Traceback (most recent call last):
  File "/media/erclab/Data021/Muzamil/projects/ntagger/ntagger-v3/train.py", line 703, in <module>
    main()
  File "/media/erclab/Data021/Muzamil/projects/ntagger/ntagger-v3/train.py", line 700, in main
    train(args)
  File "/media/erclab/Data021/Muzamil/projects/ntagger/ntagger-v3/train.py", line 540, in train
    eval_loss, eval_f1, best_eval_f1 = train_epoch(model, config, train_loader, valid_loader, epoch_i, best_eval_f1)
  File "/media/erclab/Data021/Muzamil/projects/ntagger/ntagger-v3/train.py", line 182, in train_epoch
    unwrapped_model.bert_tokenizer.save_pretrained(args.bert_output_dir)
  File "/media/erclab/Data021/Muzamil/projects/ntagger/ntagger-v3/venv/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1934, in save_pretrained
    save_files = self._save_pretrained(
  File "/media/erclab/Data021/Muzamil/projects/ntagger/ntagger-v3/venv/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1979, in _save_pretrained
    vocab_files = self.save_vocabulary(save_directory, filename_prefix=filename_prefix)
TypeError: save_vocabulary() got an unexpected keyword argument 'filename_prefix'

However, with the following options enabled it doesn't work at all:

--bert_use_doc_context
--bert_use_subword_pooling
--bert_use_word_embedding

Thanks.

dsindex commented 3 years ago

https://github.com/dsindex/ntagger/blob/master/train.py#L148

Please do not save the BERT model and tokenizer for KoBERT; you can modify the code there to skip it.
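
Roughly, the block there could be guarded so KoBERT skips the save (a minimal sketch, not the repo's exact code; `bert_tokenizer` comes from the traceback above, while `bert_model` is an assumed attribute name):

```python
# sketch only: skip save_pretrained() for KoBERT, whose tokenizer does not
# implement save_vocabulary(filename_prefix=...)
if 'kobert' not in args.bert_model_name_or_path.lower():
    unwrapped_model.bert_tokenizer.save_pretrained(args.bert_output_dir)
    unwrapped_model.bert_model.save_pretrained(args.bert_output_dir)  # assumed attribute name
```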

And since the KMOU dataset has no document separator '-DOCSTART-', you can't use the document context option.
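
For reference, CoNLL-style corpora mark document boundaries with a separator line like the one below (a generic sketch from CoNLL-2003, not KMOU data):

```
-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
```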

When you add the '--bert_use_word_embedding' option, you need GloVe/Word2Vec embeddings trained on a Korean corpus. Their tokens should be morpheme-level, like the KMOU dataset's tokens, e.g. '나는 학교에 간다' -> '나 는 학교 에 가 ㄴ다'.
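
For example, a rough sketch of morpheme-level tokenization with KoNLPy (assuming konlpy and mecab-ko are installed; any analyzer that produces KMOU-style morphemes will do, and the exact decomposition of forms like '간다' depends on the analyzer and dictionary):

```python
# rough sketch: split raw Korean text into morphemes before training embeddings
from konlpy.tag import Mecab

mecab = Mecab()
sentence = "나는 학교에 간다"
print(" ".join(mecab.morphs(sentence)))  # e.g. 나 는 학교 에 ... (analyzer-dependent)
```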

geo47 commented 3 years ago

Hi @dsindex

The first problem was resolved when I disabled saving the BERT model. However, when I use --bert_use_subword_pooling, it gives the following error:

INFO:__main__:{'emb_class': 'bert', 'enc_class': 'bilstm', 'n_ctx': 180, 'lowercase': True, 'token_emb_dim': 300, 'pad_token': '<pad>', 'pad_token_id': 0, 'unk_token': '<unk>', 'unk_token_id': 1, 'dsa_num_attentions': 4, 'dsa_dim': 300, 'dsa_r': 2, 'pos_emb_dim': 100, 'pad_pos': '<pad>', 'pad_pos_id': 0, 'char_n_ctx': 50, 'char_vocab_size': 262, 'char_padding_idx': 261, 'char_emb_dim': 25, 'char_num_filters': 30, 'char_kernel_sizes': [3, 9], 'dropout': 0.1, 'lstm_hidden_dim': 200, 'lstm_num_layers': 2, 'lstm_dropout': 0.0, 'mha_num_attentions': 8, 'pad_label': '<pad>', 'pad_label_id': 0, 'default_label': 'O', 'prev_context_size': 64, 'args': Namespace(adam_epsilon=1e-08, batch_size=16, bert_disable_lstm=False, bert_freezing_epoch=3, bert_lr_during_freezing=0.001, bert_model_name_or_path='monologg/kobert', bert_output_dir='bert-checkpoint', bert_remove_layers='', bert_use_doc_context=False, bert_use_feature_based=False, bert_use_mtl=False, bert_use_pos=False, bert_use_subword_pooling=True, bert_use_word_embedding=False, config='configs/config-bert.json', criterion='CrossEntropyLoss', data_dir='data/kmounlp', device='cuda', elmo_options_file='embeddings/elmo_2x4096_512_2048cnn_2xhighway_5.5B_options.json', elmo_weights_file='embeddings/elmo_2x4096_512_2048cnn_2xhighway_5.5B_weights.hdf5', embedding_filename='embedding.npy', embedding_trainable=False, epoch=30, eval_and_save_steps=500, eval_batch_size=64, glabel_filename='glabel.txt', gradient_accumulation_steps=1, hp_search_optuna=False, hp_trials=24, label_filename='label.txt', log_dir='runs', lr=1e-05, max_grad_norm=1.0, max_train_steps=None, num_warmup_steps=None, patience=7, pos_filename='pos.txt', restore_path='', save_path='pytorch-model-bert.pt', seed=42, use_char_cnn=False, use_crf=True, use_mha=False, use_ncrf=False, warmup_epoch=0, warmup_ratio=0.0, weight_decay=0.01)}
Traceback (most recent call last):
  File "/media/erclab/Data021/Muzamil/projects/ntagger/ntagger-v3/train.py", line 703, in <module>
    main()
  File "/media/erclab/Data021/Muzamil/projects/ntagger/ntagger-v3/train.py", line 700, in main
    train(args)
  File "/media/erclab/Data021/Muzamil/projects/ntagger/ntagger-v3/train.py", line 506, in train
    train_loader, valid_loader = prepare_datasets(config)
  File "/media/erclab/Data021/Muzamil/projects/ntagger/ntagger-v3/train.py", line 352, in prepare_datasets
    train_loader = prepare_dataset(config,
  File "/media/erclab/Data021/Muzamil/projects/ntagger/ntagger-v3/dataset/dataset.py", line 15, in prepare_dataset
    dataset = DatasetClass(config, filepath)
  File "/media/erclab/Data021/Muzamil/projects/ntagger/ntagger-v3/dataset/dataset.py", line 83, in __init__
    all_word2token_idx = torch.tensor([f.word2token_idx for f in features], dtype=torch.long)
  File "/media/erclab/Data021/Muzamil/projects/ntagger/ntagger-v3/dataset/dataset.py", line 83, in <listcomp>
    all_word2token_idx = torch.tensor([f.word2token_idx for f in features], dtype=torch.long)
AttributeError: 'InputFeature' object has no attribute 'word2token_idx'

About the --bert_use_word_embedding option, how can I get Korean GloVe/Word2Vec embeddings?

Thanks.

dsindex commented 3 years ago

@geo47

For Word2Vec or fastText, you can download pretrained Korean embeddings from https://github.com/Kyubyong/wordvectors
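
A hedged sketch of turning that download into a plain-text embedding file (ko.bin is a gensim Word2Vec model per that repo; gensim 3.x uses `model.wv.index2word` instead of `index_to_key`, and you should check ntagger's preprocessing docs for the exact file format it expects):

```python
# sketch only: dump the Kyubyong/wordvectors Korean model as "token v1 v2 ..." lines
from gensim.models import Word2Vec

model = Word2Vec.load("ko/ko.bin")  # path after unzipping the download
with open("kor.word2vec.txt", "w", encoding="utf-8") as out:
    for word in model.wv.index_to_key:  # use model.wv.index2word on gensim 3.x
        vec = " ".join(f"{x:.6f}" for x in model.wv[word])
        out.write(f"{word} {vec}\n")
```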