[Closed] geo47 closed this issue 2 years ago
https://github.com/dsindex/ntagger/blob/master/train.py#L148
Please do not save the BERT model and tokenizer for KoBERT; you can modify the code at the linked line to skip that step.
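A guard like the following would work (a minimal sketch only; the attribute names `model.bert_tokenizer` / `model.bert_model` and the flag check are illustrative, assuming the save around train.py#L148 goes through Hugging Face's `save_pretrained()`):

```python
# Hypothetical sketch of the change around train.py#L148: skip save_pretrained()
# for KoBERT. Attribute names below are illustrative, not the repo's exact code.
if 'kobert' not in args.bert_model_name_or_path.lower():
    model.bert_tokenizer.save_pretrained(args.bert_output_dir)  # tokenizer files
    model.bert_model.save_pretrained(args.bert_output_dir)      # weights + config
```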
Also, since the KMOU dataset has no document separator ('-DOCSTART-'), you can't use the document-context option.
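For reference, CoNLL-style corpora mark document boundaries with a separator line like this (example taken from the CoNLL-2003 English format); the document-context option needs these lines to know where one document ends and the next begins:

```
-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
```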
When you add the '--bert_use_word_embedding' option, you need GloVe/Word2Vec embeddings trained on a Korean corpus, and its tokens should be morphologically segmented like the KMOU dataset's tokens, e.g. '나는 학교에 간다' -> '나 는 학교 에 가 ㄴ다'.
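A minimal sketch of that pipeline, using KoNLPy's Mecab wrapper and gensim ('corpus.txt' and the output filename are placeholders; note that Mecab's segmentation, e.g. '간다', may not match KMOU's Sejong-style '가 ㄴ다', so the analyzer must be chosen to match the dataset's scheme):

```python
# Tokenize a raw Korean corpus morphologically, then train Word2Vec on it.
# Requires mecab-ko installed for KoNLPy's Mecab, and gensim >= 4.0.
from konlpy.tag import Mecab
from gensim.models import Word2Vec

mecab = Mecab()
with open('corpus.txt', encoding='utf-8') as f:
    sentences = [mecab.morphs(line.strip()) for line in f if line.strip()]

model = Word2Vec(sentences, vector_size=300, window=5, min_count=5)
model.wv.save_word2vec_format('ko_word2vec.txt')  # plain-text vectors for reuse
```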
Hi @dsindex
The first problem was resolved once I disabled saving the BERT model. However, when I use --bert_use_subword_pooling, it gives the following error:
INFO:__main__:{'emb_class': 'bert', 'enc_class': 'bilstm', 'n_ctx': 180, 'lowercase': True, 'token_emb_dim': 300, 'pad_token': '<pad>', 'pad_token_id': 0, 'unk_token': '<unk>', 'unk_token_id': 1, 'dsa_num_attentions': 4, 'dsa_dim': 300, 'dsa_r': 2, 'pos_emb_dim': 100, 'pad_pos': '<pad>', 'pad_pos_id': 0, 'char_n_ctx': 50, 'char_vocab_size': 262, 'char_padding_idx': 261, 'char_emb_dim': 25, 'char_num_filters': 30, 'char_kernel_sizes': [3, 9], 'dropout': 0.1, 'lstm_hidden_dim': 200, 'lstm_num_layers': 2, 'lstm_dropout': 0.0, 'mha_num_attentions': 8, 'pad_label': '<pad>', 'pad_label_id': 0, 'default_label': 'O', 'prev_context_size': 64, 'args': Namespace(adam_epsilon=1e-08, batch_size=16, bert_disable_lstm=False, bert_freezing_epoch=3, bert_lr_during_freezing=0.001, bert_model_name_or_path='monologg/kobert', bert_output_dir='bert-checkpoint', bert_remove_layers='', bert_use_doc_context=False, bert_use_feature_based=False, bert_use_mtl=False, bert_use_pos=False, bert_use_subword_pooling=True, bert_use_word_embedding=False, config='configs/config-bert.json', criterion='CrossEntropyLoss', data_dir='data/kmounlp', device='cuda', elmo_options_file='embeddings/elmo_2x4096_512_2048cnn_2xhighway_5.5B_options.json', elmo_weights_file='embeddings/elmo_2x4096_512_2048cnn_2xhighway_5.5B_weights.hdf5', embedding_filename='embedding.npy', embedding_trainable=False, epoch=30, eval_and_save_steps=500, eval_batch_size=64, glabel_filename='glabel.txt', gradient_accumulation_steps=1, hp_search_optuna=False, hp_trials=24, label_filename='label.txt', log_dir='runs', lr=1e-05, max_grad_norm=1.0, max_train_steps=None, num_warmup_steps=None, patience=7, pos_filename='pos.txt', restore_path='', save_path='pytorch-model-bert.pt', seed=42, use_char_cnn=False, use_crf=True, use_mha=False, use_ncrf=False, warmup_epoch=0, warmup_ratio=0.0, weight_decay=0.01)}
Traceback (most recent call last):
File "/media/erclab/Data021/Muzamil/projects/ntagger/ntagger-v3/train.py", line 703, in <module>
main()
File "/media/erclab/Data021/Muzamil/projects/ntagger/ntagger-v3/train.py", line 700, in main
train(args)
File "/media/erclab/Data021/Muzamil/projects/ntagger/ntagger-v3/train.py", line 506, in train
train_loader, valid_loader = prepare_datasets(config)
File "/media/erclab/Data021/Muzamil/projects/ntagger/ntagger-v3/train.py", line 352, in prepare_datasets
train_loader = prepare_dataset(config,
File "/media/erclab/Data021/Muzamil/projects/ntagger/ntagger-v3/dataset/dataset.py", line 15, in prepare_dataset
dataset = DatasetClass(config, filepath)
File "/media/erclab/Data021/Muzamil/projects/ntagger/ntagger-v3/dataset/dataset.py", line 83, in __init__
all_word2token_idx = torch.tensor([f.word2token_idx for f in features], dtype=torch.long)
File "/media/erclab/Data021/Muzamil/projects/ntagger/ntagger-v3/dataset/dataset.py", line 83, in <listcomp>
all_word2token_idx = torch.tensor([f.word2token_idx for f in features], dtype=torch.long)
AttributeError: 'InputFeature' object has no attribute 'word2token_idx'
Regarding the --bert_use_word_embedding option, how can I get Korean GloVe/Word2Vec embeddings?
Thanks.
@geo47
For Word2Vec or fastText, you can download pretrained Korean vectors from https://github.com/Kyubyong/wordvectors
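That repository ships its Korean model as a pickled gensim Word2Vec file; a sketch of loading it, based on that repo's documented usage (the filename 'ko.bin' comes from the repo, and old pickles may require an older gensim release):

```python
# Load the pretrained Korean Word2Vec model from Kyubyong/wordvectors.
from gensim.models import Word2Vec

model = Word2Vec.load('ko.bin')
print(model.wv.most_similar('학교'))           # sanity check: nearest neighbors
model.wv.save_word2vec_format('ko.vec.txt')    # export to plain text if needed
```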
Hi @dsindex
I am using the KoBERT model as mentioned here, with data from this repo; here is my data. The model gives the following error while saving the pre-trained model.
However, with the following settings it doesn't work at all.
Thanks.