Closed: xuuuluuu closed this issue 3 years ago
I have a plan to implement the --bert_use_subword_pooling, --bert_use_word_embedding, and --bert_use_doc_context options.
reference : https://github.com/dsindex/ntagger/issues/1#issuecomment-806583874
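For context, subword pooling means pooling the BERT subword vectors of each word back into a single word-level vector (e.g., mean or first-subword pooling). Below is a minimal PyTorch sketch of the idea; the variable names and the word-to-subword mapping are illustrative only, not ntagger's actual interface:

import torch

# suppose one sentence of 3 words, where the first word was split into
# 3 subwords occupying positions 1..3 (position 0 is [CLS])
subword_hidden = torch.randn(1, 6, 768)   # (batch, subword_len, hidden)
word2token = [[1, 2, 3], [4], [5]]        # subword positions per word

words = []
for positions in word2token:
    vecs = subword_hidden[0, positions]   # (num_subwords, hidden)
    words.append(vecs.mean(dim=0))        # mean pooling; first-subword
                                          # pooling would take vecs[0] instead
word_hidden = torch.stack(words)          # (num_words, hidden)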
@xuuuluuu
As I mentioned, I added the --bert_use_doc_context and --bert_use_subword_pooling options.
https://github.com/dsindex/ntagger/blob/master/util_bert.py
---------------------------------------------------------------------------
with --bert_use_doc_context:
with --bert_doc_context_option=1:
                     <---- prev example ----->< --- example ----><--------- next examples --------->
tokens:        [CLS] p1   p2   p3   p4   p5   x1   x2   x3   x4   n1   n2   n3   n4   m1   m2   m3   ...
token_idx:       0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   ...
input_ids:       x    x    x    x    x    x    x    x    x    x    x    x    x    x    x    x    x    ...
segment_ids:     0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    ...
input_mask:      1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    ...
doc2sent_idx:    0    6    7    8    9    0    0    0    0    0    0    0    0    0    0    0    0    ...
doc2sent_mask:   1    1    1    1    1    0    0    0    0    0    0    0    0    0    0    0    0    ...
with --bert_doc_context_option=2:
                     <---- prev examples ----><---- example ----><---- next examples ---->
input_ids, segment_ids, and input_mask are replaced with their document-level versions,
and doc2sent_idx is then used to slice input_ids, segment_ids, and input_mask back to sentence level.
---------------------------------------------------------------------------
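For intuition, here is a minimal PyTorch sketch of the slicing step described above: gather the document-level vectors that belong to the current sentence via doc2sent_idx, then zero out padding with doc2sent_mask. The tensor names follow the diagram but are illustrative, not necessarily the exact interface in util_bert.py:

import torch

batch_size, doc_len, hidden_dim = 2, 17, 768
sent_len = 17                              # doc2sent_idx is padded to the same length

hidden = torch.randn(batch_size, doc_len, hidden_dim)      # document-level BERT output
doc2sent_idx = torch.zeros(batch_size, sent_len, dtype=torch.long)
doc2sent_mask = torch.zeros(batch_size, sent_len)
# first example in the diagram: [CLS] at 0, current sentence tokens at 6..9
doc2sent_idx[0, :5] = torch.tensor([0, 6, 7, 8, 9])
doc2sent_mask[0, :5] = 1.0

# gather the document-level vectors belonging to the current sentence
idx = doc2sent_idx.unsqueeze(-1).expand(-1, -1, hidden_dim)
sent_hidden = hidden.gather(1, idx)                        # (batch, sent_len, hidden_dim)
sent_hidden = sent_hidden * doc2sent_mask.unsqueeze(-1)    # zero out padding positions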
# modify maximum sequence length for document context
$ vi configs/config-bert.json
"n_ctx": 512
# for Linear
## preprocessing
$ python preprocess.py --config=configs/config-bert.json --data_dir=data/conll2003 --bert_model_name_or_path=./embeddings/bert-base-cased --bert_use_doc_context --bert_use_subword_pooling --bert_doc_context_option=1
## train
$ python train.py --config=configs/config-bert.json --data_dir=data/conll2003 --save_path=pytorch-model-bert.pt --bert_model_name_or_path=./embeddings/bert-base-cased --bert_output_dir=bert-checkpoint --batch_size=16 --lr=2e-5 --epoch=30 --bert_use_doc_context --bert_use_subword_pooling --bert_disable_lstm
## evaluate
$ python evaluate.py --config=configs/config-bert.json --data_dir=data/conll2003 --model_path=pytorch-model-bert.pt --bert_output_dir=bert-checkpoint --bert_use_doc_context --bert_use_subword_pooling --bert_disable_lstm
$ cd data/conll2003; perl ../../etc/conlleval.pl < test.txt.pred ; cd ../..
# for BiLSTM-CRF + word embedding
## preprocessing
$ python preprocess.py --config=configs/config-bert.json --data_dir=data/conll2003 --bert_model_name_or_path=./embeddings/bert-base-cased --bert_use_doc_context --bert_use_subword_pooling --bert_use_word_embedding --bert_doc_context_option=1
## train
$ python train.py --config=configs/config-bert.json --data_dir=data/conll2003 --save_path=pytorch-model-bert.pt --bert_model_name_or_path=./embeddings/bert-base-cased --bert_output_dir=bert-checkpoint --batch_size=8 --lr=1e-5 --epoch=30 --bert_freezing_epoch=3 --bert_lr_during_freezing=1e-3 --use_crf --bert_use_doc_context --bert_use_subword_pooling --bert_use_word_embedding
## evaluate
$ python evaluate.py --config=configs/config-bert.json --data_dir=data/conll2003 --model_path=pytorch-model-bert.pt --bert_output_dir=bert-checkpoint --use_crf --bert_use_doc_context --bert_use_subword_pooling --bert_use_word_embedding
$ cd data/conll2003; perl ../../etc/conlleval.pl < test.txt.pred ; cd ../..
Additionally, with the --bert_use_word_embedding option, you can add GloVe word-embedding features to the BERT embeddings at word level, which is likely to give a better result.
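A rough sketch of that combination: concatenating a frozen GloVe vector with the pooled word-level BERT vector before feeding the BiLSTM-CRF. The dimensions, vocabulary size, and names below are assumptions for illustration only:

import torch
import torch.nn as nn

glove_dim, bert_dim, vocab_size = 300, 768, 10000   # toy vocab size
glove = nn.Embedding(vocab_size, glove_dim)         # weights would be loaded from GloVe
glove.weight.requires_grad = False                  # keep the GloVe features frozen

word_ids = torch.tensor([[10, 42, 7]])              # (batch, num_words)
bert_word_hidden = torch.randn(1, 3, bert_dim)      # pooled word-level BERT vectors

features = torch.cat([bert_word_hidden, glove(word_ids)], dim=-1)
# features: (batch, num_words, bert_dim + glove_dim) -> input to the BiLSTM-CRF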
Hi, thanks for the nice framework.
Do you plan to include document context, as the BERT paper did?