To encode a text longer than 512 tokens (for example, 800 tokens), set max_pos to 800 during both preprocessing and training.
Some code is borrowed from ONMT (https://github.com/OpenNMT/OpenNMT-py).
Put all files into the raw_data directory.
We need Stanford CoreNLP to tokenize the data. Download it and unzip it, then add the following line to your bash_profile:
export CLASSPATH=/path/to/stanford-corenlp-4.2.2/stanford-corenlp-4.2.2.jar
replacing /path/to/ with the path to where you saved the stanford-corenlp-4.2.2 directory.
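Before running the tokenize step, it can help to confirm that CLASSPATH actually points at the jar. The following is a minimal sketch, not part of this repository; the helper names `find_corenlp_jars` and `check_classpath` are hypothetical, and the check only assumes the jar file name starts with `stanford-corenlp`, as in the export line above.

```python
import os
from pathlib import Path
from typing import List


def find_corenlp_jars(classpath: str) -> List[str]:
    """Return CLASSPATH entries whose file name looks like a CoreNLP jar."""
    entries = [e for e in classpath.split(os.pathsep) if e]
    return [
        e for e in entries
        if Path(e).name.startswith("stanford-corenlp") and e.endswith(".jar")
    ]


def check_classpath(classpath: str) -> List[str]:
    """Return a list of problems found; an empty list means nothing looked wrong."""
    jars = find_corenlp_jars(classpath)
    if not jars:
        return ["no stanford-corenlp jar found on CLASSPATH"]
    # Report every matching entry that does not exist on disk.
    return [f"jar does not exist: {j}" for j in jars if not Path(j).is_file()]


if __name__ == "__main__":
    problems = check_classpath(os.environ.get("CLASSPATH", ""))
    print("\n".join(problems) if problems else "CLASSPATH looks fine")
```

Run it in a new shell after editing bash_profile, so that the exported value is the one being checked.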
python preprocess.py -mode extract_pdf_sections -log_file ../logs/extract_section.log
python preprocess.py -mode get_text_clean_tika -log_file ../logs/extract_tika_text.log
python preprocess.py -mode tokenize -save_path ../temp -log_file ../logs/tokenize_by_corenlp.log
python preprocess.py -mode clean_paper_jsons -save_path ../json_data/ -n_cpus 10 -log_file ../logs/build_json.log
Create .pt files from source, sections and targets:
python preprocess.py -mode format_to_bert -raw_path ../json_data/ -save_path ../bert_data -lower -n_cpus 40 -log_file ../logs/build_bert_files.log
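The five preprocessing commands depend on each other's output, so they are usually run in order and stopped at the first failure. The sketch below is one way to do that; it is not part of this repository. The command strings are copied from this README, while `run_steps` and the `--run` switch are hypothetical names introduced for illustration.

```python
import shlex
import subprocess
import sys

# The preprocessing commands from this README, in the order they are listed.
STEPS = [
    "python preprocess.py -mode extract_pdf_sections -log_file ../logs/extract_section.log",
    "python preprocess.py -mode get_text_clean_tika -log_file ../logs/extract_tika_text.log",
    "python preprocess.py -mode tokenize -save_path ../temp -log_file ../logs/tokenize_by_corenlp.log",
    "python preprocess.py -mode clean_paper_jsons -save_path ../json_data/ -n_cpus 10 -log_file ../logs/build_json.log",
    "python preprocess.py -mode format_to_bert -raw_path ../json_data/ -save_path ../bert_data -lower -n_cpus 40 -log_file ../logs/build_bert_files.log",
]


def run_steps(steps, runner=subprocess.run):
    """Run each command in order and return how many completed successfully.

    Stops at the first command that exits with a non-zero return code.
    `runner` is injectable so the logic can be exercised without running anything.
    """
    done = 0
    for cmd in steps:
        result = runner(shlex.split(cmd))
        if result.returncode != 0:
            break
        done += 1
    return done


# Hypothetical switch: only execute the real commands when asked explicitly.
if __name__ == "__main__" and "--run" in sys.argv:
    completed = run_steps(STEPS)
    print(f"{completed} of {len(STEPS)} steps completed")
    sys.exit(0 if completed == len(STEPS) else 1)
```

The script is meant to be started from the same directory as preprocess.py, since the commands use relative paths.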
First run: the first time you train, use a single GPU so the code can download the BERT model. Use -visible_gpus -1; after the download finishes, you can kill the process and rerun the code with multiple GPUs.
python train.py -ext_dropout 0.1 -lr 2e-3 -visible_gpus 1,2,3 -report_every 200 -save_checkpoint_steps 1000 -batch_size 1 -train_steps 100000 -accum_count 2 -log_file ../logs/ext_bert -use_interval true -warmup_steps 10000
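The -lr 2e-3 value looks large for BERT fine-tuning, but it is probably a scale factor rather than the learning rate actually applied. The sketch below assumes this code keeps the "noam" warmup and decay schedule used by OpenNMT-py and similar code bases; that is an assumption, not something this README states, and `noam_lr` is a hypothetical helper name.

```python
def noam_lr(step: int, base_lr: float = 2e-3, warmup_steps: int = 10000) -> float:
    """Learning rate at a given step under a noam-style schedule (assumed).

    The rate grows linearly for `warmup_steps` steps, then decays
    proportionally to 1 / sqrt(step).
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    return base_lr * min(step ** -0.5, step * warmup_steps ** -1.5)


if __name__ == "__main__":
    # Print the rate at a few points of a 100000-step run.
    for s in (1000, 5000, 10000, 50000, 100000):
        print(s, noam_lr(s))
```

Under that assumption, the rate peaks at step 10000 with a value of 2e-3 / sqrt(10000) = 2e-5 and decays from there.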
To continue training from a checkpoint:
python train.py -ext_dropout 0.1 -lr 2e-3 -train_from ../models/model_step_99000.pt -visible_gpus 1,2,3 -report_every 200 -save_checkpoint_steps 1000 -batch_size 1 -train_steps 100000 -accum_count 2 -log_file ../logs/ext_bert -use_interval true -warmup_steps 10000
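With -save_checkpoint_steps 1000, the models directory fills up with many checkpoints, and the newest one is what -train_from normally needs. A minimal sketch for picking it follows; it assumes checkpoints are named model_step_N.pt as in the command above, and `latest_checkpoint` is a hypothetical helper, not part of this repository.

```python
import re
from pathlib import Path
from typing import Iterable, Optional

# Matches file names such as model_step_99000.pt and captures the step number.
_STEP_RE = re.compile(r"model_step_(\d+)\.pt$")


def latest_checkpoint(names: Iterable[str]) -> Optional[str]:
    """Return the name with the highest step number, or None if none match."""
    best_name = None
    best_step = -1
    for name in names:
        match = _STEP_RE.search(name)
        if match and int(match.group(1)) > best_step:
            best_step = int(match.group(1))
            best_name = name
    return best_name


if __name__ == "__main__":
    model_dir = Path("../models")
    if model_dir.is_dir():
        print(latest_checkpoint(p.name for p in model_dir.iterdir()))
```

Comparing the step numbers as integers matters here: a plain string sort would place model_step_9000.pt after model_step_10000.pt.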
python train.py -mode test -test_batch_size 1 -bert_data_path ../bert_data -log_file ../logs/ext_bert_test -test_from ../models/model_step_99000.pt -model_path ../models -sep_optim true -use_interval true -visible_gpus 1,2,3 -alpha 0.95 -result_path ../results/ext