Code, models, and datasets for 《Neural Semi-supervised Learning for Text Classification Under Large-Scale Pretraining》.
The datasets and models are listed below.
[REVIEWS_PATH]. You can download the dataset HERE.
[VANILLA_ROBERTA_LARGE_PATH]. You can download the model HERE.
[PRETRAIN_MODELS]. We provide the following three models. You can download them HERE.
init-roberta-base: RoBERTa-base model (U) trained over 3.4M movie reviews from scratch.
semi-roberta-base: RoBERTa-base model (Large U + U) trained over 3.4M movie reviews, starting from the open-domain pretrained RoBERTa-base model.
semi-roberta-large: RoBERTa-large model (Large U + U) trained over 3.4M movie reviews, starting from the open-domain pretrained RoBERTa-large model.
[STUDENT_DATA_PATH]. You can download it HERE.
student_data_base: student training data generated by the roberta-base teacher model.
student_data_large: student training data generated by the roberta-large teacher model.
[IMDB_DATA_PATH]. For IMDB, the training data and test data are saved in two separate files; each line in a file corresponds to one IMDB sample. You can download it HERE.
[SHANNON_PREPROCESS_WHL_PATH]. You can download it HERE.
[CHECKPOINTS]. You can download them HERE.
roberta-base: teacher and student model checkpoints for roberta-base.
roberta-large: teacher and student model checkpoints for roberta-large.
pip install -r requirements.txt
pip install [SHANNON_PREPROCESS_WHL_PATH]
Use the RoBERTa model we pretrained over the 3.4M reviews to train the teacher model.
Our teacher model reached 96.2% accuracy on the test set.
cd sstc/tasks/semi_roberta
python trainer.py \
--mode train_teacher \
--roberta_path [PRETRAIN_MODELS]/semi-roberta-large \
--imdb_data_path [IMDB_DATA_PATH]/bin \
--gpus=0,1,2,3 \
--save_path [ROOT_SAVE_PATH] \
--precision 16 \
--batch_size 10 \
--min_epochs 10 \
--patience 3 \
--lr 3e-5
Use the RoBERTa model we pretrained over the 3.4M reviews to train the student model.
Our student model reached 96.8% accuracy on the test set.
cd sstc/tasks/semi_roberta
python trainer.py \
--mode train_student \
--roberta_path [PRETRAIN_MODELS]/semi-roberta-large \
--imdb_data_path [IMDB_DATA_PATH]/bin \
--student_data_path [STUDENT_DATA_PATH]/student_data_large/bin \
--save_path [ROOT_SAVE_PATH] \
--batch_size=10 \
--precision 16 \
--lr=2e-5 \
--warmup_steps 40000 \
--gpus=0,1,2,3,4,5,6,7 \
--accumulate_grad_batches=50
Load the student model checkpoint and evaluate it on the test set to reproduce our result.
cd sstc/tasks/semi_roberta
python evaluate.py \
--checkpoint_path [CHECKPOINTS]/roberta-large/train_student_checkpoint/***.ckpt \
--roberta_path [PRETRAIN_MODELS]/semi-roberta-large \
--imdb_data_path [IMDB_DATA_PATH]/bin \
--batch_size=10 \
--gpus=0,
Edit the paths in the shell script to match your own setup. The resulting binarized data will be saved in [REVIEWS_PATH]/bin.
cd sstc/tasks/roberta_lm
bash binarize.sh
cd sstc/tasks/roberta_lm
python trainer.py \
--roberta_path [VANILLA_ROBERTA_LARGE_PATH] \
--data_dir [REVIEWS_PATH]/bin \
--gpus=0,1,2,3 \
--save_path [PRETRAIN_ROBERTA_CK_PATH] \
--val_check_interval 0.1 \
--precision 16 \
--batch_size 10 \
--distributed_backend=ddp \
--accumulate_grad_batches=50 \
--adam_epsilon 1e-6 \
--weight_decay 0.01 \
--warmup_steps 10000 \
--workers 8 \
--lr 2e-5
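The flags above interact: with data-parallel training and gradient accumulation, the number of samples behind each optimizer step is larger than `--batch_size`. The sketch below works out that arithmetic, assuming `--batch_size` is the per-GPU batch size (an assumption; the source does not state this explicitly). The function name `effective_batch_size` is hypothetical and not part of the repository.

```python
def effective_batch_size(per_gpu_batch: int, num_gpus: int, accumulate_grad_batches: int) -> int:
    """Samples contributing to one optimizer step under data-parallel training.

    Assumes per_gpu_batch is the batch size seen by each GPU process.
    """
    return per_gpu_batch * num_gpus * accumulate_grad_batches


# Pretraining command: --batch_size 10, --gpus=0,1,2,3, --accumulate_grad_batches=50
pretrain = effective_batch_size(10, 4, 50)

# Student command: --batch_size=10, --gpus=0,1,2,3,4,5,6,7, --accumulate_grad_batches=50
student = effective_batch_size(10, 8, 50)

print(pretrain, student)  # 2000 4000
```

If you train on fewer GPUs, raising `--accumulate_grad_batches` proportionally should keep this product unchanged under the same assumption.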
Training checkpoints will be saved in [PRETRAIN_ROBERTA_CK_PATH].
Find the best checkpoint and convert it to the HuggingFace bin format; the relevant code can be found in sstc/tasks/roberta_lm/trainer.py.
Save the pretrained bin model at [PRETRAIN_MODELS]/semi-roberta-large, or just download the model we trained.
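The actual conversion code is in sstc/tasks/roberta_lm/trainer.py. As a rough illustration only, converting a training checkpoint to a HuggingFace-style weight file usually comes down to renaming keys in the saved state dict. The sketch below assumes the weights are stored with a wrapper prefix such as `model.` (a hypothetical prefix; inspect the real checkpoint keys first). The helper `strip_prefix` is not part of the repository.

```python
from typing import Any, Dict


def strip_prefix(state_dict: Dict[str, Any], prefix: str) -> Dict[str, Any]:
    """Keep only the keys that start with `prefix` and remove the prefix from them."""
    return {
        key[len(prefix):]: value
        for key, value in state_dict.items()
        if key.startswith(prefix)
    }


# Toy stand-in for checkpoint["state_dict"]; real values would be tensors.
toy_state_dict = {
    "model.roberta.embeddings.word_embeddings.weight": 1,
    "model.lm_head.bias": 2,
    "loss_fn.weight": 3,  # not part of the wrapped model, so it is dropped
}

converted = strip_prefix(toy_state_dict, "model.")
print(sorted(converted))
```

With PyTorch, the real state dict would typically be read with `torch.load(path, map_location="cpu")["state_dict"]` and the converted dict written with `torch.save` to a `pytorch_model.bin` file next to the model's `config.json`.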
cd sstc/tasks/semi_roberta/scripts
bash binarize_imdb.sh
You can run the code above to binarize the IMDB data, or just use the files we binarized in [IMDB_DATA_PATH]/bin.
cd sstc/tasks/semi_roberta
python trainer.py \
--mode train_teacher \
--roberta_path [PRETRAIN_MODELS]/semi-roberta-large \
--imdb_data_path [IMDB_DATA_PATH]/bin \
--gpus=0,1,2,3 \
--save_path [ROOT_SAVE_PATH] \
--precision 16 \
--batch_size 10 \
--min_epochs 10 \
--patience 3 \
--lr 3e-5
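The `--min_epochs 10` and `--patience 3` flags together control when training stops. The sketch below shows one common reading of such settings: training runs for at least `min_epochs`, then stops once validation accuracy has failed to improve for `patience` consecutive epochs. This is an assumption about the semantics, not a description of the repository's trainer; the function `stopping_epoch` and the scores are made up for illustration.

```python
from typing import List


def stopping_epoch(val_scores: List[float], min_epochs: int, patience: int) -> int:
    """Return the 1-based epoch at which training would stop.

    Stops once at least `min_epochs` have run and the score has not improved
    for `patience` consecutive epochs; otherwise runs through all scores.
    """
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best:
            best = score
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epoch >= min_epochs and epochs_without_improvement >= patience:
            return epoch
    return len(val_scores)


# Made-up validation accuracies that plateau after epoch 4.
scores = [0.90, 0.93, 0.95, 0.96, 0.96, 0.95, 0.96, 0.95, 0.96, 0.95, 0.96, 0.95]

print(stopping_epoch(scores, min_epochs=10, patience=3))  # 10
print(stopping_epoch(scores, min_epochs=1, patience=3))   # 7
```

In this toy run the patience condition is already met at epoch 7, but the minimum of 10 epochs delays the stop until epoch 10.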
After training, the teacher model checkpoint will be saved in [ROOT_SAVE_PATH]/train_teacher_checkpoint.
The teacher model we trained reached 96.2% accuracy on the test set.
The download link for the teacher model checkpoint can be found in the quick tour part.
Use the teacher model you trained in the previous step to label the 3.4M reviews.
Note that [ROOT_SAVE_PATH] should be the same as in the previous step.
The labeled data will be saved in [ROOT_SAVE_PATH]/predictions.
cd sstc/tasks/roberta_lm
python trainer.py \
--mode train_teacher \
--roberta_path [PRETRAIN_ROBERTA_PATH] \
--reviews_data_path [REVIEWS_PATH]/bin \
--best_teacher_checkpoint_path [CHECKPOINTS]/roberta-large/train_teacher_checkpoint/***.ckpt \
--gpus=0,1,2,3 \
--save_path [ROOT_SAVE_PATH]
First, we randomly sample 3M examples from the 3.4M reviews as U'.
Then we select the 1M examples from U' with the highest scores as D'.
Finally, we concatenate the IMDB training data (D) and D' to form the training data for the student model.
The student training data will be saved in [ROOT_SAVE_PATH]/student_data/train.txt, or you can use the data we provide in [STUDENT_DATA_PATH]/student_data_large.
cd sstc/tasks/roberta_lm
python data_selector.py \
--imdb_data_path [IMDB_DATA_PATH] \
--save_path [ROOT_SAVE_PATH]
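The selection procedure described above (sample U', keep the highest-scoring subset D', concatenate with D) can be sketched in a few lines. This is a minimal illustration, not the code in data_selector.py: the tuple layout, the function `select_student_data`, and the reading of "score" as the teacher's confidence in its predicted label are all assumptions.

```python
import random
from typing import List, Tuple

LabeledReview = Tuple[str, int, float]  # (text, teacher-predicted label, score)
TrainExample = Tuple[str, int]          # (text, label)


def select_student_data(
    labeled_reviews: List[LabeledReview],
    imdb_train: List[TrainExample],
    sample_size: int,
    top_k: int,
    seed: int = 0,
) -> List[TrainExample]:
    """Build student training data as D + D'."""
    rng = random.Random(seed)
    # U': a random subset of the teacher-labeled reviews.
    u_prime = rng.sample(labeled_reviews, min(sample_size, len(labeled_reviews)))
    # D': the top_k examples of U' with the highest scores.
    d_prime = sorted(u_prime, key=lambda ex: ex[2], reverse=True)[:top_k]
    # Concatenate the gold IMDB data D with the pseudo-labeled D'.
    return list(imdb_train) + [(text, label) for text, label, _ in d_prime]


# Toy data: ten pseudo-labeled reviews with scores 0.0 to 0.9, one gold example.
reviews = [(f"r{i}", i % 2, i / 10) for i in range(10)]
gold = [("gold review", 1)]

student_train = select_student_data(reviews, gold, sample_size=10, top_k=3)
print(student_train)
```

For the setting described in this section, `sample_size` would be 3,000,000 and `top_k` would be 1,000,000.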
You can use the same script as in 3.1 to binarize the student training data in [ROOT_SAVE_PATH]/student_data/train.txt.
You can use the training data we provide in [STUDENT_DATA_PATH]/student_data_large/bin, or your own training data in [ROOT_SAVE_PATH]/student_data/bin; make sure you set student_data_path accordingly.
cd sstc/tasks/semi_roberta
python trainer.py \
--mode train_student \
--roberta_path [PRETRAIN_MODELS]/semi-roberta-large \
--imdb_data_path [IMDB_DATA_PATH]/bin \
--student_data_path [STUDENT_DATA_PATH]/student_data_large/bin \
--save_path [ROOT_SAVE_PATH] \
--batch_size=10 \
--precision 16 \
--lr=2e-5 \
--warmup_steps 40000 \
--gpus=0,1,2,3,4,5,6,7 \
--accumulate_grad_batches=50
After training, the student model checkpoint will be saved in [ROOT_SAVE_PATH]/train_student_checkpoint.
The student model we trained reached 96.6% accuracy on the test set.
The download link for the student model checkpoint can be found in the quick tour part.