Open kongds opened 2 years ago

Hello, I find that the result of CAP-m at 90% sparsity on QQP is "91.6/87.7", while CAP-soft's is "90.7/87.4" (and bolded). Is the CAP-m result correct?
Thanks for your interest in our work! The results are correct. I suppose this is because CAP yields a larger improvement with movement pruning than with soft movement pruning at 90% sparsity.
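For readers skimming this thread: CAP-m builds on hard (topK) movement pruning, which keeps a fixed fraction of weights ranked by learned importance scores, while CAP-soft builds on soft movement pruning, which thresholds the scores globally and lets a regularizer determine the final sparsity. A minimal sketch of the two masking rules, with toy tensors and assumed names (not the repository's actual code):

```python
import torch

def topk_mask(scores: torch.Tensor, keep_fraction: float) -> torch.Tensor:
    # Hard (topK) movement pruning: keep exactly the top `keep_fraction`
    # of entries by importance score (FINAL_TH plays this role below).
    k = max(1, int(keep_fraction * scores.numel()))
    cutoff = torch.topk(scores.flatten(), k).values.min()
    return (scores >= cutoff).to(scores.dtype)

def soft_mask(scores: torch.Tensor, tau: float) -> torch.Tensor:
    # Soft movement pruning: keep any weight whose score clears a global
    # threshold; sparsity emerges from a regularizer on the scores rather
    # than being pinned to an exact fraction.
    return (scores > tau).to(scores.dtype)

scores = torch.randn(768, 768)         # toy importance scores
print(topk_mask(scores, 0.10).mean())  # ~0.10 of weights kept (90% sparsity)
```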
Thanks for your answer. Another concern is that the F1 (87.7) does not seem to match the accuracy (91.6) for CAP-m, which suggests the false negatives (FN) and true negatives (TN) are heavily imbalanced compared with the other settings.
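To make the concern concrete with hypothetical numbers (not the actual QQP confusion matrix): QQP is roughly 63% negative, so accuracy can stay high while F1 sags whenever the errors concentrate in the positive class.

```python
# Hypothetical confusion matrix over 1,000 question pairs at roughly
# QQP's class balance (~37% positive). Illustrative only.
tp, fn = 300, 68   # positives predicted right / wrong
tn, fp = 602, 30   # negatives predicted right / wrong

accuracy = (tp + tn) / (tp + tn + fp + fn)   # 0.902
f1 = 2 * tp / (2 * tp + fp + fn)             # ~0.860
print(f"acc={accuracy:.3f}  f1={f1:.3f}")
```

A ~90/86 split like this comes entirely from how the misclassifications divide between FN and FP, so a gap of this size between accuracy and F1 is possible on a negative-heavy dataset.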
Hello, I ran CAP-m with a final threshold of 0.10 on QQP based on run_glue_topk_kd.sh, but got the following results (90.5/87.2).
```
07/14/2022 23:41:19 - INFO - __main__ - ***** Eval results *****
07/14/2022 23:41:19 - INFO - __main__ - acc = 0.904699480583725
07/14/2022 23:41:19 - INFO - __main__ - acc_and_f1 = 0.888130932998286
07/14/2022 23:41:19 - INFO - __main__ - eval_avg_entropy = 1.0659542
07/14/2022 23:41:19 - INFO - __main__ - f1 = 0.871562385412847
```
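As a sanity check on the log: acc_and_f1 in the GLUE metrics is the unweighted mean of the two, which these numbers satisfy exactly.

```python
acc = 0.904699480583725
f1 = 0.871562385412847
print((acc + f1) / 2)  # 0.888130932998286, matching acc_and_f1 above
```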
The command is:
```shell
GPU=0  # GPU index; CUDA_VISIBLE_DEVICES below expects this to be set
OUTPUT=cap
TASK=qqp
DATA_DIR=../data/glue_data/QQP
MODEL=bert-base-uncased
BATCH=32
EPOCH=10
LR=3e-5
# pruning
METHOD=topK
MASK_LR=1e-2
WARMUP=11000
INITIAL_TH=1
FINAL_TH=0.10 # kept-weight fraction = 1 - sparsity: 50% -> 0.5, 90% -> 0.1, 97% -> 0.03
# contrastive
CONTRASTIVE_TEMPERATURE=0.1
EXTRA_EXAMPLES=4096
ALIGNREP=cls
CL_UNSUPERVISED_LOSS_WEIGHT=0.1
CL_SUPERVISED_LOSS_WEIGHT=10
# distill
TEACHER_TYPE=bert
TEACHER_PATH=../teacher/qqp
CE_LOSS_WEIGHT=0.1
DISTILL_LOSS_WEIGHT=0.9
CUDA_VISIBLE_DEVICES=${GPU} python masked_run_glue.py \
--output_dir ${OUTPUT}/${FINAL_TH}/${TASK} \
--data_dir ${DATA_DIR} \
--do_train --do_eval --do_lower_case \
--model_type masked_bert \
--model_name_or_path ${MODEL} \
--per_gpu_train_batch_size ${BATCH} \
--warmup_steps ${WARMUP} \
--num_train_epochs ${EPOCH} \
--learning_rate ${LR} --mask_scores_learning_rate ${MASK_LR} \
--initial_threshold ${INITIAL_TH} --final_threshold ${FINAL_TH} \
--initial_warmup 2 --final_warmup 3 \
--pruning_method ${METHOD} --mask_init constant --mask_scale 0.0 \
--task_name ${TASK} \
--save_steps 30000 \
--use_contrastive_loss \
--contrastive_temperature ${CONTRASTIVE_TEMPERATURE} \
--cl_unsupervised_loss_weight ${CL_UNSUPERVISED_LOSS_WEIGHT} \
--cl_supervised_loss_weight ${CL_SUPERVISED_LOSS_WEIGHT} \
--extra_examples ${EXTRA_EXAMPLES} \
--alignrep ${ALIGNREP} \
--use_distill \
--teacher_name_or_path ${TEACHER_PATH} \
--teacher_type ${TEACHER_TYPE} \
--ce_loss_weight ${CE_LOSS_WEIGHT} \
  --distill_loss_weight ${DISTILL_LOSS_WEIGHT}
```
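For context on how INITIAL_TH, FINAL_TH, and the warmup flags interact: the movement-pruning training loop anneals the kept-weight fraction from initial_threshold to final_threshold with a cubic schedule, holding it flat during the initial and final warmup phases. A sketch following the movement-pruning reference implementation (the exact form in this fork may differ):

```python
def schedule_threshold(step, total_steps, warmup_steps,
                       initial_threshold=1.0, final_threshold=0.10,
                       initial_warmup=2, final_warmup=3):
    """Kept-weight fraction at a given step; 0.10 here means 90% sparsity."""
    if step <= initial_warmup * warmup_steps:
        return initial_threshold                # hold: no pruning yet
    if step > total_steps - final_warmup * warmup_steps:
        return final_threshold                  # hold: target sparsity reached
    ramp_start = initial_warmup * warmup_steps
    ramp_span = total_steps - (initial_warmup + final_warmup) * warmup_steps
    coeff = 1 - (step - ramp_start) / ramp_span
    # Cubic interpolation: prune quickly at first, then taper off.
    return final_threshold + (initial_threshold - final_threshold) * coeff**3
```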
Hi, the performance can also be affected by the teacher model. How is the performance of your teacher model?