RunxinXu / ContrastivePruning

Source code for our AAAI'22 paper 《From Dense to Sparse: Contrastive Pruning for Better Pre-trained Language Model Compression》

90% sparsity QQP result #1

Open · kongds opened this issue 2 years ago

kongds commented 2 years ago

Hello, I see that the CAP-m result at 90% sparsity on QQP is "91.6/87.7", while CAP-soft is "90.7/87.4" (in bold). Is the CAP-m result correct?

[Screenshot: results table from the paper]

RunxinXu commented 2 years ago

Thanks for your interest in our work! The results are correct. I suppose this is because CAP yields a larger improvement with movement pruning than with soft movement pruning at 90% sparsity.

kongds commented 2 years ago

Thanks for your answer. Another concern is that the F1 (87.7) does not seem to match the accuracy (91.6) for CAP-m, which suggests that the false negatives (FN) and true negatives (TN) are heavily imbalanced compared with the other settings.
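One way to check this from the evaluation outputs (a rough sketch; it assumes the dev-set gold labels and the argmax predictions have been dumped to .npy files, which the provided scripts do not do by default):

# Sketch: compare accuracy/F1 and inspect the confusion matrix for FN/TN imbalance.
# The two .npy files below are hypothetical dumps of the QQP dev gold labels and
# the model's argmax predictions; masked_run_glue.py does not write them as-is.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

labels = np.load("qqp_dev_labels.npy")
preds = np.load("qqp_dev_preds.npy")

acc = accuracy_score(labels, preds)
f1 = f1_score(labels, preds)  # QQP F1 is computed over the positive (duplicate) class
tn, fp, fn, tp = confusion_matrix(labels, preds).ravel()
print(f"acc={acc:.4f}  f1={f1:.4f}")
print(f"TN={tn}  FP={fp}  FN={fn}  TP={tp}")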

kongds commented 2 years ago

Hello, I ran CAP-m with FINAL_TH=0.10 (90% sparsity) on QQP based on run_glue_topk_kd.sh, but got the following results (90.5/87.2).

07/14/2022 23:41:19 - INFO - __main__ -   ***** Eval results  *****
07/14/2022 23:41:19 - INFO - __main__ -     acc = 0.904699480583725
07/14/2022 23:41:19 - INFO - __main__ -     acc_and_f1 = 0.888130932998286
07/14/2022 23:41:19 - INFO - __main__ -     eval_avg_entropy = 1.0659542
07/14/2022 23:41:19 - INFO - __main__ -     f1 = 0.871562385412847
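
(For reference, acc_and_f1 in the GLUE metrics is simply the mean of the two: (0.9047 + 0.8716) / 2 ≈ 0.8881, which matches the logged value.)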

The command is:


OUTPUT=cap
TASK=qqp
DATA_DIR=../data/glue_data/QQP
MODEL=bert-base-uncased
BATCH=32
EPOCH=10
LR=3e-5

# pruning
METHOD=topK
MASK_LR=1e-2
WARMUP=11000
INITIAL_TH=1
FINAL_TH=0.10 # final ratio of remaining weights: 50% sparsity -> 0.5, 90% -> 0.1, 97% -> 0.03

# contrastive
CONTRASTIVE_TEMPERATURE=0.1
EXTRA_EXAMPLES=4096
ALIGNREP=cls
CL_UNSUPERVISED_LOSS_WEIGHT=0.1
CL_SUPERVISED_LOSS_WEIGHT=10

# distill
TEACHER_TYPE=bert
TEACHER_PATH=../teacher/qqp
CE_LOSS_WEIGHT=0.1
DISTILL_LOSS_WEIGHT=0.9
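
# Note: GPU (the device id used below) is assumed to be set elsewhere, e.g. GPU=0;
# it is not defined in this snippet.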

CUDA_VISIBLE_DEVICES=${GPU} python masked_run_glue.py \
    --output_dir ${OUTPUT}/${FINAL_TH}/${TASK} \
    --data_dir ${DATA_DIR} \
    --do_train --do_eval --do_lower_case \
    --model_type masked_bert \
    --model_name_or_path ${MODEL} \
    --per_gpu_train_batch_size ${BATCH} \
    --warmup_steps ${WARMUP} \
    --num_train_epochs ${EPOCH} \
    --learning_rate ${LR} --mask_scores_learning_rate ${MASK_LR} \
    --initial_threshold ${INITIAL_TH} --final_threshold ${FINAL_TH} \
    --initial_warmup 2 --final_warmup 3 \
    --pruning_method ${METHOD} --mask_init constant --mask_scale 0.0 \
    --task_name ${TASK} \
    --save_steps 30000 \
    --use_contrastive_loss \
    --contrastive_temperature ${CONTRASTIVE_TEMPERATURE} \
    --cl_unsupervised_loss_weight ${CL_UNSUPERVISED_LOSS_WEIGHT} \
    --cl_supervised_loss_weight ${CL_SUPERVISED_LOSS_WEIGHT} \
    --extra_examples ${EXTRA_EXAMPLES} \
    --alignrep ${ALIGNREP} \
    --use_distill \
    --teacher_name_or_path ${TEACHER_PATH} \
    --teacher_type ${TEACHER_TYPE} \
    --ce_loss_weight ${CE_LOSS_WEIGHT} \
    --distill_loss_weight ${DISTILL_LOSS_WEIGHT}

RunxinXu commented 2 years ago

Hi, the performance can also be affected by the teacher model. How well does your teacher model perform?
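
For example, a rough way to sanity-check the teacher on the QQP dev set (a sketch using the Hugging Face datasets/transformers APIs; ../teacher/qqp is the path from the script above and is assumed to hold a standard sequence-classification checkpoint):

# Sketch: evaluate a fine-tuned QQP teacher checkpoint on the GLUE QQP dev set.
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained("../teacher/qqp")
model = AutoModelForSequenceClassification.from_pretrained("../teacher/qqp").to(device).eval()

dev = load_dataset("glue", "qqp", split="validation")
correct = 0
with torch.no_grad():
    for i in range(0, len(dev), 128):
        batch = dev[i:i + 128]  # dict of lists for this slice
        enc = tok(batch["question1"], batch["question2"], truncation=True,
                  padding=True, max_length=128, return_tensors="pt").to(device)
        preds = model(**enc).logits.argmax(dim=-1).cpu()
        correct += (preds == torch.tensor(batch["label"])).sum().item()
print(f"teacher QQP dev accuracy: {correct / len(dev):.4f}")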

kongds commented 2 years ago

I used the checkpoint provided by DynaBERT. Its performance is 90.9 accuracy on QQP.