sergiogcharles opened this issue 3 years ago
## ❓ Questions and Help

#### What is your question?
If we fine-tune RoBERTa on RTE (or another GLUE dataset) and then perform inference at test time on the dev set, is it guaranteed that the dev set is withheld during Quant-Noise training? Furthermore, are the valid and dev sets the same? In `./fairseq_cli/train.py`, it seems like we load the valid subset based on the latest checkpoint (the comment reads `# Load valid dataset (we load training data below, based on the latest checkpoint)`), and on each epoch of Quant-Noise training we invoke `validate`. However, upon closer inspection of the GLUE pre-processing script `./examples/roberta/preprocess_GLUE_tasks.sh`, line 183 of the shell script reads:

```bash
awk '{print $1 / 5.0 }' "$TASK_DATA_FOLDER/processed/dev.label" > "$TASK-bin/label/valid.label"
```
i.e. the dev set labels are written out as the valid set labels. Furthermore, the number of iterations for the valid and dev sets is the same, so we have reason to believe that the dev set and valid set are the same. However, when we run the inference script after, for example, Quant-Noise fine-tuning on RTE with an `int 8` scheme, the best valid accuracy at validation time differs from the dev accuracy at inference time. We also verified that the model architectures are the same at inference time and at validation time, i.e. we are evaluating the same quantized model. Likewise, neither the dev set nor the valid set is shuffled: in `./fairseq_cli/train.py`, on line 404, we instantiate the data iterator under the comment `# Initialize data iterator`.

Given that the model architectures are the same at valid and test time, and we believe valid and dev to be the same dataset (without any shuffling), it is curious that the dev and valid accuracies differ. It could be that the during-training and post-quantization evaluation paths differ somehow.
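
To help narrow this down, here is one check we could run: a minimal sketch (assuming the `sentence_prediction` dataset exposes its inputs under the flattened key `'net_input.src_tokens'`; we have not verified this against the fairseq internals) that compares the tokens `roberta.encode` produces for each `dev.tsv` pair against the corresponding binarized `valid` example. Any mismatch (e.g. from truncation or BPE differences) could explain an accuracy gap between validation and inference:

```python
# Minimal sketch: compare hub-interface encodings of dev.tsv against the
# binarized "valid" split loaded through the same task.
# Assumption: dataset items expose inputs under 'net_input.src_tokens'.
import torch
from fairseq.models.roberta import RobertaModel

roberta = RobertaModel.from_pretrained(
    'checkpoints/',
    checkpoint_file='checkpoint_best.pt',
    data_name_or_path='RTE-bin'
)
roberta.eval()

roberta.task.load_dataset('valid')
valid = roberta.task.dataset('valid')

with open('glue_data/RTE/dev.tsv') as fin:
    fin.readline()  # skip header
    for index, line in enumerate(fin):
        cols = line.strip().split('\t')
        sent1, sent2 = cols[1], cols[2]
        hub_tokens = roberta.encode(sent1, sent2)
        bin_tokens = valid[index]['net_input.src_tokens']  # assumed key layout
        if not torch.equal(hub_tokens.cpu(), bin_tokens.cpu()):
            print(f'token mismatch at example {index}: '
                  f'{hub_tokens.numel()} vs {bin_tokens.numel()} tokens')
            break
    else:
        print('all dev.tsv encodings match the binarized valid split')
```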
#### Code
We fine-tune RoBERTa on RTE using Quant-Noise with `int 8`:

```bash
TOTAL_NUM_UPDATES=2036
WARMUP_UPDATES=122
LR=2e-05
NUM_CLASSES=2
MAX_SENTENCES=16
ROBERTA_PATH=roberta_base/model.pt
RTE_PATH=RTE-bin/
SAVE_DIR="checkpoint/roberta/rte-scalar-8-quant-noise"

echo "saving to $SAVE_DIR"

PYTHONPATH="~/Quant-Noisier/fairseq" python -m fairseq_cli.train $RTE_PATH \
    --restore-file $ROBERTA_PATH \
    --max-positions 512 \
    --batch-size $MAX_SENTENCES \
    --max-tokens 4400 \
    --task sentence_prediction \
    --reset-optimizer --reset-dataloader --reset-meters \
    --required-batch-size-multiple 1 \
    --init-token 0 --separator-token 2 \
    --arch roberta_base \
    --criterion sentence_prediction \
    --num-classes $NUM_CLASSES \
    --dropout 0.1 --attention-dropout 0.1 \
    --weight-decay 0.1 --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-06 \
    --clip-norm 0.0 \
    --lr-scheduler polynomial_decay --lr $LR --total-num-update $TOTAL_NUM_UPDATES --warmup-updates $WARMUP_UPDATES \
    --max-epoch 10 \
    --find-unused-parameters \
    --best-checkpoint-metric accuracy --maximize-best-checkpoint-metric \
    --ddp-backend legacy_ddp \
    --quant-noise-scalar 0.5 \
    --save-dir $SAVE_DIR \
    --bits 8 \
    --update-freq $UPDATE_FREQ
```

and then perform inference on the dev set:

```python
from fairseq.models.roberta import RobertaModel

roberta = RobertaModel.from_pretrained(
    'checkpoints/',
    checkpoint_file='checkpoint_best.pt',
    data_name_or_path='RTE-bin'
)

label_fn = lambda label: roberta.task.label_dictionary.string(
    [label + roberta.task.label_dictionary.nspecial]
)

ncorrect, nsamples = 0, 0
roberta.cuda()
roberta.eval()
with open('glue_data/RTE/dev.tsv') as fin:
    fin.readline()
    for index, line in enumerate(fin):
        tokens = line.strip().split('\t')
        sent1, sent2, target = tokens[1], tokens[2], tokens[3]
        tokens = roberta.encode(sent1, sent2)
        prediction = roberta.predict('sentence_classification_head', tokens).argmax().item()
        prediction_label = label_fn(prediction)
        ncorrect += int(prediction_label == target)
        nsamples += 1
print('| Accuracy: ', float(ncorrect) / float(nsamples))
```

(A sketch that scores the binarized `valid` split directly through the same hub interface is included after the environment details below.)

#### What's your environment?

- fairseq Version (e.g., 1.0 or master):
- PyTorch Version: 1.7.1
- OS (e.g., Linux): Ubuntu 18.04
- How you installed fairseq (`pip`, source): pip
- Build command you used (if compiling from source): python setup.py build develop
- Python version: 3.7.9
- CUDA/cuDNN version: 10.1.243
- GPU models and configuration:
  - 1x K80 GPU
  - NC6 Azure instance
  - 56 GiB RAM
  - 340 GiB temporary storage
  - 6 cores
- Any other relevant information:
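
For completeness, below is a minimal sketch (same caveats as above: the flattened `'net_input.src_tokens'` / `'target'` keys and the 0-based class indices are assumptions on our side, not something we have verified in fairseq) that scores the binarized `valid` split through the same hub interface, so its accuracy can be compared directly with the `dev.tsv` loop:

```python
# Sketch: score the binarized "valid" split with the same checkpoint, so the
# result can be compared directly against the dev.tsv loop above.
# Assumptions: items expose 'net_input.src_tokens' and 'target', and targets
# are 0-based class indices (as the sentence_prediction criterion expects).
import torch
from fairseq.models.roberta import RobertaModel

roberta = RobertaModel.from_pretrained(
    'checkpoints/',
    checkpoint_file='checkpoint_best.pt',
    data_name_or_path='RTE-bin'
)
roberta.cuda()
roberta.eval()

roberta.task.load_dataset('valid')
valid = roberta.task.dataset('valid')

ncorrect = 0
with torch.no_grad():
    for i in range(len(valid)):
        sample = valid[i]
        tokens = sample['net_input.src_tokens']   # assumed key layout
        target = int(sample['target'])            # assumed 0-based class index
        pred = roberta.predict('sentence_classification_head', tokens).argmax().item()
        ncorrect += int(pred == target)
print('| valid accuracy:', ncorrect / len(valid))
```

If this number matched the best valid accuracy reported during training while the `dev.tsv` loop did not, that would suggest the discrepancy lies in the encoding path at inference time rather than in the data split itself.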