facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

Validation set and development set clarification for fine-tuning RoBERTa with Quant-Noise. #3369

Open sergiogcharles opened 3 years ago

sergiogcharles commented 3 years ago

### ❓ Questions and Help

#### What is your question?

If we fine-tune RoBERTa on RTE (or another GLUE dataset) and then perform inference at test time on the dev set, is it guaranteed that the dev set is withheld during Quant-Noise training? Furthermore, are the valid and dev sets the same? In `./fairseq_cli/train.py`, it seems that we load the valid subset (based on the latest checkpoint) with:

```python
# Load valid dataset (we load training data below, based on the latest checkpoint)
for valid_sub_split in cfg.dataset.valid_subset.split(","):
    task.load_dataset(valid_sub_split, combine=False, epoch=1)
```

and we run validation on each epoch of Quant-Noise training. However, upon closer inspection of the GLUE pre-processing script `./examples/roberta/preprocess_GLUE_tasks.sh`, on line 183 of the shell script we have:

```bash
awk '{print $1 / 5.0 }' "$TASK_DATA_FOLDER/processed/dev.label" > "$TASK-bin/label/valid.label"
```

which writes the dev-set labels out as the valid-set labels. Furthermore, the numbers of iterations over the valid and dev sets are the same. Therefore, we have reason to believe that the dev set and the valid set are the same. However, when we run the inference script after, for example, Quant-Noise fine-tuning on RTE with an `int 8` scheme, the best valid accuracy at validation time differs from the dev accuracy at inference time. We also verified that the model architectures are the same at inference time and validation time, i.e. we are evaluating the same quantized model. Likewise, neither the dev set nor the valid set is shuffled, i.e. in `./fairseq_cli/train.py` on line 404, we instantiate the data iterator with:

```python
# Initialize data iterator
itr = trainer.get_valid_iterator(subset).next_epoch_itr(
    shuffle=False, set_dataset_epoch=False  # use a fixed valid set
)
```

Given that the model architectures are the same at validation and test time, and that we believe valid and dev to be the same dataset (without any shuffling), it is curious that the dev and valid accuracies differ. It could be the case that the evaluation performed during training and the post-quantization evaluation script differ somehow.
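
To make this concrete, here is a minimal sketch (assuming the `checkpoints/`, `RTE-bin`, and `glue_data/RTE/dev.tsv` paths used in the code below) that compares the number of examples in the binarized valid split against the raw dev file:

```python
from fairseq.models.roberta import RobertaModel

# Load the fine-tuned model together with the binarized RTE data
# (same paths as the inference snippet below).
roberta = RobertaModel.from_pretrained(
    'checkpoints/',
    checkpoint_file='checkpoint_best.pt',
    data_name_or_path='RTE-bin'
)

# Size of the binarized valid split that fairseq_cli/train.py validates on.
roberta.task.load_dataset('valid')
num_valid = len(roberta.task.dataset('valid'))

# Number of examples in the raw GLUE dev file used at inference time.
with open('glue_data/RTE/dev.tsv') as fin:
    fin.readline()  # skip the header row
    num_dev = sum(1 for _ in fin)

print('valid examples:', num_valid, '| dev examples:', num_dev)
```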

#### Code

We fine-tune RoBERTa on RTE using Quant-Noise with `int 8`:

```bash
TOTAL_NUM_UPDATES=2036
WARMUP_UPDATES=122
LR=2e-05
NUM_CLASSES=2
MAX_SENTENCES=16
ROBERTA_PATH=roberta_base/model.pt
RTE_PATH=RTE-bin/
SAVE_DIR="checkpoint/roberta/rte-scalar-8-quant-noise"
echo "saving to $SAVE_DIR"

PYTHONPATH="~/Quant-Noisier/fairseq" python -m fairseq_cli.train $RTE_PATH \
    --restore-file $ROBERTA_PATH \
    --max-positions 512 \
    --batch-size $MAX_SENTENCES \
    --max-tokens 4400 \
    --task sentence_prediction \
    --reset-optimizer --reset-dataloader --reset-meters \
    --required-batch-size-multiple 1 \
    --init-token 0 --separator-token 2 \
    --arch roberta_base \
    --criterion sentence_prediction \
    --num-classes $NUM_CLASSES \
    --dropout 0.1 --attention-dropout 0.1 \
    --weight-decay 0.1 --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-06 \
    --clip-norm 0.0 \
    --lr-scheduler polynomial_decay --lr $LR --total-num-update $TOTAL_NUM_UPDATES --warmup-updates $WARMUP_UPDATES \
    --max-epoch 10 \
    --find-unused-parameters \
    --best-checkpoint-metric accuracy --maximize-best-checkpoint-metric \
    --ddp-backend legacy_ddp \
    --quant-noise-scalar 0.5 \
    --save-dir $SAVE_DIR \
    --bits 8 \
    --update-freq $UPDATE_FREQ
```

and then perform inference on the dev set:

```python
from fairseq.models.roberta import RobertaModel

roberta = RobertaModel.from_pretrained(
    'checkpoints/',
    checkpoint_file='checkpoint_best.pt',
    data_name_or_path='RTE-bin'
)

label_fn = lambda label: roberta.task.label_dictionary.string(
    [label + roberta.task.label_dictionary.nspecial]
)

ncorrect, nsamples = 0, 0
roberta.cuda()
roberta.eval()
with open('glue_data/RTE/dev.tsv') as fin:
    fin.readline()
    for index, line in enumerate(fin):
        tokens = line.strip().split('\t')
        sent1, sent2, target = tokens[1], tokens[2], tokens[3]
        tokens = roberta.encode(sent1, sent2)
        prediction = roberta.predict('sentence_classification_head', tokens).argmax().item()
        prediction_label = label_fn(prediction)
        ncorrect += int(prediction_label == target)
        nsamples += 1
print('| Accuracy: ', float(ncorrect) / float(nsamples))
```

#### What's your environment?

- fairseq Version (e.g., 1.0 or master):
- PyTorch Version: 1.7.1
- OS (e.g., Linux): Ubuntu 18.04
- How you installed fairseq (`pip`, source): pip
- Build command you used (if compiling from source): `python setup.py build develop`
- Python version: 3.7.9
- CUDA/cuDNN version: 10.1.243
- GPU models and configuration:
  - 1x K80 GPU
  - NC6 Azure Instance
  - 56 GiB RAM
  - 340 GiB temporary storage
  - 6 cores
- Any other relevant information:
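One additional check we could run (a minimal sketch; the first path below assumes `checkpoint_best.pt` was written to the `$SAVE_DIR` from the training command above, and the second is the directory the inference snippet actually loads from) is to confirm that both evaluations use the same checkpoint file and the same label mapping:

```python
import hashlib

from fairseq.models.roberta import RobertaModel


def md5(path, chunk_size=1 << 20):
    # Hash the file in chunks so large checkpoints need not fit in memory.
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()


# $SAVE_DIR from the training command vs. the path the inference snippet loads.
train_ckpt = 'checkpoint/roberta/rte-scalar-8-quant-noise/checkpoint_best.pt'
infer_ckpt = 'checkpoints/checkpoint_best.pt'
print('same checkpoint file:', md5(train_ckpt) == md5(infer_ckpt))

# Print the label mapping that label_fn applies at inference time.
roberta = RobertaModel.from_pretrained(
    'checkpoints/',
    checkpoint_file='checkpoint_best.pt',
    data_name_or_path='RTE-bin'
)
label_dict = roberta.task.label_dictionary
for i in range(2):  # NUM_CLASSES=2 for RTE
    print(i, '->', label_dict.string([i + label_dict.nspecial]))
```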
stale[bot] commented 3 years ago

This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!