facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

RoBERTa Scalar Quantization / Scalar Quant Noise #3295

Open wells853 opened 3 years ago

wells853 commented 3 years ago

🐛 Bug

Scalar Quantization does not seem to work on a pretrained RoBERTa model.

To Reproduce

Script to run without quantization

```bash
TOTAL_NUM_UPDATES=2036
WARMUP_UPDATES=122
LR=2e-05
NUM_CLASSES=2
MAX_SENTENCES=4
ROBERTA_PATH=roberta_base/model.pt
RTE_PATH=RTE-bin/
SAVE_DIR=checkpoint/roberta/rte-no-quant-noise
UPDATE_FREQ=4

python -m fairseq_cli.train $RTE_PATH \
    --restore-file $ROBERTA_PATH \
    --max-positions 512 \
    --batch-size $MAX_SENTENCES \
    --max-tokens 4400 \
    --task sentence_prediction \
    --reset-optimizer --reset-dataloader --reset-meters \
    --required-batch-size-multiple 1 \
    --init-token 0 --separator-token 2 \
    --arch roberta_base \
    --criterion sentence_prediction \
    --num-classes $NUM_CLASSES \
    --dropout 0.1 --attention-dropout 0.1 \
    --weight-decay 0.1 --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-06 \
    --clip-norm 0.0 \
    --lr-scheduler polynomial_decay --lr $LR --total-num-update $TOTAL_NUM_UPDATES --warmup-updates $WARMUP_UPDATES \
    --max-epoch 10 \
    --find-unused-parameters \
    --best-checkpoint-metric accuracy --maximize-best-checkpoint-metric \
    --ddp-backend legacy_ddp \
    --save-dir $SAVE_DIR \
    --update-freq $UPDATE_FREQ
```

Script to run with quantization

```bash
TOTAL_NUM_UPDATES=2036
WARMUP_UPDATES=122
LR=2e-05
NUM_CLASSES=2
MAX_SENTENCES=4
ROBERTA_PATH=roberta_base/model.pt
RTE_PATH=RTE-bin/
SAVE_DIR=checkpoint/roberta/rte-no-quant-noise
UPDATE_FREQ=4

python -m fairseq_cli.train $RTE_PATH \
    --restore-file $ROBERTA_PATH \
    --max-positions 512 \
    --batch-size $MAX_SENTENCES \
    --max-tokens 4400 \
    --task sentence_prediction \
    --reset-optimizer --reset-dataloader --reset-meters \
    --required-batch-size-multiple 1 \
    --init-token 0 --separator-token 2 \
    --arch roberta_base \
    --criterion sentence_prediction \
    --num-classes $NUM_CLASSES \
    --dropout 0.1 --attention-dropout 0.1 \
    --weight-decay 0.1 --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-06 \
    --clip-norm 0.0 \
    --lr-scheduler polynomial_decay --lr $LR --total-num-update $TOTAL_NUM_UPDATES --warmup-updates $WARMUP_UPDATES \
    --max-epoch 10 \
    --find-unused-parameters \
    --best-checkpoint-metric accuracy --maximize-best-checkpoint-metric \
    --ddp-backend legacy_ddp \
    --save-dir $SAVE_DIR \
    --update-freq $UPDATE_FREQ \
    --quant-noise-scalar 0.5
```

Note that these two scripts are identical except for the `--quant-noise-scalar` argument.
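For context, here is my own rough illustration of what the flag is supposed to do (not fairseq's implementation): with `--quant-noise-scalar 0.5`, roughly half of each weight tensor should be fake-quantized to 8 bits on each forward pass during training, along these lines:

```python
import torch

def scalar_quant_noise(weight: torch.Tensor, p: float = 0.5, bits: int = 8) -> torch.Tensor:
    """Illustrative sketch of scalar quant noise (not fairseq's exact code):
    fake-quantize a random fraction p of the weights to `bits` bits, keep the
    rest in full precision, and use a straight-through estimator for gradients.
    """
    # Simple affine (int8-style) quantization over the tensor's range.
    scale = (weight.max() - weight.min()).clamp(min=1e-8) / (2 ** bits - 1)
    zero_point = weight.min()
    quantized = torch.round((weight - zero_point) / scale) * scale + zero_point

    # Replace a random subset of the weights with their quantized values.
    mask = torch.rand_like(weight) < p
    noised = torch.where(mask, quantized, weight)

    # Straight-through estimator: the forward pass uses the noised weights,
    # the backward pass treats the quantization as the identity.
    return weight + (noised - weight).detach()

w = torch.randn(8, 8, requires_grad=True)
scalar_quant_noise(w, p=0.5).sum().backward()  # gradients reach the full-precision weights
```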

Code sample

Expected behavior

We would expect these two scripts to train differently, but instead they train identically: the RoBERTa model in the second script is never quantized.
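One quick way to confirm this is to diff the final checkpoints of the two runs. This is my own check, not part of the original report; the paths are placeholders, and the `"model"` key is an assumption about where fairseq stores the weights in a checkpoint:

```python
import torch

# Placeholder paths: the scripts above use the same SAVE_DIR for both runs,
# so the second run would need its own --save-dir for this comparison.
no_quant = torch.load("checkpoint/roberta/rte-no-quant-noise/checkpoint_last.pt", map_location="cpu")
quant = torch.load("checkpoint/roberta/rte-quant-noise/checkpoint_last.pt", map_location="cpu")

# fairseq checkpoints keep the model weights under the "model" key (assumption).
state_a, state_b = no_quant["model"], quant["model"]
assert state_a.keys() == state_b.keys()

# If scalar quant noise had taken effect, at least some tensors should differ.
all_equal = all(torch.equal(state_a[k], state_b[k]) for k in state_a)
print("checkpoints identical:", all_equal)
```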

Environment

Additional context

This is my first time submitting an issue here, so apologies if anything is incorrect. Thanks for any help!

wells853 commented 3 years ago

Ah, I think I see the issue. In `fairseq/tasks/sentence_prediction.py`, I changed `build_model(self, args)` to read as follows (similar to how it is done in `fairseq/tasks/fairseq_task.py`):

```python
def build_model(self, args):
    from fairseq import models, quantization_utils

    model = models.build_model(args, self)

    model.register_classification_head(
        getattr(args, "classification_head_name", "sentence_classification_head"),
        num_classes=self.args.num_classes,
    )
    # Added: apply scalar quantization / quant noise when --quant-noise-scalar is set
    model = quantization_utils.quantize_model_scalar(model, args)
    return model
```

This seems to be working all right so far, but I'll keep you posted. Happy to open a PR later with this and some other changes (e.g. implementing 1- and 4-bit quantization).
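For anyone who wants to sanity-check the patch, here is a minimal sketch; the toy model and the minimal `args` namespace are my assumptions, and only `quantization_utils.quantize_model_scalar` itself comes from the change above. It should show the set of module types changing once `args.quant_noise_scalar > 0`:

```python
import argparse
import torch.nn as nn
from fairseq import quantization_utils

# Hypothetical minimal args: quant_noise_scalar is the only field this sketch
# assumes quantize_model_scalar needs; a real run passes the full training args.
args = argparse.Namespace(quant_noise_scalar=0.5)

model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 4))
print(sorted({type(m).__name__ for m in model.modules()}))

model = quantization_utils.quantize_model_scalar(model, args)

# If scalar quantization is wired up, the standard nn.Linear layers should be
# replaced by (or wrapped in) quantized counterparts, so this set should change.
print(sorted({type(m).__name__ for m in model.modules()}))
```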