Open wells853 opened 3 years ago
Ah, I think I see the issue. In fairseq/tasks/sentence_prediction.py,
I changed build_model(self, args) to read as follows (similar to how it is in fairseq/tasks/fairseq_task.py):
```python
def build_model(self, args):
    from fairseq import models, quantization_utils

    model = models.build_model(args, self)
    model.register_classification_head(
        getattr(args, "classification_head_name", "sentence_classification_head"),
        num_classes=self.args.num_classes,
    )
    model = quantization_utils.quantize_model_scalar(model, args)
    return model
```
This seems to be working all right so far, but I'll keep you posted. Happy to open a PR later with this and some other changes (e.g. implementing 1- and 4-bit quantization).
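For anyone applying the same patch, a quick sanity check (a sketch using plain PyTorch only; `summarize_modules` is just a throwaway helper, not a fairseq API) is to list the distinct module classes in the built model and compare against a model built without `--quant-noise-scalar`; if `quantize_model_scalar` did its job, the two summaries should differ:

```python
# Rough sanity check (plain PyTorch, no fairseq-specific APIs assumed):
# after the patched build_model() returns, summarize the distinct module
# classes. If scalar quantization was applied, this summary should differ
# from that of a model built without --quant-noise-scalar.
from collections import Counter

def summarize_modules(model):
    counts = Counter(type(m).__name__ for m in model.modules())
    for name, n in sorted(counts.items()):
        print(f"{name}: {n}")

# Usage, e.g. in a debugger or a small script that sets up the task:
# model = task.build_model(args)
# summarize_modules(model)
```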
🐛 Bug
Scalar quantization does not seem to work on a pretrained RoBERTa model.
To Reproduce
Script to run without quantization
```bash
TOTAL_NUM_UPDATES=2036
WARMUP_UPDATES=122
LR=2e-05
NUM_CLASSES=2
MAX_SENTENCES=4
ROBERTA_PATH=roberta_base/model.pt
RTE_PATH=RTE-bin/
SAVE_DIR=checkpoint/roberta/rte-no-quant-noise
UPDATE_FREQ=4

python -m fairseq_cli.train $RTE_PATH \
    --restore-file $ROBERTA_PATH \
    --max-positions 512 \
    --batch-size $MAX_SENTENCES \
    --max-tokens 4400 \
    --task sentence_prediction \
    --reset-optimizer --reset-dataloader --reset-meters \
    --required-batch-size-multiple 1 \
    --init-token 0 --separator-token 2 \
    --arch roberta_base \
    --criterion sentence_prediction \
    --num-classes $NUM_CLASSES \
    --dropout 0.1 --attention-dropout 0.1 \
    --weight-decay 0.1 --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-06 \
    --clip-norm 0.0 \
    --lr-scheduler polynomial_decay --lr $LR --total-num-update $TOTAL_NUM_UPDATES --warmup-updates $WARMUP_UPDATES \
    --max-epoch 10 \
    --find-unused-parameters \
    --best-checkpoint-metric accuracy --maximize-best-checkpoint-metric \
    --ddp-backend legacy_ddp \
    --save-dir $SAVE_DIR \
    --update-freq $UPDATE_FREQ
```
Script to run with quantization
```bash
TOTAL_NUM_UPDATES=2036
WARMUP_UPDATES=122
LR=2e-05
NUM_CLASSES=2
MAX_SENTENCES=4
ROBERTA_PATH=roberta_base/model.pt
RTE_PATH=RTE-bin/
SAVE_DIR=checkpoint/roberta/rte-no-quant-noise
UPDATE_FREQ=4

python -m fairseq_cli.train $RTE_PATH \
    --restore-file $ROBERTA_PATH \
    --max-positions 512 \
    --batch-size $MAX_SENTENCES \
    --max-tokens 4400 \
    --task sentence_prediction \
    --reset-optimizer --reset-dataloader --reset-meters \
    --required-batch-size-multiple 1 \
    --init-token 0 --separator-token 2 \
    --arch roberta_base \
    --criterion sentence_prediction \
    --num-classes $NUM_CLASSES \
    --dropout 0.1 --attention-dropout 0.1 \
    --weight-decay 0.1 --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-06 \
    --clip-norm 0.0 \
    --lr-scheduler polynomial_decay --lr $LR --total-num-update $TOTAL_NUM_UPDATES --warmup-updates $WARMUP_UPDATES \
    --max-epoch 10 \
    --find-unused-parameters \
    --best-checkpoint-metric accuracy --maximize-best-checkpoint-metric \
    --ddp-backend legacy_ddp \
    --save-dir $SAVE_DIR \
    --update-freq $UPDATE_FREQ \
    --quant-noise-scalar 0.5
```
Note that these two scripts are identical except for the `--quant-noise-scalar` argument.
Code sample
Expected behavior
We would expect these two scripts to train differently. Instead, they train identically: the RoBERTa model in the second script is never actually quantized.
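To make the "train identically" claim concrete, one way to check (a sketch: the checkpoint paths below are placeholders for the two SAVE_DIRs, and it assumes the weights live under the usual `"model"` key of a fairseq checkpoint dict) is to compare the final weights of the two runs:

```python
# Sketch: compare the final weights of the two runs. The paths are
# placeholders for wherever the two SAVE_DIRs point; assumes the fairseq
# checkpoint dict stores its weights under the "model" key. torch.equal is
# an exact comparison; with nondeterministic kernels you may prefer a
# tolerance-based check instead.
import torch

ckpt_a = torch.load("checkpoint/roberta/rte-no-quant-noise/checkpoint_best.pt", map_location="cpu")
ckpt_b = torch.load("checkpoint/roberta/rte-quant-noise/checkpoint_best.pt", map_location="cpu")

identical = all(
    torch.equal(ckpt_a["model"][k], ckpt_b["model"][k]) for k in ckpt_a["model"]
)
print("identical weights:", identical)
```

Identical weights across the two runs would be consistent with the `--quant-noise-scalar` flag simply being ignored.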
Environment
How you installed fairseq (pip, source): source
Additional context
First time submitting an issue here so apologies if anything is incorrect. Thanks for any help!