facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

Problem with training transformer translation model with quant-noise-pq #2232

Closed jahutwb closed 2 years ago

jahutwb commented 4 years ago

What is your question?

I'm trying to train the IWSLT'14 German to English translation model (Transformer) from the examples, https://github.com/pytorch/fairseq/tree/master/examples/translation, with quant-noise-pq.

But it seems not to train at all. During training the loss only drops from 9.9 to 9.7, and every hypothesis the model returns looks like this:

example hypothesis: ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, or example hypothesis: so and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and

I run this training like that:

CUDA_VISIBLE_DEVICES=0 fairseq-train \
    data-bin/iwslt14.tokenized.de-en \
    --arch transformer_iwslt_de_en --share-decoder-input-output-embed \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --dropout 0.3 --weight-decay 0.0001 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 4096 \
    --eval-bleu \
    --eval-bleu-args '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}' \
    --eval-bleu-detok moses \
    --eval-bleu-remove-bpe \
    --eval-bleu-print-samples \
    --best-checkpoint-metric bleu --maximize-best-checkpoint-metric \
    --quant-noise-pq 0.1 --quant-noise-pq-block-size 8

What have you tried?

I've tried --quant-noise-pq-block-size 8, 4, 2, and 1, as well as --quant-noise-pq 0.01, and the problem is still the same.

What's your environment?

jahutwb commented 4 years ago

Although I have not yet managed to train this model with quant noise, I quantized a model trained without the noise. But I had trouble running the generate script with this quantized model: loading the state dict of the quantized model failed in checkpoint_utils.py, in load_model_ensemble_and_task.

I did a hack, but not a pretty one: I quantized the model built by task.build_model(args):

    # inside checkpoint_utils.load_model_ensemble_and_task, right after model = task.build_model(args)
    if args.quantization_config_path is not None:
        from fairseq import quantization_utils
        from fairseq.modules.quantization import pq

        quantizer = quantization_utils.Quantizer(
            config_path=args.quantization_config_path,
            max_epoch=6,   # ugly line: hard-coded number of quantization epochs
            max_update=args.max_update,
        )
        # re-run every iterative PQ step on the freshly built model so that its
        # state dict keys match the ones saved in the quantized checkpoint
        for step in range(len(quantizer.layers_to_quantize)):
            logger.info(
                'quantizing model (step={}; layers_to_quantize[step]={})'.format(
                    step, quantizer.layers_to_quantize[step]
                )
            )
            quantized_layers = pq.quantize_model_(
                model,
                pq.SizeTracker(model),
                quantizer.layers_to_quantize,
                quantizer.block_sizes_config,
                quantizer.n_centroids_config,
                step=step,
            )

It works, but it takes time to quantize the model, and I still have to do something about that max_epoch argument. Simply passing it as an argument to generate doesn't work, because generate doesn't accept such an argument.
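For reference, one way around the hard-coded max_epoch (only a sketch, assuming layers_to_quantize is a top-level list in the YAML file, as in examples/quant_noise/transformer_quantization_config.yaml) would be to read the config directly and derive the number of quantization steps from it:

    import yaml

    def count_quantization_steps(config_path):
        """Number of iterative PQ steps, i.e. entries in layers_to_quantize."""
        with open(config_path) as f:
            config = yaml.safe_load(f)
        return len(config["layers_to_quantize"])

    # then, instead of the hard-coded value:
    #   n_steps = count_quantization_steps(args.quantization_config_path)
    #   quantizer = quantization_utils.Quantizer(
    #       config_path=args.quantization_config_path,
    #       max_epoch=n_steps,  # just to satisfy the constructor; the loop above still drives the steps
    #       max_update=args.max_update,
    #   )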

huihuifan commented 4 years ago

Thanks for raising this issue, we are looking into it

reachtarunhere commented 4 years ago

I am facing a related problem. Unlike @jahutwb, however, I was able to train a working model with okay-ish BLEU while using the noise. My dataset is different, with 1.7M sentences, but I kept the same transformer parameters. I used a lower --quant-noise-pq 0.05 and the model seems to work. I still have not managed to quantize it after training with the noise, and I'm hitting an issue with load_state_dict. I will try @jahutwb's hack and see if I can make progress.

Also, are you facing the issue with load_checkpoint in checkpoint_utils.py too?

jahutwb commented 4 years ago

To quantize a trained model I ran additional training with the argument --quantization-config-path examples/quant_noise/transformer_quantization_config.yaml. I had to modify this config file a little. I changed the regular expressions in layers_to_quantize, because it had problems finding those layers: I replaced "\\" with "\", changed "decoder\\.embed_tokens\\.embeddings\\.[012]\\.[01]" to "decoder\.embed_tokens", and added similar expressions to quantize the encoder layers. It's also important to pass --max-epoch during this quantization training; max epoch has to be a multiple of the number of entries in layers_to_quantize.
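For anyone editing the regexes the same way, a rough sanity check (just a sketch, not part of fairseq; it assumes the model has already been built via task.build_model(args) and uses plain re.search rather than fairseq's exact matching logic) is to confirm that every pattern in layers_to_quantize matches at least one submodule name:

    import re

    def check_layers_to_quantize(model, layers_to_quantize):
        """Print how many submodules each layers_to_quantize pattern matches."""
        module_names = [name for name, _ in model.named_modules()]
        for pattern in layers_to_quantize:
            matched = [name for name in module_names if re.search(pattern, name)]
            print("{!r}: {} matching module(s)".format(pattern, len(matched)))
            if not matched:
                print("  -> no match, so this quantization step would touch nothing")

    # e.g. check_layers_to_quantize(model, quantizer.layers_to_quantize)

A pattern that matches nothing would explain a step that quantizes no layers, which is the problem I had with the original expressions.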

reachtarunhere commented 4 years ago

Traceback (most recent call last):
  File "/home/tarun/anaconda3/bin/fairseq-train", line 11, in <module>
    load_entry_point('fairseq', 'console_scripts', 'fairseq-train')()
  File "/home/tarun/fairseq/fairseq_cli/train.py", line 370, in cli_main
    nprocs=args.distributed_world_size,
  File "/home/tarun/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/tarun/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/tarun/fairseq/fairseq/trainer.py", line 238, in load_checkpoint
    state["model"], strict=True, args=self.args
  File "/home/tarun/fairseq/fairseq/models/fairseq_model.py", line 93, in load_state_dict
    return super().load_state_dict(new_state_dict, strict)
  File "/home/tarun/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 830, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for TransformerModel:
    Unexpected key(s) in state_dict: "encoder.quant_noise.weight", "decoder.quant_noise.weight".

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/tarun/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/tarun/fairseq/fairseq_cli/train.py", line 338, in distributed_main
    main(args, init_distributed=True)
  File "/home/tarun/fairseq/fairseq_cli/train.py", line 104, in main
    extra_state, epoch_itr = checkpoint_utils.load_checkpoint(args, trainer)
  File "/home/tarun/fairseq/fairseq/checkpoint_utils.py", line 137, in load_checkpoint
    reset_meters=args.reset_meters,
  File "/home/tarun/fairseq/fairseq/trainer.py", line 249, in load_checkpoint
    "please ensure that the architectures match.".format(filename)
Exception: Cannot load model parameters from checkpoint checkpoints/base-quant-model/checkpoint_best.pt; please ensure that the architectures match.

I have tried adding the original yaml file and it doesn't seem to work. Any hints or hacks? @jahutwb

jahutwb commented 4 years ago

I didn't have exactly the same error, because I didn't manage to successfully run training with quant noise, but I see what the problem is. Training with quant noise changes the model's state dict keys. I had a similar problem, because running training with a quantization config also changes these keys. When loading a checkpoint, fairseq first builds a new model with task.build_model(args) and then loads the state_dict, but that freshly built model has different keys than the model trained with quant noise (in your case) or the quantized model (in mine). I handled this, as I described above, by quantizing that freshly built model and then loading my quantized state dict. I suppose you could try something similar and apply to the freshly built model whatever train does during training with quant noise. But I'm not an expert, and I don't know whether my advice is good.
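As a starting point, a minimal diagnostic along these lines (plain PyTorch rather than any fairseq API; it assumes task and args are set up the same way load_checkpoint sets them up, and reuses the checkpoint path from your traceback) would show exactly which keys differ:

    import torch

    checkpoint = torch.load(
        "checkpoints/base-quant-model/checkpoint_best.pt", map_location="cpu"
    )
    state_dict = checkpoint["model"]

    model = task.build_model(args)  # the freshly built model (no quant noise applied yet)

    model_keys = set(model.state_dict().keys())
    ckpt_keys = set(state_dict.keys())
    print("unexpected in checkpoint:", sorted(ckpt_keys - model_keys))
    print("missing from checkpoint:", sorted(model_keys - ckpt_keys))

    # strict=False reports the mismatches instead of raising, which helps confirm
    # whether only the quant_noise / quantization keys are the problem
    result = model.load_state_dict(state_dict, strict=False)
    print(result.missing_keys, result.unexpected_keys)

Once you know which keys are extra, you can either make the freshly built model match the checkpoint (as I did by quantizing it) or filter those keys out before loading.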

Could you tell me how you ran your training with quant noise? What model do you train? Does your model output sensible hypotheses?

reachtarunhere commented 4 years ago

Thanks for your response. I am going to try, but since I am not an expert either I might need a bit more handholding. For training with quant noise I used 0.05 for the noise and 8 for the block size. I trained on my own dataset of 1.7M sentence pairs. All of my other parameters, including the architecture, are the same as yours. I trained for around 30 epochs, but the model was already outputting sensible hypotheses around the 10th epoch. My training command, other than the data folder, is the same as yours.

> I handled this, as I described above, by quantizing that freshly built model and then loading my quantized state dict.

I am not exactly clear what you mean by this. Did you write your own training loop to do the quantization, or did you make changes to trainer.py or checkpoint_utils.py? It would be nice if you could share the snippet as a gist.

My email is on my profile. If it would help, maybe we can try debugging the issues together over Zoom?

Thanks!

Update: I have managed to train with quant noise as well as quantize that model without severe loss in metrics. Will share the results and how I managed to do it shortly.

Shamdan17 commented 4 years ago

@reachtarunhere I am having a similar issue with the loss not decreasing and stuck at around 9-10 after 15 epochs. Without quantization noise, it drops down to less than 9 after the 2nd epoch. Can you please share your workaround for me and others who would like to use quantization?

Fairseq version: 0.9.0
Pytorch version: 1.4.0

Command used to train:

fairseq-train ~/data-bin/ \
    --arch transformer --optimizer adam --lr 0.0005 --max-tokens 6000 \
    --save-dir . --max-epoch 40 --save-interval 2 --update-freq 1 \
    --optimizer adam --adam-betas '(0.9, 0.98)' --warmup-updates 2000 --warmup-init-lr '1e-07' \
    --lr-scheduler inverse_sqrt --min-lr '1e-09' --task translation \
    --quant-noise-pq 0.1 --quant-noise-pq-block-size 8

Thanks!

lsawaniewski commented 3 years ago

@reachtarunhere are you able to share any details of the training procedure for quantizing a transformer without loss in metrics? My attempts seem to work, but the results are much weaker than the original model's.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!

stale[bot] commented 2 years ago

Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, please create a new issue with up-to-date information. Thank you!

AIikai commented 2 years ago

> @reachtarunhere: Update: I have managed to train with quant noise as well as quantize that model without severe loss in metrics. Will share the results and how I managed to do it shortly.

I also tried, but the results of the quant-noise model were 4 or 5 points worse than those of the model trained without it, and the quantized model lost another 2 points. Can you share your experience? Thanks.