"--print-alignment" argument drastically slows down the generation

ladler0320 commented 4 years ago

🐛 Bug

Using the --print-alignment argument makes generation up to 3x slower (depending on the batch size). For example, generating translations for my test set took 47.9s (134.54 sentences/s, 2485.45 tokens/s) with the --print-alignment option and 20.5s (315.09 sentences/s, 5820.76 tokens/s) without it.
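For reference, the slowdown implied by the numbers above can be checked with a few lines of Python. This is only arithmetic on the reported figures; the variable and function names are made up for illustration.

```python
# Figures copied from the report above.
with_flag = {"seconds": 47.9, "sent_per_s": 134.54, "tok_per_s": 2485.45}
without_flag = {"seconds": 20.5, "sent_per_s": 315.09, "tok_per_s": 5820.76}


def ratio(larger, smaller):
    """Return how many times `larger` exceeds `smaller`."""
    return larger / smaller


# Wall-clock time grows with the flag, throughput shrinks,
# so the ratios are taken in opposite directions.
time_ratio = ratio(with_flag["seconds"], without_flag["seconds"])
sent_ratio = ratio(without_flag["sent_per_s"], with_flag["sent_per_s"])
tok_ratio = ratio(without_flag["tok_per_s"], with_flag["tok_per_s"])

print(f"time: {time_ratio:.2f}x slower")
print(f"sentences/s: {sent_ratio:.2f}x lower")
print(f"tokens/s: {tok_ratio:.2f}x lower")
```

All three ratios come out at roughly 2.3x for this particular run, which is consistent with the "up to 3x" figure being batch-size dependent.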

The issue does not occur on earlier fairseq versions (I use fs-0.8.0 from some October or November commit)

To Reproduce

Steps to reproduce the behavior (always include the command you ran):

1. Generate the translation without the --print-alignment option.
2. Generate the translation with the --print-alignment option.
3. Compare the performance.
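The comparison can be scripted roughly as follows. This is a minimal sketch, not the exact procedure used for the numbers above: `time_command` and `compare` are hypothetical helpers, and the base generation command is left as a placeholder because the original command was not included in the report.

```python
import subprocess
import time


def time_command(cmd):
    """Run `cmd` (a list of arguments) and return its wall-clock time in seconds."""
    start = time.perf_counter()
    # Output is discarded because only the timing matters here.
    subprocess.run(cmd, check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return time.perf_counter() - start


def compare(base_cmd, extra_flag="--print-alignment"):
    """Time `base_cmd` without and with `extra_flag` appended.

    Returns (seconds_without, seconds_with, slowdown_ratio).
    """
    t_without = time_command(base_cmd)
    t_with = time_command(base_cmd + [extra_flag])
    return t_without, t_with, t_with / t_without


# Usage (placeholder: substitute your real generation command and arguments):
#   base_cmd = ["fairseq-generate", ...]
#   t_without, t_with, slowdown = compare(base_cmd)
#   print(f"{t_without:.1f}s vs {t_with:.1f}s ({slowdown:.2f}x)")
```

Note that wall-clock time measured this way also includes model loading and start-up, so the ratio will understate the slowdown of the generation step itself; the sentences/s and tokens/s figures printed at the end of generation are a better basis for comparison.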

Expected behavior

The --print-alignment argument should not drastically slow down generation.

Environment

fairseq Version (e.g., 1.0 or master): master, last checked on 242269d
PyTorch Version (e.g., 1.0): 1.5.0 with CUDA 10.1
OS (e.g., Linux): Ubuntu 18.04
How you installed fairseq (pip, source): source
Build command you used (if compiling from source): pip install -e .
Python version: 3.7
CUDA/cuDNN version: 10.1
GPU models and configuration: GTX1080
Any other relevant information: the issue may be related to closed #2173

kalyangvs commented 4 years ago

Reason: the speed does not vary when generating on CPU (with or without the flag), hence the tensors were moved to CPU. The slowdown went unnoticed, possibly because I tried smaller batch sizes. Is this due to the time taken to move tensors from GPU to CPU? If so, the extract hard_alignments computation should be made GPU-friendly. @myleott

ladler0320 commented 4 years ago

@gvskalyan, thanks for the reply. The drop in generation speed on GPU is noticeable on commits both before and after the tensors were moved to CPU. Even on smaller batches, like 32, generation is ~1.7x slower with the --print-alignment option. However, you are right that there is no difference in generation speed with the --cpu flag.
