SpeechToText generate/interactive failed when source_dictionary exists

George0828Zhang commented 3 years ago

🐛 Bug

When SpeechTextJointToTextTask is used as the task (or when source_dictionary is added to SpeechToTextTask), fairseq-generate and fairseq-interactive both fail to produce a result. This is caused by the following lines in generate.py and interactive.py:

if src_dict is not None:
    src_str = src_dict.string(src_tokens, cfg.common_eval.post_process)

This check assumes that src_tokens holds token indices (rather than audio features) solely because task.source_dictionary is not None, which is wrong for speech input. The decision should instead depend on the dtype or dimensionality of src_tokens: for speech, src_tokens is a float tensor with 3 dimensions, while for text it is a long tensor with 2 dimensions. A possible fix:

if src_tokens.dtype == torch.long and src_tokens.dim() == 2:
    src_str = src_dict.string(src_tokens, cfg.common_eval.post_process)
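The proposed check can be tried outside fairseq with a small sketch. Everything here is an assumption made for illustration: `maybe_decode_source` and `StubDictionary` are hypothetical names, not fairseq APIs, and the stub only imitates the `string(tensor, post_process)` call shape seen above. The sketch also keeps the original `src_dict is not None` test and accepts 1-D tensors, on the assumption that a single unbatched sentence may be passed in.

```python
import torch


class StubDictionary:
    """Hypothetical stand-in for a source dictionary (not fairseq's class)."""

    def __init__(self, symbols):
        self.symbols = list(symbols)

    def string(self, tensor, post_process=None):
        # Join the symbols for each token index; post_process is ignored here.
        return " ".join(self.symbols[i] for i in tensor.flatten().tolist())


def maybe_decode_source(src_tokens, src_dict, post_process=None):
    """Decode src_tokens only when they look like token indices.

    Assumption: text input is a long tensor with 1 or 2 dimensions,
    while speech features are floating point.
    """
    is_text = src_tokens.dtype == torch.long and src_tokens.dim() in (1, 2)
    if src_dict is not None and is_text:
        return src_dict.string(src_tokens, post_process)
    return ""


stub = StubDictionary(["<pad>", "hello", "world"])
text_tokens = torch.tensor([1, 2])       # token indices
speech_features = torch.randn(5, 80)     # e.g. 5 frames of 80-dim features

text_result = maybe_decode_source(text_tokens, stub)
speech_result = maybe_decode_source(speech_features, stub)
no_dict_result = maybe_decode_source(text_tokens, None)
```

With this guard, float features are skipped instead of being handed to the dictionary, and the behaviour for text input is unchanged.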

To Reproduce

Steps to reproduce the behavior (always include the command you ran):

1. Train a model using SpeechTextJointToTextTask as the task (or a task that inherits from SpeechToTextTask and adds source_dictionary)
2. Run either fairseq-generate or fairseq-interactive
3. See the error

Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/george/utility/fairseq/fairseq_cli/interactive.py", line 316, in <module>
    cli_main()
  File "/home/george/utility/fairseq/fairseq_cli/interactive.py", line 312, in cli_main
    distributed_utils.call_main(convert_namespace_to_omegaconf(args), main)
  File "/home/george/utility/fairseq/fairseq/distributed/utils.py", line 369, in call_main
    main(cfg, **kwargs)
  File "/home/george/utility/fairseq/fairseq_cli/interactive.py", line 254, in main
    src_str = src_dict.string(src_tokens, cfg.common_eval.post_process)
  File "/home/george/utility/fairseq/fairseq/data/dictionary.py", line 110, in string
    sent = separator.join(
  File "/home/george/utility/fairseq/fairseq/data/dictionary.py", line 111, in <genexpr>
    token_string(i)
  File "/home/george/utility/fairseq/fairseq/data/dictionary.py", line 105, in token_string
    return self[i]
  File "/home/george/utility/fairseq/fairseq/data/dictionary.py", line 48, in __getitem__
    return self.symbols[idx]
TypeError: only integer tensors of a single element can be converted to an index
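
The failure at the bottom of the traceback can be reproduced without fairseq. The sketch below is an illustration under assumptions: `ToyDictionary` is a hypothetical stand-in that only copies the list-based lookup visible in the traceback (`self.symbols[idx]`), and the feature shape is arbitrary.

```python
import torch


class ToyDictionary:
    """Hypothetical stand-in that mimics list-based symbol lookup."""

    def __init__(self, symbols):
        self.symbols = list(symbols)

    def __getitem__(self, idx):
        # Indexing a Python list with a tensor only works for a
        # single-element integer tensor.
        return self.symbols[idx]

    def string(self, tensor):
        return " ".join(self[i] for i in tensor)


toy_dict = ToyDictionary(["<pad>", "hello", "world"])

# Text: each element is a 0-dim long tensor, so lookup succeeds.
decoded = toy_dict.string(torch.tensor([1, 2]))

# Speech: each row is a float vector, so lookup raises TypeError.
speech_features = torch.randn(5, 80)
try:
    toy_dict.string(speech_features)
    error = None
except TypeError as exc:
    error = exc
```

Here `decoded` is built normally from the integer tensor, while the float feature matrix triggers a `TypeError` at the list lookup, the same step that fails in `dictionary.py`.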

Expected behavior

fairseq-generate and fairseq-interactive should run to completion even when task.source_dictionary is not None. Whether src_tokens is decoded with the dictionary should depend on the dtype or dimensionality of src_tokens: for speech, src_tokens is a float tensor with 3 dimensions, while for text it is a long tensor with 2 dimensions.

holyma commented 3 years ago

Can you reproduce the result of this example (mustc en-de ST)?

George0828Zhang commented 3 years ago

> Can you reproduce the result of this example (mustc en-de ST)?

The example in the link uses the default SpeechToText, which sets source_dictionary=None, so this error would not occur. I specifically mentioned that this error only occurs when source_dictionary!=None.

holyma commented 3 years ago

I recently ran ST and ASR on mustc en-de, but I got terrible results. Have you run into this problem? My issue is #3897.

facebookresearch / fairseq