NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

[TTS] Tacotron 2 Inference is not thread-safe #3797

Closed pskrunner14 closed 2 years ago

pskrunner14 commented 2 years ago

Describe the bug

Tacotron 2 TTS model inference is not thread-safe, despite the model being in eval mode: concurrent invocations write to the decoder's shared internal state, which results in various runtime errors (shared below). Even when no runtime error occurs, the resulting audio is noisy gibberish with almost no coherent speech.

[2022-03-05 12:38:40,296] ERROR in app: Exception on /api/synthesize/ [POST]
Traceback (most recent call last):
... 
[application specific traceback]
...
  File "./common/tts.py", line 22, in synthesize_text
    y, sr = tts(text, language_code)
  File "/root/.cache/pypoetry/virtualenvs/tts-service-Fv16d9lr-py3.7/lib/python3.7/site-packages/tts_middleware/core.py", line 24, in _tts
    y, sr = tts_function(raw_text, language_code)
  File "./common/inference.py", line 63, in predict
    mel = eval(f"inference_{self.text2mel_type}")(self, text)
  File "./common/infer/infer_nemo.py", line 4, in inference_nemo
    return obj.model["text2mel"].generate_spectrogram(tokens=parsed)
  File "/root/.cache/pypoetry/virtualenvs/tts-service-Fv16d9lr-py3.7/lib/python3.7/site-packages/nemo/core/classes/common.py", line 798, in __call__
    outputs = wrapped(*args, **kwargs)
  File "/root/.cache/pypoetry/virtualenvs/tts-service-Fv16d9lr-py3.7/lib/python3.7/site-packages/nemo/collections/tts/models/tacotron2.py", line 203, in generate_spectrogram
    tensors = self(tokens=tokens, token_len=token_len)
  File "/root/.cache/pypoetry/virtualenvs/tts-service-Fv16d9lr-py3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.cache/pypoetry/virtualenvs/tts-service-Fv16d9lr-py3.7/lib/python3.7/site-packages/nemo/core/classes/common.py", line 798, in __call__
    outputs = wrapped(*args, **kwargs)
  File "/root/.cache/pypoetry/virtualenvs/tts-service-Fv16d9lr-py3.7/lib/python3.7/site-packages/nemo/collections/tts/models/tacotron2.py", line 187, in forward
    memory=encoder_embedding, memory_lengths=token_len
  File "/root/.cache/pypoetry/virtualenvs/tts-service-Fv16d9lr-py3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.cache/pypoetry/virtualenvs/tts-service-Fv16d9lr-py3.7/lib/python3.7/site-packages/nemo/core/classes/common.py", line 798, in __call__
    outputs = wrapped(*args, **kwargs)
  File "/root/.cache/pypoetry/virtualenvs/tts-service-Fv16d9lr-py3.7/lib/python3.7/site-packages/nemo/collections/tts/modules/tacotron2.py", line 204, in forward
    return self.infer(**kwargs)
  File "/root/.cache/pypoetry/virtualenvs/tts-service-Fv16d9lr-py3.7/lib/python3.7/site-packages/nemo/collections/tts/modules/tacotron2.py", line 325, in infer
    mel_output, gate_output, alignment = self.decode(decoder_input)
  File "/root/.cache/pypoetry/virtualenvs/tts-service-Fv16d9lr-py3.7/lib/python3.7/site-packages/nemo/collections/tts/modules/tacotron2.py", line 265, in decode
    (self.attention_weights.unsqueeze(1), self.attention_weights_cum.unsqueeze(1)), dim=1,
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 49 but got size 53 for tensor number 1 in the list.

...
[similar errors for most concurrent requests]
...
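For context, here is a simplified illustration of the failure mode; this is not NeMo's actual decoder code, just a sketch of the shared-state pattern behind the error above. The decoder keeps per-utterance attention buffers as instance attributes, so concurrent inference calls interleave on them.

import torch

class StatefulDecoder:
    """Simplified stand-in (not NeMo's real module) for a decoder that keeps
    per-utterance state as instance attributes."""

    def start(self, memory):
        # Per-utterance buffers live on self, so every thread shares them.
        self.attention_weights = torch.zeros(1, memory.size(1))
        self.attention_weights_cum = torch.zeros(1, memory.size(1))

    def step(self):
        # The two reads below are not atomic: if another thread calls start() with a
        # different-length input in between, the sizes disagree and torch.cat raises
        # the "Sizes of tensors must match" RuntimeError shown in the traceback above.
        weights = self.attention_weights.unsqueeze(1)
        weights_cum = self.attention_weights_cum.unsqueeze(1)
        return torch.cat((weights, weights_cum), dim=1)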

Steps/Code to reproduce bug

  1. Load the Tacotron 2 model in any multi-threaded inference application (in my case, a Flask server app)
  2. Make multiple concurrent requests to the application serving the model, using sufficiently different text inputs (see the sketch below)
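A minimal repro sketch along these lines; only generate_spectrogram appears in the traceback above, while Tacotron2Model.from_pretrained, model.parse, and the checkpoint name "tts_en_tacotron2" are assumptions for illustration:

import threading

from nemo.collections.tts.models import Tacotron2Model

# Load a single shared model instance, as a typical server would.
model = Tacotron2Model.from_pretrained("tts_en_tacotron2").eval()

def synthesize(text):
    tokens = model.parse(text)
    # Concurrent calls race on the decoder's internal attention state.
    return model.generate_spectrogram(tokens=tokens)

texts = [
    "A short sentence.",
    "A much longer sentence that produces a different number of tokens.",
]
threads = [threading.Thread(target=synthesize, args=(t,)) for t in texts]
for t in threads:
    t.start()
for t in threads:
    t.join()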

Expected behavior

Thread-safe, read-only inference from the Tacotron 2 TTS model, producing sensible audio identical to the audio produced by single-threaded inference.

Environment overview (please complete the following information)

Additional context

Simple fix: the decoder's internal states should not be stored as shared instance variables that concurrent invocations of the inference function can write to.

Reference: https://discuss.pytorch.org/t/is-inference-thread-safe/88583
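In the meantime, a caller-side workaround sketch (my own suggestion, not an official fix; synthesize_text_safe is a hypothetical helper and model.parse is an assumption): serialize access to the shared model with a lock so only one request touches the decoder state at a time, at the cost of losing concurrency.

import threading

_infer_lock = threading.Lock()

def synthesize_text_safe(model, text):
    tokens = model.parse(text)
    # Only one thread runs inference at a time, so the decoder's
    # shared internal state is never mutated concurrently.
    with _infer_lock:
        return model.generate_spectrogram(tokens=tokens)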

okuchaiev commented 2 years ago

We'd recommend NVIDIA RIVA for TTS inference - https://developer.nvidia.com/riva

PyTorch + CUDA + threads aren't a very good combo. Try subprocesses instead if you have to.
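For reference, a rough sketch of that subprocess approach (an illustration only, not Riva; the checkpoint name "tts_en_tacotron2" and model.parse are assumptions): each worker process loads its own model instance, so no decoder state is ever shared.

from multiprocessing import get_context

_model = None

def _init_worker():
    # One model per process; the "spawn" start method avoids CUDA-after-fork problems.
    global _model
    from nemo.collections.tts.models import Tacotron2Model
    _model = Tacotron2Model.from_pretrained("tts_en_tacotron2").eval()

def _synthesize(text):
    import torch
    tokens = _model.parse(text)
    with torch.no_grad():
        mel = _model.generate_spectrogram(tokens=tokens)
    # Return a plain numpy array so it can be pickled back to the parent process.
    return mel.detach().cpu().numpy()

if __name__ == "__main__":
    ctx = get_context("spawn")
    with ctx.Pool(processes=2, initializer=_init_worker) as pool:
        mels = pool.map(_synthesize, ["First request.", "A second, longer request."])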