Describe the bug
Tacotron 2 TTS model inference is not thread-safe despite the model being in eval mode: concurrent calls write to shared decoder internal state, which causes various runtime errors (shared below). Even when no runtime error occurs, the resulting audio is gibberish and noisy, with almost no coherent speech.
[2022-03-05 12:38:40,296] ERROR in app: Exception on /api/synthesize/ [POST]
Traceback (most recent call last):
...
[application specific traceback]
...
File "./common/tts.py", line 22, in synthesize_text
y, sr = tts(text, language_code)
File "/root/.cache/pypoetry/virtualenvs/tts-service-Fv16d9lr-py3.7/lib/python3.7/site-packages/tts_middleware/core.py", line 24, in _tts
y, sr = tts_function(raw_text, language_code)
File "./common/inference.py", line 63, in predict
mel = eval(f"inference_{self.text2mel_type}")(self, text)
File "./common/infer/infer_nemo.py", line 4, in inference_nemo
return obj.model["text2mel"].generate_spectrogram(tokens=parsed)
File "/root/.cache/pypoetry/virtualenvs/tts-service-Fv16d9lr-py3.7/lib/python3.7/site-packages/nemo/core/classes/common.py", line 798, in __call__
outputs = wrapped(*args, **kwargs)
File "/root/.cache/pypoetry/virtualenvs/tts-service-Fv16d9lr-py3.7/lib/python3.7/site-packages/nemo/collections/tts/models/tacotron2.py", line 203, in generate_spectrogram
tensors = self(tokens=tokens, token_len=token_len)
File "/root/.cache/pypoetry/virtualenvs/tts-service-Fv16d9lr-py3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/root/.cache/pypoetry/virtualenvs/tts-service-Fv16d9lr-py3.7/lib/python3.7/site-packages/nemo/core/classes/common.py", line 798, in __call__
outputs = wrapped(*args, **kwargs)
File "/root/.cache/pypoetry/virtualenvs/tts-service-Fv16d9lr-py3.7/lib/python3.7/site-packages/nemo/collections/tts/models/tacotron2.py", line 187, in forward
memory=encoder_embedding, memory_lengths=token_len
File "/root/.cache/pypoetry/virtualenvs/tts-service-Fv16d9lr-py3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/root/.cache/pypoetry/virtualenvs/tts-service-Fv16d9lr-py3.7/lib/python3.7/site-packages/nemo/core/classes/common.py", line 798, in __call__
outputs = wrapped(*args, **kwargs)
File "/root/.cache/pypoetry/virtualenvs/tts-service-Fv16d9lr-py3.7/lib/python3.7/site-packages/nemo/collections/tts/modules/tacotron2.py", line 204, in forward
return self.infer(**kwargs)
File "/root/.cache/pypoetry/virtualenvs/tts-service-Fv16d9lr-py3.7/lib/python3.7/site-packages/nemo/collections/tts/modules/tacotron2.py", line 325, in infer
mel_output, gate_output, alignment = self.decode(decoder_input)
File "/root/.cache/pypoetry/virtualenvs/tts-service-Fv16d9lr-py3.7/lib/python3.7/site-packages/nemo/collections/tts/modules/tacotron2.py", line 265, in decode
(self.attention_weights.unsqueeze(1), self.attention_weights_cum.unsqueeze(1)), dim=1,
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 49 but got size 53 for tensor number 1 in the list.
...
[similar errors for most concurrent requests]
...
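The failure mode can be reproduced in miniature without NeMo at all. The toy decoder below (hypothetical code, not the library's) keeps its attention state as an instance attribute, the same pattern as the `self.attention_weights` in the traceback, so two threads decoding different-length inputs clobber each other's state:

```python
import threading

class ToyDecoder:
    """Not NeMo code: a minimal stand-in for a decoder that stores its
    attention state on `self`, the pattern that breaks under threads."""

    def __init__(self):
        self.attention_weights = None  # shared by every caller

    def infer(self, length):
        # Each call re-initializes the shared state for its own input...
        self.attention_weights = [0.0] * length
        seen = set()
        for _ in range(10000):
            # ...but another thread may swap in a different-length list
            # at any point during the decode loop.
            seen.add(len(self.attention_weights))
        return seen

decoder = ToyDecoder()
results = {}

def worker(name, length):
    results[name] = decoder.infer(length)

t1 = threading.Thread(target=worker, args=("short", 49))
t2 = threading.Thread(target=worker, args=("long", 53))
t1.start(); t2.start()
t1.join(); t2.join()
# Under contention, results["short"] may contain 53 as well as 49 --
# the same kind of shape mismatch as the RuntimeError above.
```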
Steps/Code to reproduce bug
1. Load the Tacotron2 model in any multi-threaded inference application (in my case, a Flask server app).
2. Make multiple concurrent requests to the application serving the model, with sufficiently different text inputs.
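The steps above can be sketched as follows (hypothetical names: `synthesize_text` stands in for whatever function wraps the Tacotron2 call inside the server):

```python
from concurrent.futures import ThreadPoolExecutor

def hammer(synthesize_text, texts):
    # Run one request per thread against the same model instance,
    # as a threaded Flask server would under concurrent load.
    with ThreadPoolExecutor(max_workers=len(texts)) as pool:
        return list(pool.map(synthesize_text, texts))

# Inputs of sufficiently different lengths make the attention-shape
# mismatch (49 vs 53 in the traceback) likely.
texts = [
    "Hello there.",
    "A considerably longer sentence that forces a different decoder length.",
]
```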
Expected behavior
Expected thread-safe, read-only inference from the Tacotron2 TTS model, producing sensible audio identical to the audio generated by single-threaded inference.
Environment overview (please complete the following information)
Environment location: Base docker image nvidia/cuda:10.2-base-ubuntu18.04 with Python 3.7
Method of NeMo install: pip install nemo
Additional context
Simple fix: internal decoder states should not be shared attributes that concurrent invocations of the inference function can write to; they should be kept local to each call.
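In the meantime, a caller-side workaround is to serialize inference with a lock (a sketch under assumed names: `model` is any loaded Tacotron2 instance, and the only NeMo call used is `generate_spectrogram`, which appears in the traceback above):

```python
import threading

# One lock per process: only one thread may run the autoregressive
# decode at a time, so the decoder's shared attention state is never
# written by two requests at once.
_infer_lock = threading.Lock()

def synthesize_locked(model, tokens):
    with _infer_lock:
        return model.generate_spectrogram(tokens=tokens)
```

This trades throughput for correctness; the proper fix is to keep the decoder state local to each forward pass.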
Reference: https://discuss.pytorch.org/t/is-inference-thread-safe/88583