[Bug] ValueError: Can't infer missing attention mask on `mps` device. Please provide an `attention_mask` or use a different device. #3758

Open yukiarimo opened 1 month ago

yukiarimo commented 1 month ago

Describe the bug

 > Using model: xtts
 > Text splitted to sentences.
['Hello World']
Traceback (most recent call last):
  File "/opt/anaconda3/envs/ai/lib/python3.9/site-packages/gradio/", line 534, in predict
    output = await route_utils.call_process_api(
  File "/opt/anaconda3/envs/ai/lib/python3.9/site-packages/gradio/", line 226, in call_process_api
    output = await app.get_blocks().process_api(
  File "/opt/anaconda3/envs/ai/lib/python3.9/site-packages/gradio/", line 1550, in process_api
    result = await self.call_function(
  File "/opt/anaconda3/envs/ai/lib/python3.9/site-packages/gradio/", line 1185, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/opt/anaconda3/envs/ai/lib/python3.9/site-packages/anyio/", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
  File "/opt/anaconda3/envs/ai/lib/python3.9/site-packages/anyio/_backends/", line 2144, in run_sync_in_worker_thread
    return await future
  File "/opt/anaconda3/envs/ai/lib/python3.9/site-packages/anyio/_backends/", line 851, in run
    result =, *args)
  File "/opt/anaconda3/envs/ai/lib/python3.9/site-packages/gradio/", line 661, in wrapper
    response = f(*args, **kwargs)
  File "/Users/yuki/Music/Ivy/pho/", line 12, in clone
    tts.tts_to_file(text=text, speaker_wav=audio, language="en", file_path="./output.wav")
  File "/opt/anaconda3/envs/ai/lib/python3.9/site-packages/TTS/", line 432, in tts_to_file
    wav = self.tts(
  File "/opt/anaconda3/envs/ai/lib/python3.9/site-packages/TTS/", line 364, in tts
    wav = self.synthesizer.tts(
  File "/opt/anaconda3/envs/ai/lib/python3.9/site-packages/TTS/utils/", line 383, in tts
    outputs = self.tts_model.synthesize(
  File "/opt/anaconda3/envs/ai/lib/python3.9/site-packages/TTS/tts/models/", line 397, in synthesize
    return self.inference_with_config(text, config, ref_audio_path=speaker_wav, language=language, **kwargs)
  File "/opt/anaconda3/envs/ai/lib/python3.9/site-packages/TTS/tts/models/", line 419, in inference_with_config
    return self.full_inference(text, ref_audio_path, language, **settings)
  File "/opt/anaconda3/envs/ai/lib/python3.9/site-packages/torch/utils/", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/anaconda3/envs/ai/lib/python3.9/site-packages/TTS/tts/models/", line 488, in full_inference
    return self.inference(
  File "/opt/anaconda3/envs/ai/lib/python3.9/site-packages/torch/utils/", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/anaconda3/envs/ai/lib/python3.9/site-packages/TTS/tts/models/", line 539, in inference
    gpt_codes = self.gpt.generate(
  File "/opt/anaconda3/envs/ai/lib/python3.9/site-packages/TTS/tts/layers/xtts/", line 590, in generate
    gen = self.gpt_inference.generate(
  File "/opt/anaconda3/envs/ai/lib/python3.9/site-packages/torch/utils/", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/anaconda3/envs/ai/lib/python3.9/site-packages/transformers/generation/", line 1569, in generate
    model_kwargs["attention_mask"] = self._prepare_attention_mask_for_generation(
  File "/opt/anaconda3/envs/ai/lib/python3.9/site-packages/transformers/generation/", line 468, in _prepare_attention_mask_for_generation
    raise ValueError(
ValueError: Can't infer missing attention mask on `mps` device. Please provide an `attention_mask` or use a different device.

To Reproduce

Run this:

import gradio as gr
import torch
from TTS.api import TTS
import os
os.environ["COQUI_TOS_AGREED"] = "1"

device = "mps"

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

def clone(text, audio):
    tts.tts_to_file(text=text, speaker_wav=audio, language="en", file_path="./output.wav")
    return "./output.wav"

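iface = gr.Interface(fn=clone,
                     inputs=[gr.Textbox(label='Text'), gr.Audio(type='filepath', label='Voice reference audio file')],
                     outputs=gr.Audio(type='filepath'),  # clone() returns the path of the generated wav
                     title='Voice Clone',
                     theme=gr.themes.Base(primary_hue="teal", secondary_hue="teal", neutral_hue="slate"))

iface.launch()  # start the Gradio server so the error can be triggered from the UI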

Expected behavior

Environment

    "CUDA": {
        "GPU": [],
        "available": false,
        "version": null
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "2.3.0",
        "TTS": "0.21.3",
        "numpy": "1.22.0"
    "System": {
        "OS": "Darwin",
        "architecture": [
        "processor": "arm",
        "python": "3.9.19",
        "version": "Darwin Kernel Version 23.5.0: Wed May  1 20:12:58 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T6000"

Additional context

Hardware: MacBook Pro M1

the-homeless-god commented 1 month ago

Same issue for Apple M1 Max

the-homeless-god commented 1 month ago

But as far as I understand, the project is not being supported for now, so we need to figure out together how to fix it.

the-homeless-god commented 1 month ago

What I have found:


There's a topic about implementing the missing functionality.

What I did locally as a workaround:

open .venv/lib/python3.10/site-packages/transformers/generation/

and comment out the part for mps
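
A less invasive alternative to editing site-packages is the error message's second suggestion, "use a different device". Below is a minimal sketch of that idea, not a confirmed fix: `pick_device` and `detect_device` are hypothetical helper names, and the assumption is that running on `cpu` (or `cuda`, where available) sidesteps the `mps` check at the cost of speed.

```python
def pick_device(mps_available: bool, cuda_available: bool, avoid_mps: bool = True) -> str:
    """Return a torch device string, skipping `mps` while the mask check fails there."""
    if cuda_available:
        return "cuda"
    if mps_available and not avoid_mps:
        return "mps"
    return "cpu"


def detect_device(avoid_mps: bool = True) -> str:
    """Ask PyTorch which backends exist (hypothetical helper).

    torch is imported lazily so pick_device() can be used and tested without it.
    """
    import torch

    return pick_device(
        torch.backends.mps.is_available(),
        torch.cuda.is_available(),
        avoid_mps,
    )
```

In the reproduction script above, this would replace the hard-coded `device = "mps"` with `device = detect_device()`. Setting `avoid_mps=False` restores the original behavior once the library versions in use no longer raise the error.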


WahomeKezia commented 2 weeks ago

Same issue on a MacBook Pro M1 too

I am running the "distilgpt2" and "t5-small" models for simple prompts

I guess you would need more computational power and storage to manage the model checkpoints
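
For models called directly through transformers, such as "distilgpt2", the error message's first suggestion (provide an `attention_mask`) can be sketched as below. This is an untested sketch on `mps`, and `build_attention_mask` and `generate_with_mask` are hypothetical helper names. It assumes the error is only raised when `generate()` has to infer the mask itself. It does not apply to the XTTS case, where `generate()` is called inside the TTS library.

```python
def build_attention_mask(input_ids, pad_token_id):
    """Illustrate the mask convention: 1 for real tokens, 0 for padding."""
    return [[0 if tok == pad_token_id else 1 for tok in row] for row in input_ids]


def generate_with_mask(model_name: str, prompt: str, device: str = "mps",
                       max_new_tokens: int = 20) -> str:
    """Hypothetical helper: pass the tokenizer's attention_mask to generate() explicitly."""
    # Imported lazily so build_attention_mask() works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

    # The tokenizer already returns an attention_mask next to input_ids.
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    output_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],  # nothing left for generate() to infer
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

A call would look like `generate_with_mask("distilgpt2", "Hello World")`. An encoder-decoder model such as "t5-small" would need `AutoModelForSeq2SeqLM` instead of `AutoModelForCausalLM`.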

leodeveloper commented 2 weeks ago

Same issue on a MacBook Air M1

I am running "tts_models/multilingual/multi-dataset/xtts_v2".