coqui-ai / TTS

🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
http://coqui.ai
Mozilla Public License 2.0

[Bug] ValueError: Can't infer missing attention mask on `mps` device. Please provide an `attention_mask` or use a different device. #3758

Open yukiarimo opened 1 month ago

yukiarimo commented 1 month ago

Describe the bug

(ai) (base) yuki@yuki pho % python tts.py
OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
 > Downloading model to /Users/yuki/Library/Application Support/tts/tts_models--multilingual--multi-dataset--xtts_v2
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.87G/1.87G [01:04<00:00, 29.1MiB/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.37k/4.37k [00:00<00:00, 11.8kiB/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 361k/361k [00:00<00:00, 633kiB/s]
 > Model's license - CPML
 > Check https://coqui.ai/cpml.txt for more info.
 > Using model: xtts
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
IMPORTANT: You are using gradio version 3.48.0, however version 4.29.0 is available, please upgrade.
--------
/opt/anaconda3/envs/ai/lib/python3.9/site-packages/gradio/processing_utils.py:188: UserWarning: Trying to convert audio automatically from int32 to 16-bit int format.
  warnings.warn(warning.format(data.dtype))
 > Text splitted to sentences.
['Hello World']
Traceback (most recent call last):
  File "/opt/anaconda3/envs/ai/lib/python3.9/site-packages/gradio/routes.py", line 534, in predict
    output = await route_utils.call_process_api(
  File "/opt/anaconda3/envs/ai/lib/python3.9/site-packages/gradio/route_utils.py", line 226, in call_process_api
    output = await app.get_blocks().process_api(
  File "/opt/anaconda3/envs/ai/lib/python3.9/site-packages/gradio/blocks.py", line 1550, in process_api
    result = await self.call_function(
  File "/opt/anaconda3/envs/ai/lib/python3.9/site-packages/gradio/blocks.py", line 1185, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/opt/anaconda3/envs/ai/lib/python3.9/site-packages/anyio/to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
  File "/opt/anaconda3/envs/ai/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 2144, in run_sync_in_worker_thread
    return await future
  File "/opt/anaconda3/envs/ai/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 851, in run
    result = context.run(func, *args)
  File "/opt/anaconda3/envs/ai/lib/python3.9/site-packages/gradio/utils.py", line 661, in wrapper
    response = f(*args, **kwargs)
  File "/Users/yuki/Music/Ivy/pho/tts.py", line 12, in clone
    tts.tts_to_file(text=text, speaker_wav=audio, language="en", file_path="./output.wav")
  File "/opt/anaconda3/envs/ai/lib/python3.9/site-packages/TTS/api.py", line 432, in tts_to_file
    wav = self.tts(
  File "/opt/anaconda3/envs/ai/lib/python3.9/site-packages/TTS/api.py", line 364, in tts
    wav = self.synthesizer.tts(
  File "/opt/anaconda3/envs/ai/lib/python3.9/site-packages/TTS/utils/synthesizer.py", line 383, in tts
    outputs = self.tts_model.synthesize(
  File "/opt/anaconda3/envs/ai/lib/python3.9/site-packages/TTS/tts/models/xtts.py", line 397, in synthesize
    return self.inference_with_config(text, config, ref_audio_path=speaker_wav, language=language, **kwargs)
  File "/opt/anaconda3/envs/ai/lib/python3.9/site-packages/TTS/tts/models/xtts.py", line 419, in inference_with_config
    return self.full_inference(text, ref_audio_path, language, **settings)
  File "/opt/anaconda3/envs/ai/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/anaconda3/envs/ai/lib/python3.9/site-packages/TTS/tts/models/xtts.py", line 488, in full_inference
    return self.inference(
  File "/opt/anaconda3/envs/ai/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/anaconda3/envs/ai/lib/python3.9/site-packages/TTS/tts/models/xtts.py", line 539, in inference
    gpt_codes = self.gpt.generate(
  File "/opt/anaconda3/envs/ai/lib/python3.9/site-packages/TTS/tts/layers/xtts/gpt.py", line 590, in generate
    gen = self.gpt_inference.generate(
  File "/opt/anaconda3/envs/ai/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/anaconda3/envs/ai/lib/python3.9/site-packages/transformers/generation/utils.py", line 1569, in generate
    model_kwargs["attention_mask"] = self._prepare_attention_mask_for_generation(
  File "/opt/anaconda3/envs/ai/lib/python3.9/site-packages/transformers/generation/utils.py", line 468, in _prepare_attention_mask_for_generation
    raise ValueError(
ValueError: Can't infer missing attention mask on `mps` device. Please provide an `attention_mask` or use a different device.

To Reproduce

Run this:

import os

import gradio as gr
import torch
from TTS.api import TTS

os.environ["COQUI_TOS_AGREED"] = "1"  # accept the CPML license non-interactively

# Using the Apple Silicon `mps` backend is what triggers the error.
device = "mps"

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

def clone(text, audio):
    tts.tts_to_file(text=text, speaker_wav=audio, language="en", file_path="./output.wav")
    return "./output.wav"

iface = gr.Interface(
    fn=clone,
    inputs=[gr.Textbox(label="Text"), gr.Audio(type="filepath", label="Voice reference audio file")],
    outputs=gr.Audio(type="filepath"),
    title="Voice Clone",
    theme=gr.themes.Base(primary_hue="teal", secondary_hue="teal", neutral_hue="slate"),
)
iface.launch()
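
The error message itself suggests using a different device; for comparison, the same script should run with the model kept on CPU (much slower, but it sidesteps the mps-only check). A minimal sketch of that fallback, with "speaker.wav" as a hypothetical reference clip:

import torch
from TTS.api import TTS

# Workaround sketch: the ValueError is only raised for inputs on the
# mps device, so keeping the model on cpu avoids it (at a speed cost).
device = "cpu"  # instead of "mps"
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)
tts.tts_to_file(
    text="Hello World",
    speaker_wav="speaker.wav",  # hypothetical reference clip
    language="en",
    file_path="./output.wav",
)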

Expected behavior

No response

Logs

No response

Environment

{
    "CUDA": {
        "GPU": [],
        "available": false,
        "version": null
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "2.3.0",
        "TTS": "0.21.3",
        "numpy": "1.22.0"
    },
    "System": {
        "OS": "Darwin",
        "architecture": [
            "64bit",
            ""
        ],
        "processor": "arm",
        "python": "3.9.19",
        "version": "Darwin Kernel Version 23.5.0: Wed May  1 20:12:58 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T6000"
    }
}

Additional context

Hardware: MacBook Pro M1

the-homeless-god commented 1 month ago

Same issue on an Apple M1 Max.

the-homeless-god commented 1 month ago

But as I understand it, the project is no longer actively supported, so we need to figure out together how to fix it.

the-homeless-god commented 1 month ago

What I found:

PYTORCH_ENABLE_MPS_FALLBACK=1

There's an upstream issue at https://github.com/pytorch/pytorch/issues/77764 tracking the missing MPS functionality.
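
Note that the fallback flag only takes effect if it is in the environment before torch is first imported; a minimal sketch (the before-import placement is my understanding of how PyTorch reads the flag, not something verified in this thread):

import os

# Must be set before the first `import torch`, otherwise the
# MPS CPU-fallback flag is not picked up.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

import torch  # imported only after the flag is set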

What I did locally as a workaround:

open .venv/lib/python3.10/site-packages/transformers/generation/utils.py

and comment out the `mps`-specific check

[screenshot: the raising block in _prepare_attention_mask_for_generation, commented out]
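
For reference, the commented-out block looks roughly like this (reconstructed from the traceback above; exact wording and line numbers vary across transformers versions):

# transformers/generation/utils.py, inside _prepare_attention_mask_for_generation.
# Sketch reconstructed from the traceback; not a verbatim copy.
if inputs.device.type == "mps":
    # mps cannot run the ops needed to infer the mask from input_ids
    # (see https://github.com/pytorch/pytorch/issues/77764)
    raise ValueError(
        "Can't infer missing attention mask on `mps` device. "
        "Please provide an `attention_mask` or use a different device."
    )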

WahomeKezia commented 2 weeks ago

Same issue on an M1 Pro too.

I am running the "distilgpt2" and "t5-small" models for simple prompts.

I guess you would need more computational power and storage to manage the model checkpoints.
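
For what it's worth, the check can be hit with plain transformers as well; a minimal sketch (assumption: the guard fires whenever generate() runs on mps with a pad token set and no explicit attention_mask):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2").to("mps")
ids = tok("Hello World", return_tensors="pt").input_ids.to("mps")

try:
    # No attention_mask and a pad token set -> the same ValueError as above.
    model.generate(ids, pad_token_id=tok.eos_token_id, max_new_tokens=5)
except ValueError as e:
    print(e)

# Supplying the mask explicitly skips the inference step entirely:
out = model.generate(
    ids,
    attention_mask=torch.ones_like(ids),
    pad_token_id=tok.eos_token_id,
    max_new_tokens=5,
)
print(tok.decode(out[0]))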

leodeveloper commented 2 weeks ago

Same issue on a MacBook Air M1.

I am running "tts_models/multilingual/multi-dataset/xtts_v2".