as-ideas / ForwardTacotron

⏩ Generating speech in a single forward pass without any attention!
https://as-ideas.github.io/ForwardTacotron/
MIT License

Cast error details: Unable to cast [Array] to Tensor #110

Open Tony-Starkus opened 4 months ago

Tony-Starkus commented 4 months ago

Hello. I downloaded the pretrained model ljspeech v3.1, and when I try to run

python gen_forward.py --alpha 1 --checkpoint pretrained-forward_step90k.pt --input_text 'this is whatever you want it to be' griffinlim

I get the following error:

Traceback (most recent call last):
  File "/home/usertest/PycharmProjects/ForwardTacotron/gen_forward.py", line 116, in <module>
    dsp.save_wav(wav, out_path / f'{wav_name}.wav')
  File "/home/usertest/PycharmProjects/ForwardTacotron/utils/dsp.py", line 103, in save_wav
    torchaudio.save(filepath=path, src=waveform, sample_rate=self.sample_rate)
  File "/home/usertest/.virtualenvs/ForwardTacotron-Python3.10/lib/python3.10/site-packages/torchaudio/backend/sox_io_backend.py", line 429, in save
    torch.ops.torchaudio.sox_io_save_audio_file(
  File "/home/usertest/.virtualenvs/ForwardTacotron-Python3.10/lib/python3.10/site-packages/torch/_ops.py", line 502, in __call__
    return self._op(*args, **kwargs or {})
RuntimeError: torchaudio::sox_io_save_audio_file() Expected a value of type 'Tensor' for argument '_1' but instead found type 'ndarray'.
Position: 1
Value: array([0.00272604, 0.00512884, 0.00484867, ..., 0.00298105, 0.00193049,
       0.00093417], dtype=float32)
Declaration: torchaudio::sox_io_save_audio_file(str _0, Tensor _1, int _2, bool _3, float? _4, str? _5, str? _6, int? _7) -> ()
Cast error details: Unable to cast [0.00272604 0.00512884 0.00484867 ... 0.00298105 0.00193049 0.00093417] to Tensor

Can someone help me?

I am running Python 3.10 with the following package versions:

absl-py==2.1.0
attrs==23.2.0
audioread==3.0.1
Babel==2.15.0
bibtexparser==2.0.0b7
certifi==2024.7.4
cffi==1.16.0
charset-normalizer==3.3.2
clldutils==3.22.2
cmake==3.30.0
colorama==0.4.6
colorlog==6.8.2
contourpy==1.2.1
csvw==3.3.0
cycler==0.12.1
Cython==3.0.10
dataclasses==0.6
decorator==5.1.1
dlinfo==1.2.1
filelock==3.13.1
fonttools==4.53.1
fsspec==2024.2.0
grpcio==1.65.1
idna==3.7
inflect==7.3.1
isodate==0.6.1
Jinja2==3.1.3
joblib==1.4.2
jsonschema==4.23.0
jsonschema-specifications==2023.12.1
kiwisolver==1.4.5
language-tags==1.2.0
lazy_loader==0.4
librosa==0.10.0
lit==18.1.8
llvmlite==0.39.1
lxml==5.2.2
Markdown==3.6
MarkupSafe==2.1.5
matplotlib==3.9.1
more-itertools==10.3.0
mpmath==1.3.0
msgpack==1.0.8
networkx==3.2.1
numba==0.56.4
numpy==1.23.5
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-cupti-cu11==11.7.101
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
nvidia-cufft-cu11==10.9.0.58
nvidia-curand-cu11==10.2.10.91
nvidia-cusolver-cu11==11.4.0.1
nvidia-cusparse-cu11==11.7.4.91
nvidia-nccl-cu11==2.14.3
nvidia-nvtx-cu11==11.7.91
packaging==24.1
pandas==2.2.2
phonemizer==3.2.1
pillow==10.2.0
platformdirs==4.2.2
pooch==1.8.2
protobuf==4.25.3
pycparser==2.22
pylatexenc==2.10
pyparsing==3.1.2
python-dateutil==2.9.0.post0
pytz==2024.1
pyworld==0.3.4
PyYAML==6.0.1
rdflib==7.0.0
referencing==0.35.1
regex==2024.5.15
requests==2.32.3
Resemblyzer==0.1.3
rfc3986==1.5.0
rpds-py==0.19.0
scikit-learn==1.5.1
scipy==1.14.0
segments==2.2.1
six==1.16.0
soundfile==0.12.1
soxr==0.3.7
sympy==1.12
tabulate==0.9.0
tensorboard==2.17.0
tensorboard-data-server==0.7.2
threadpoolctl==3.5.0
torch==2.0.1
torchaudio==2.0.2
tqdm==4.66.4
triton==2.0.0
typeguard==4.3.0
typing==3.7.4.3
typing_extensions==4.12.2
tzdata==2024.1
Unidecode==1.3.8
uritemplate==4.1.1
urllib3==2.2.2
webrtcvad==2.0.10
Werkzeug==3.0.3

rmcpantoja commented 3 months ago

Hi. Make sure torchaudio is installed properly, along with the dependencies it needs to work, or use a vocoder such as hifigan or an iSTFT-based vocoder such as vocos. Honestly, those vocoders give better results than griffinlim.

Tony-Starkus commented 3 months ago

Hi @rmcpantoja, thanks for the reply.

About torchaudio: requirements.txt specifies torch>=1.2.0 and torchaudio==2.0.2. torchaudio 2 is compatible with PyTorch 2, which is why I installed torch==2.0.1.

My goal is to convert text to an audio file, and from looking at gen_forward.py, griffinlim is the option that creates a wav file. Do you know another way to do it? I have tried several pieces of code to convert .mel and .npy files to wav, without success.

Reference: https://github.com/pytorch/audio/releases/tag/v2.0.2

rmcpantoja commented 3 months ago

Hi. If you pass hifigan on the gen_forward.py command line, the script writes the .npy file automatically, and you then need to pass that .npy file to a vocoder. However, I have a script that runs ForwardTacotron and HiFi-GAN synthesis together, directly, without passing files between them. We also have a GUI app that supports this TTS, see here

Tony-Starkus commented 3 months ago

I looked at the tts-remix code. Could you briefly explain how to use it?

stavrosmachinima commented 3 months ago

Hey, I had the same issue. I fixed it with a two-line change in gen_forward.py and opened a PR for it.
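
The PR itself is not reproduced in this thread, so the following is only a minimal sketch of one way to address the error in the traceback, not the actual change. It assumes the griffinlim path hands a 1-D float32 NumPy array to torchaudio.save, which expects a 2-D torch.Tensor shaped (channels, samples). The helper names as_channel_tensor and save_wav below are hypothetical and are not taken from the repository.

```python
import numpy as np
import torch


def as_channel_tensor(wav):
    """Return `wav` as a float32 torch.Tensor shaped (channels, samples).

    Hypothetical helper: accepts either a NumPy array or a torch.Tensor.
    """
    if isinstance(wav, torch.Tensor):
        tensor = wav.detach().cpu().to(torch.float32)
    else:
        # ascontiguousarray copies only when the dtype or memory layout requires it.
        tensor = torch.from_numpy(np.ascontiguousarray(wav, dtype=np.float32))
    if tensor.dim() == 1:
        # A mono waveform needs an explicit channel dimension: (samples,) -> (1, samples).
        tensor = tensor.unsqueeze(0)
    return tensor


def save_wav(wav, path, sample_rate):
    """Write `wav` to `path`, converting it to a tensor first."""
    import torchaudio  # imported lazily so the conversion helper works without it

    # Path and tensor are passed positionally; sample_rate comes from the caller.
    torchaudio.save(str(path), as_channel_tensor(wav), sample_rate)
```

With a helper like this, the array returned by the Griffin-Lim step could be passed to save_wav unchanged, and an input that is already a tensor is simply moved to the CPU and cast to float32.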

rmcpantoja commented 3 months ago

Hi, just launch the GUI with:

python tts_remix.py

The interface will open. You just need to put the ForwardTacotron and HiFi-GAN checkpoints in place, with a layout something like:

models
models/forward
models/forward/voicename
models/forward/voicename/voicename.pt
models/forward/voicename/vocoder-voicename.pt
models/forward/voicename/vocoder-voicename.json
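
To check a voice folder against the layout described above before starting the GUI, a small stdlib-only sketch like the following could help. It assumes the three file names follow exactly the pattern given in this comment; the function name missing_voice_files is hypothetical and is not part of tts_remix.

```python
from pathlib import Path


def missing_voice_files(models_dir, voicename):
    """Return the expected checkpoint files that are absent for `voicename`.

    Assumes the layout models/forward/<voicename>/ described in the thread.
    """
    voice_dir = Path(models_dir) / "forward" / voicename
    expected = [
        voice_dir / f"{voicename}.pt",            # ForwardTacotron checkpoint
        voice_dir / f"vocoder-{voicename}.pt",    # HiFi-GAN checkpoint
        voice_dir / f"vocoder-{voicename}.json",  # HiFi-GAN config
    ]
    return [path for path in expected if not path.is_file()]


if __name__ == "__main__":
    # "voicename" is the placeholder used in the thread; replace it with a real voice.
    for path in missing_voice_files("models", "voicename"):
        print(f"missing: {path}")
```

An empty result means all three files are where the layout says they should be.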

Tony-Starkus commented 3 months ago

Looks good, I am going to try it later, thanks!

Which Python version are you using? Also, can you share your pip freeze, please?

Tony-Starkus commented 3 months ago

Got it, I will try this. Thanks!

stavrosmachinima commented 3 months ago

Python 3.10, same as you. Our pip freeze outputs differ slightly, but the differences shouldn't matter.

absl-py==2.1.0
attrs==23.2.0
audioread==3.0.1
Babel==2.15.0
bibtexparser==2.0.0b7
certifi==2024.7.4
cffi==1.16.0
charset-normalizer==3.3.2
clldutils==3.22.2
cmake==3.30.1
colorama==0.4.6
colorlog==6.8.2
contourpy==1.2.1
csvw==3.3.0
cycler==0.12.1
Cython==3.0.10
dataclasses==0.6
decorator==5.1.1
dlinfo==1.2.1
filelock==3.15.4
fonttools==4.53.1
grpcio==1.65.1
idna==3.7
inflect==7.3.1
isodate==0.6.1
Jinja2==3.1.4
joblib==1.4.2
jsonschema==4.23.0
jsonschema-specifications==2023.12.1
kiwisolver==1.4.5
language-tags==1.2.0
lazy_loader==0.4
librosa==0.10.0
lit==18.1.8
llvmlite==0.39.1
lxml==5.2.2
Markdown==3.6
MarkupSafe==2.1.5
matplotlib==3.9.1
more-itertools==10.3.0
mpmath==1.3.0
msgpack==1.0.8
networkx==3.3
numba==0.56.4
numpy==1.23.5
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-cupti-cu11==11.7.101
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
nvidia-cufft-cu11==10.9.0.58
nvidia-curand-cu11==10.2.10.91
nvidia-cusolver-cu11==11.4.0.1
nvidia-cusparse-cu11==11.7.4.91
nvidia-nccl-cu11==2.14.3
nvidia-nvtx-cu11==11.7.91
packaging==24.1
pandas==2.2.2
phonemizer==3.2.1
pillow==10.4.0
platformdirs==4.2.2
pooch==1.8.2
protobuf==4.25.4
pycparser==2.22
pylatexenc==2.10
pyparsing==3.1.2
python-dateutil==2.9.0.post0
pytz==2024.1
pyworld==0.3.4
PyYAML==6.0.1
rdflib==7.0.0
referencing==0.35.1
regex==2024.7.24
requests==2.32.3
Resemblyzer==0.1.3
rfc3986==1.5.0
rpds-py==0.19.1
scikit-learn==1.5.1
scipy==1.14.0
segments==2.2.1
six==1.16.0
soundfile==0.12.1
soxr==0.4.0
sympy==1.13.1
tabulate==0.9.0
tensorboard==2.17.0
tensorboard-data-server==0.7.2
threadpoolctl==3.5.0
torch==2.0.1
torchaudio==2.0.2
tqdm==4.66.4
triton==2.0.0
typeguard==4.3.0
typing==3.7.4.3
typing_extensions==4.12.2
tzdata==2024.1
Unidecode==1.3.8
uritemplate==4.1.1
urllib3==2.2.2
webrtcvad==2.0.10
Werkzeug==3.0.3