Closed: AlexSteveChungAlvarez closed this issue 5 months ago.
I see, something must have changed in the dependency. Thanks for letting me know, I will try to fix it!
Can you try again with the new version of the requirements? I made sure the version of speechbrain I require has compatible syntax. If it still doesn't work, then something is wrong with the speechbrain package and I need to investigate that.
This is what shows up now:
Traceback (most recent call last):
File "/content/IMS-Toucan/run_text_to_file_reader.py", line 5, in <module>
from InferenceInterfaces.ToucanTTSInterface import ToucanTTSInterface
File "/content/IMS-Toucan/InferenceInterfaces/ToucanTTSInterface.py", line 17, in <module>
from speechbrain.pretrained import EncoderClassifier
File "/usr/local/lib/python3.10/dist-packages/speechbrain/__init__.py", line 4, in <module>
from .core import Stage, Brain, create_experiment_directory, parse_arguments
File "/usr/local/lib/python3.10/dist-packages/speechbrain/core.py", line 36, in <module>
from speechbrain.utils.distributed import run_on_main
File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/__init__.py", line 11, in <module>
from . import * # noqa
File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/train_logger.py", line 231, in <module>
class ProgressSampleLogger:
File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/train_logger.py", line 300, in ProgressSampleLogger
"saver": _get_image_saver(),
File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/train_logger.py", line 223, in _get_image_saver
import torchvision
File "/usr/local/lib/python3.10/dist-packages/torchvision/__init__.py", line 6, in <module>
from torchvision import _meta_registrations, datasets, io, models, ops, transforms, utils
File "/usr/local/lib/python3.10/dist-packages/torchvision/_meta_registrations.py", line 164, in <module>
def meta_nms(dets, scores, iou_threshold):
File "/usr/local/lib/python3.10/dist-packages/torch/_custom_ops.py", line 253, in inner
custom_op = _find_custom_op(qualname, also_check_torch_library=True)
File "/usr/local/lib/python3.10/dist-packages/torch/_custom_op/impl.py", line 1076, in _find_custom_op
overload = get_op(qualname)
File "/usr/local/lib/python3.10/dist-packages/torch/_custom_op/impl.py", line 1062, in get_op
error_not_found()
File "/usr/local/lib/python3.10/dist-packages/torch/_custom_op/impl.py", line 1052, in error_not_found
raise ValueError(
ValueError: Could not find the operator torchvision::nms. Please make sure you have already registered the operator and (if registered from C++) loaded it via torch.ops.load_library.
It seems like you updated the torch and torchaudio versions in the requirements, but not torchvision; since Colab comes with torchvision preinstalled, the version mismatch is probably the issue.
Hey @AlexSteveChungAlvarez,
just try installing this:
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118 -U
It should work; I tested it and it works fine on Windows 10 with Python 3.10.0.
@lpscr I guess it won't work on Colab. After running your solution (and also trying the version of pytorch stated in the requirements), I still get the first error, "NotImplementedError: Only 2D, 3D, 4D, 5D padding with non-constant padding are supported for now". Update: I've also tried your solution on Windows, in a fresh environment with the specifications you gave in #171, and it throws the same error.
@AlexSteveChungAlvarez
Here is the full code; it works fine in Colab.
Maybe the problem is the audio you upload? Try using a wav file and let me know.
Just copy-paste the code into the first and second cells. When you run the first cell you will get a message like "Restart session"; just click cancel, then run cell 2.
First cell:
import glob
import IPython.display as ipd
import os
!git clone https://github.com/DigitalPhonetics/IMS-Toucan.git
%cd IMS-Toucan
!pip install dragonmapper pypinyin wandb dotwiz pyloudnorm einops speechbrain==0.5.13 torch_complex praat-parselmouth transphone jamo g2pk
!python run_model_downloader.py
!pip install gradio
!apt-get -y install python-espeak
!apt-get -y install espeak-ng
!pip install py-espeak-ng
!pip install phonemizer
!pip install sounddevice
!apt-get install libportaudio2
!pip install audioseal
filename = None
When you run this code with 'upload_new_audio' checked, you must upload the audio first, before generating text. If you don't need to upload new audio and only want to generate text, simply uncheck 'upload_new_audio'.
Second cell:
import os
import warnings
from InferenceInterfaces.ToucanTTSInterface import ToucanTTSInterface
from Utility.storage_config import MODELS_DIR
import numpy as np
import IPython.display as ipd
import soundfile as sf
from google.colab import files

upload_new_audio = True  # @param {type:"boolean"}
text = "hi how are you today"  # @param {type:"string"}
lang_id = "eng"  # @param {type:"string"}

if upload_new_audio:
    uploaded = files.upload()
    for filename in uploaded.keys():
        with open(filename, 'wb') as f:
            f.write(uploaded[filename])

warnings.filterwarnings("ignore", category=UserWarning)

device = "cpu"
file_model = os.path.join(MODELS_DIR, "ToucanTTS_Meta", "best.pt")
tts = ToucanTTSInterface(device=device, tts_model_path=file_model)
tts.set_language(lang_id=lang_id)

if filename is not None:
    wav_ref = filename
    tts.set_utterance_embedding(wav_ref)

tts.read_to_file([text], "output.wav", duration_scaling_factor=1.0, energy_variance_scale=1.0, pitch_variance_scale=1.0, glow_sampling_temperature=0.2)

# only show the reference player if a reference was actually uploaded
if filename is not None:
    print("Ref")
    ipd.display(ipd.Audio(wav_ref))
print("Gen")
ipd.display(ipd.Audio("output.wav"))
Here is how it looks:
Yes, I got rid of torchvision as a requirement entirely. Maybe just try uninstalling torchvision from your Colab before installing the Toucan dependencies?
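(For reference, that uninstall is just standard pip usage; run it in a cell before the install commands from cell 1:)

!pip uninstall -y torchvision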
Still having the same error on both Windows and colab:
torchvision is not available - cannot save figures
running on cuda
/usr/local/lib/python3.10/dist-packages/torch/nn/utils/weight_norm.py:30: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
/usr/local/lib/python3.10/dist-packages/torch/functional.py:650: UserWarning: stft with return_complex=False is deprecated. In a future pytorch release, stft will return complex tensors for all inputs, and return_complex=False will raise an error.
Note: you can still call torch.view_as_real on the complex output to recover the old return format. (Triggered internally at ../aten/src/ATen/native/SpectralOps.cpp:863.)
return _VF.stft(input, n_fft, hop_length, win_length, window, # type: ignore[attr-defined]
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/speechbrain/lobes/models/ECAPA_TDNN.py", line 488, in forward
x = layer(x, lengths=lengths)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
TypeError: TDNNBlock.forward() got an unexpected keyword argument 'lengths'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/content/IMS-Toucan/run_text_to_file_reader.py", line 97, in <module>
read_texts(model_id="Meta",
File "/content/IMS-Toucan/run_text_to_file_reader.py", line 12, in read_texts
tts.set_utterance_embedding(speaker_reference)
File "/content/IMS-Toucan/InferenceInterfaces/ToucanTTSInterface.py", line 108, in set_utterance_embedding
speaker_embedding = self.speaker_embedding_func_ecapa.encode_batch(wavs=wave.to(self.device).unsqueeze(0)).squeeze()
File "/usr/local/lib/python3.10/dist-packages/speechbrain/pretrained/interfaces.py", line 830, in encode_batch
embeddings = self.mods.embedding_model(feats, wav_lens)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/speechbrain/lobes/models/ECAPA_TDNN.py", line 490, in forward
x = layer(x)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/speechbrain/lobes/models/ECAPA_TDNN.py", line 81, in forward
return self.norm(self.activation(self.conv(x)))
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/speechbrain/nnet/CNN.py", line 420, in forward
x = self._manage_padding(
File "/usr/local/lib/python3.10/dist-packages/speechbrain/nnet/CNN.py", line 472, in _manage_padding
x = F.pad(x, padding, mode=self.padding_mode)
NotImplementedError: Only 2D, 3D, 4D, 5D padding with non-constant padding are supported for now
I've just tried running the CLI demo on the Windows side; it works without a reference speaker, so, as my issue says, the problem occurs when passing a reference_speaker.
Maybe the problem is the audio you upload? Try using a wav file and let me know.
@lpscr I have used both .wav and .mp3 files. I've used these same files with Toucan v1. I have even tried putting wav audios in the same directory as in the code ("merged_speaker_references") to try the "sound_of_silence_single_utt" function, but it just throws the same error... @Flux9665 could the problem be an updated function from a dependency used in "set_utterance_embedding", maybe?
That's what I thought, so I changed the required version of speechbrain from ~= to ==, but it didn't fix the issue for you. I'm not sure where this issue is coming from, since I cannot reproduce it, which makes it very hard to figure out.
Can you double check which version of speechbrain you have installed? And if it is 0.5.13, as the requirements demand, can you check if an earlier or later version from this list fixes the problem? https://pypi.org/project/speechbrain/#history
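(A quick way to check the installed version, using standard pip only:)

!pip show speechbrain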
It works fine for me in both Colab and Windows; I tested both with the LJ-Speech sample. I'm not sure why you get this error. Maybe the duration of the audio is too short or too long, or it needs resampling; I can't tell without seeing the audio file, because it works fine for me. Here is my test:
Windows
With the code I gave, you first need to create the folders; I missed that step: first create the folder IMS-Toucan/audios/speaker_references and put the file inside (this time you need to drag it into IMS-Toucan/audios/speaker_references), then it works fine.
Then make a new cell and run this:
!python run_text_to_file_reader.py
Colab
I had the same error: the problem is the format of the reference wav. In my case it was the fact that it wasn't mono but 5.1. Once converted to mono, the error disappeared (Docker on Linux under W11).
That's an easy fix then, I just added librosa.to_mono after the audio is loaded. Thanks @fcrescio for figuring it out!
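(A minimal sketch of what that fix could look like, assuming the interface loads the reference with librosa; path_to_reference is a placeholder, and the actual commit may differ:)

import librosa

wave, sr = librosa.load(path_to_reference, sr=None, mono=False)  # keep channels as-is
wave = librosa.to_mono(wave)  # collapse any multi-channel reference to mono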
@AlexSteveChungAlvarez please test with the most recent commit and let me know if it works now.
Just a quick question @Flux9665: I also thought the reference audio was converted automatically, which is why I didn't mention converting to mono. Also, should the sample rate be 24000 or 16000 Hz to get the best reference?
Also, can I use emotional audio? For example, if I use a reference where the speaker sounds sad or happy, will the output speak like that?
This is the wav audio I'm using as reference https://drive.google.com/file/d/17_zvpvStcU3Yix2hDiPNTwRWqeuLRAfC/view?usp=sharing Now this error is raised:
running on cuda
d:\IMS-Toucan\.toucan_multi\lib\site-packages\torch\nn\utils\weight_norm.py:30: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
d:\IMS-Toucan\.toucan_multi\lib\site-packages\torch\functional.py:650: UserWarning: stft with return_complex=False is deprecated. In a future pytorch release, stft will return complex tensors for all inputs, and return_complex=False will raise an error.
Note: you can still call torch.view_as_real on the complex output to recover the old return format. (Triggered internally at ..\aten\src\ATen\native\SpectralOps.cpp:868.)
return _VF.stft(input, n_fft, hop_length, win_length, window, # type: ignore[attr-defined]
Traceback (most recent call last):
File "d:\IMS-Toucan\.toucan_multi\lib\site-packages\speechbrain\lobes\models\ECAPA_TDNN.py", line 488, in forward
x = layer(x, lengths=lengths)
File "d:\IMS-Toucan\.toucan_multi\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "d:\IMS-Toucan\.toucan_multi\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
TypeError: TDNNBlock.forward() got an unexpected keyword argument 'lengths'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "d:\IMS-Toucan\run_text_to_file_reader.py", line 100, in <module>
sound_of_silence_single_utt(version="new_voc",
File "d:\IMS-Toucan\run_text_to_file_reader.py", line 42, in sound_of_silence_single_utt
read_texts(model_id=model_id,
File "d:\IMS-Toucan\run_text_to_file_reader.py", line 12, in read_texts
tts.set_utterance_embedding(speaker_reference)
File "d:\IMS-Toucan\InferenceInterfaces\ToucanTTSInterface.py", line 110, in set_utterance_embedding
speaker_embedding = self.speaker_embedding_func_ecapa.encode_batch(wavs=wave.to(self.device).squeeze().unsqueeze(0)).squeeze()
File "d:\IMS-Toucan\.toucan_multi\lib\site-packages\speechbrain\pretrained\interfaces.py", line 830, in encode_batch
embeddings = self.mods.embedding_model(feats, wav_lens)
File "d:\IMS-Toucan\.toucan_multi\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "d:\IMS-Toucan\.toucan_multi\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "d:\IMS-Toucan\.toucan_multi\lib\site-packages\speechbrain\lobes\models\ECAPA_TDNN.py", line 490, in forward
x = layer(x)
File "d:\IMS-Toucan\.toucan_multi\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "d:\IMS-Toucan\.toucan_multi\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "d:\IMS-Toucan\.toucan_multi\lib\site-packages\speechbrain\lobes\models\ECAPA_TDNN.py", line 81, in forward
return self.norm(self.activation(self.conv(x)))
File "d:\IMS-Toucan\.toucan_multi\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "d:\IMS-Toucan\.toucan_multi\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "d:\IMS-Toucan\.toucan_multi\lib\site-packages\speechbrain\nnet\CNN.py", line 420, in forward
x = self._manage_padding(
File "d:\IMS-Toucan\.toucan_multi\lib\site-packages\speechbrain\nnet\CNN.py", line 472, in _manage_padding
x = F.pad(x, padding, mode=self.padding_mode)
RuntimeError: Argument #4: Padding size should be less than the corresponding input dimension, but got: padding (2, 2) at dimension 2 of input [1, 80, 1]
Can you double check which version of speechbrain you have installed? And if it is 0.5.13, as the requirements demand, can you check if an earlier or later version from this list fixes the problem? https://pypi.org/project/speechbrain/#history
Yes @Flux9665, speechbrain is 0.5.13. I've just tried later versions up to the latest release; all throw the same error. If I downgrade the version, then audioseal and CUDA would need to be downgraded too, to a version compatible with pytorch <1.13.
Just a quick question @Flux9665: I also thought the reference audio was converted automatically, which is why I didn't mention converting to mono. Also, should the sample rate be 24000 or 16000 Hz to get the best reference?
There's a resample step which automatically converts the audio to 16 kHz, so the input sample rate doesn't matter.
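(As a sketch of what such a step typically looks like with librosa; the exact call inside the interface may differ:)

import librosa

wave, sr = librosa.load(path_to_reference, sr=None, mono=False)  # keep native rate
wave = librosa.to_mono(wave)
if sr != 16000:
    wave = librosa.resample(wave, orig_sr=sr, target_sr=16000)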
Also, can I use emotional audio? For example, if I use a reference where the speaker sounds sad or happy, will the output speak like that?
Version 1 didn't have emotion controllability, and the version 2 demo doesn't have it either. You would have to finetune the model on datasets for those specific emotions, or add an "emotion predictor" network to the model, which would let you control this as you want, @lpscr. You could also play with the pitch, energy, and speed predictors and map specific combinations of them to an emotion; see the sketch below.
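(As a purely illustrative example of that last idea, reusing the read_to_file call from the Colab cell above; these scale values are made up, not a known mapping to any emotion:)

# exaggerate prosody as a rough stand-in for a more "expressive" delivery
tts.read_to_file([text], "expressive.wav",
                 duration_scaling_factor=1.1,  # slightly slower speech
                 energy_variance_scale=1.3,    # more loudness variation
                 pitch_variance_scale=1.4,     # more pitch movement
                 glow_sampling_temperature=0.2)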
Hi @Flux9665, it's not working; you still need to convert the file manually for it to work. I use pydub, see cell 2.
@AlexSteveChungAlvarez here I made a new update to convert the audio to mono at 16000 Hz.
Cell 1:
import glob
import IPython.display as ipd
import os
!git clone https://github.com/DigitalPhonetics/IMS-Toucan.git
%cd IMS-Toucan
!pip install dragonmapper pypinyin wandb dotwiz pyloudnorm einops speechbrain==0.5.13 torch_complex praat-parselmouth transphone jamo g2pk
!python run_model_downloader.py
!pip install gradio
!apt-get -y install python-espeak
!apt-get -y install espeak-ng
!pip install py-espeak-ng
!pip install phonemizer
!pip install sounddevice
!apt-get install libportaudio2
!pip install audioseal
!pip install pydub
filename = None
Cell 2:
import os
import warnings
from InferenceInterfaces.ToucanTTSInterface import ToucanTTSInterface
from Utility.storage_config import MODELS_DIR
import numpy as np
import IPython.display as ipd
import soundfile as sf
from google.colab import files
from pydub import AudioSegment

upload_new_audio = True  # @param {type:"boolean"}
text = "hi how are you today"  # @param {type:"string"}
lang_id = "eng"  # @param {type:"string"}

if upload_new_audio:
    uploaded = files.upload()
    for filename in uploaded.keys():
        with open(filename, 'wb') as f:
            f.write(uploaded[filename])
        # convert the uploaded audio to mono at a 16000 Hz sample rate
        sound = AudioSegment.from_file(filename, format="wav")
        sound = sound.set_channels(1)
        sound = sound.set_frame_rate(16000)
        sound.export(filename, format="wav")

warnings.filterwarnings("ignore", category=UserWarning)

device = "cpu"
file_model = os.path.join(MODELS_DIR, "ToucanTTS_Meta", "best.pt")
tts = ToucanTTSInterface(device=device, tts_model_path=file_model)
tts.set_language(lang_id=lang_id)

if filename is not None:
    wav_ref = filename
    tts.set_utterance_embedding(wav_ref)

tts.read_to_file([text], "output.wav", duration_scaling_factor=1.0, energy_variance_scale=1.0, pitch_variance_scale=1.0, glow_sampling_temperature=0.2)

# only show the reference player if a reference was actually uploaded
if filename is not None:
    print("Ref")
    ipd.display(ipd.Audio(wav_ref))
print("Gen")
ipd.display(ipd.Audio("output.wav"))
This is the wav audio I'm using as reference https://drive.google.com/file/d/17_zvpvStcU3Yix2hDiPNTwRWqeuLRAfC/view?usp=sharing Now this error is raised:
@Flux9665 were you able to test with my audio sample?
As mentioned above, it needs to be converted from stereo to mono, then it works.
The problem is that for some reason the time axis and the channel axis in this audio are switched. I wrote a check to detect this and switch the axes if that's the case, so to_mono gets the shape it expects.
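(A minimal sketch of the kind of check described, assuming the waveform arrives as a numpy array; the actual implementation in the repo may differ:)

import numpy as np
import librosa

def ensure_mono(wave):
    # librosa.to_mono expects (channels, samples); if the axes arrive
    # switched as (samples, channels), transpose before downmixing.
    if wave.ndim == 2 and wave.shape[0] > wave.shape[1]:
        wave = wave.T  # far more rows than columns -> time axis came first
    return librosa.to_mono(wave)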
Thank you @Flux9665, now it works well on Windows! Congratulations on the new version of Toucan. I saw that in the end you applied the AdaIN layer I talked to you about last year, and that you were able to include not only Quechua but also Aymara among the languages; I mentioned to you by email that I was working with both of them last year too. It's impressive, and ironic, that you found an Aymara speaker while being far from the region where the language originates, while I haven't found one despite being closer. What surprised me about this release was that your dataset is not of the best quality, given that you once suggested LibriTTS-R to me as an example of the high-quality data I could play with; but that topic is for a later discussion, maybe you can open a Q&A section in the repository for it.
Luckily my collaborators on that paper have worked with the Aymara language before and have contacts with speakers, so we could get their help in the evaluation. And yes, quality is still a big issue. Next week I hope to release an updated version that hopefully sounds a bit better, alongside other features. Mismatches in the labels, incorrect alignments, and other problems in this large cascade add up and cause major problems if the data used is not of very high quality and the phonemizer doesn't perform near-perfectly on the given language.
I already tried passing a .wav and a .mp3 file; in both cases this error is raised: