DigitalPhonetics / IMS-Toucan

Multilingual and Controllable Text-to-Speech Toolkit of the Speech and Language Technologies Group at the University of Stuttgart.
Apache License 2.0

Error when passing a reference_speaker in read_texts function after installing Toucan as described in issue #171 #172

Closed AlexSteveChungAlvarez closed 3 days ago

AlexSteveChungAlvarez commented 1 week ago

I have already tried passing both a .wav and an .mp3 file; in both cases this error is raised:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/speechbrain/lobes/models/ECAPA_TDNN.py", line 486, in forward
    x = layer(x, lengths=lengths)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
TypeError: TDNNBlock.forward() got an unexpected keyword argument 'lengths'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/content/IMS-Toucan/run_text_to_file_reader.py", line 99, in <module>
    read_texts(model_id="Meta",
  File "/content/IMS-Toucan/run_text_to_file_reader.py", line 12, in read_texts
    tts.set_utterance_embedding(speaker_reference)
  File "/content/IMS-Toucan/InferenceInterfaces/ToucanTTSInterface.py", line 108, in set_utterance_embedding
    speaker_embedding = self.speaker_embedding_func_ecapa.encode_batch(wavs=wave.to(self.device).unsqueeze(0)).squeeze()
  File "/usr/local/lib/python3.10/dist-packages/speechbrain/inference/classifiers.py", line 110, in encode_batch
    embeddings = self.mods.embedding_model(feats, wav_lens)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/speechbrain/lobes/models/ECAPA_TDNN.py", line 488, in forward
    x = layer(x)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/speechbrain/lobes/models/ECAPA_TDNN.py", line 80, in forward
    return self.norm(self.activation(self.conv(x)))
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/speechbrain/nnet/CNN.py", line 428, in forward
    x = self._manage_padding(
  File "/usr/local/lib/python3.10/dist-packages/speechbrain/nnet/CNN.py", line 480, in _manage_padding
    x = F.pad(x, padding, mode=self.padding_mode)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py", line 4522, in pad
    return torch._C._nn.pad(input, pad, mode, value)
NotImplementedError: Only 2D, 3D, 4D, 5D padding with non-constant padding are supported for now
Flux9665 commented 1 week ago

I see, something must have changed in the dependency. Thanks for letting me know, I will try to fix it!

Flux9665 commented 1 week ago

Can you try again with the new version of the requirements? I made sure the version of speechbrain I require has compatible syntax. If it still doesn't work, then something is wrong with the speechbrain package and I need to investigate that.

AlexSteveChungAlvarez commented 1 week ago

> Can you try again with the new version of the requirements? I made sure the version of speechbrain I require has compatible syntax. If it still doesn't work, then something is wrong with the speechbrain package and I need to investigate that.

This is what shows up now:

Traceback (most recent call last):
  File "/content/IMS-Toucan/run_text_to_file_reader.py", line 5, in <module>
    from InferenceInterfaces.ToucanTTSInterface import ToucanTTSInterface
  File "/content/IMS-Toucan/InferenceInterfaces/ToucanTTSInterface.py", line 17, in <module>
    from speechbrain.pretrained import EncoderClassifier
  File "/usr/local/lib/python3.10/dist-packages/speechbrain/__init__.py", line 4, in <module>
    from .core import Stage, Brain, create_experiment_directory, parse_arguments
  File "/usr/local/lib/python3.10/dist-packages/speechbrain/core.py", line 36, in <module>
    from speechbrain.utils.distributed import run_on_main
  File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/__init__.py", line 11, in <module>
    from . import *  # noqa
  File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/train_logger.py", line 231, in <module>
    class ProgressSampleLogger:
  File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/train_logger.py", line 300, in ProgressSampleLogger
    "saver": _get_image_saver(),
  File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/train_logger.py", line 223, in _get_image_saver
    import torchvision
  File "/usr/local/lib/python3.10/dist-packages/torchvision/__init__.py", line 6, in <module>
    from torchvision import _meta_registrations, datasets, io, models, ops, transforms, utils
  File "/usr/local/lib/python3.10/dist-packages/torchvision/_meta_registrations.py", line 164, in <module>
    def meta_nms(dets, scores, iou_threshold):
  File "/usr/local/lib/python3.10/dist-packages/torch/_custom_ops.py", line 253, in inner
    custom_op = _find_custom_op(qualname, also_check_torch_library=True)
  File "/usr/local/lib/python3.10/dist-packages/torch/_custom_op/impl.py", line 1076, in _find_custom_op
    overload = get_op(qualname)
  File "/usr/local/lib/python3.10/dist-packages/torch/_custom_op/impl.py", line 1062, in get_op
    error_not_found()
  File "/usr/local/lib/python3.10/dist-packages/torch/_custom_op/impl.py", line 1052, in error_not_found
    raise ValueError(
ValueError: Could not find the operator torchvision::nms. Please make sure you have already registered the operator and (if registered from C++) loaded it via torch.ops.load_library.

It seems you updated the torch and torchaudio versions in the requirements but not torchvision. Since Colab comes with torchvision preinstalled, the installed torchvision probably no longer matches the new torch version.

lpscr commented 1 week ago

Hey @AlexSteveChungAlvarez,
just try installing this:

pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118 -U

It should work; I tested it and it works fine on Windows 10 with Python 3.10.0.

AlexSteveChungAlvarez commented 1 week ago

@lpscr I guess it won't work on Colab: after running your solution (and also trying the version of pytorch stated in the requirements), I still get the first error, "NotImplementedError: Only 2D, 3D, 4D, 5D padding with non-constant padding are supported for now". Update: I've also tried your solution on Windows, in a fresh environment with the specifications you gave in #171, and it throws the same error.

lpscr commented 1 week ago

@AlexSteveChungAlvarez

Here is the full code; it works fine in Colab.

Maybe the problem is the audio you uploaded? Try using a WAV file and let me know.

Just copy and paste the code into a first and a second cell. When you run the first cell, you will get a message asking you to restart the session; just click Cancel and then run the second cell.

First cell:

import glob
import IPython.display as ipd
import os

!git clone https://github.com/DigitalPhonetics/IMS-Toucan.git
%cd IMS-Toucan

!pip install dragonmapper pypinyin wandb dotwiz pyloudnorm einops speechbrain==0.5.13 torch_complex praat-parselmouth transphone jamo g2pk
!python run_model_downloader.py
!pip install gradio

!apt-get -y install python-espeak
!apt-get -y install espeak-ng
!pip install py-espeak-ng
!pip install phonemizer

!pip install sounddevice
!apt-get install libportaudio2
!pip install audioseal

filename = None

When you run the second cell with 'upload_new_audio' checked, an upload button appears, and you must upload an audio file before any speech is generated. If you don't need to upload new audio and only want to generate speech from the text, simply uncheck 'upload_new_audio'.

Second cell:

import os
import warnings
from InferenceInterfaces.ToucanTTSInterface import ToucanTTSInterface
from Utility.storage_config import MODELS_DIR

import numpy as np
import IPython.display as ipd
import soundfile as sf

from google.colab import files

upload_new_audio = True # @param {type:"boolean"}
text="hi how are you today" # @param {type:"string"}
lang_id="eng"  # @param {type:"string"}

if upload_new_audio:
   uploaded = files.upload()
   for filename in uploaded.keys():
       with open(filename, 'wb') as f:
            f.write(uploaded[filename])

warnings.filterwarnings("ignore", category=UserWarning)
device = "cpu"

file_model = os.path.join(MODELS_DIR, "ToucanTTS_Meta", "best.pt")
tts = ToucanTTSInterface(device=device, tts_model_path=file_model)
tts.set_language(lang_id=lang_id)

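# Only condition on a reference speaker if a file was actually uploaded.
wav_ref = None
if filename is not None:
    wav_ref = filename
    tts.set_utterance_embedding(wav_ref)

tts.read_to_file([text], "output.wav", duration_scaling_factor=1.0, energy_variance_scale=1.0, pitch_variance_scale=1.0, glow_sampling_temperature=0.2)
if wav_ref is not None:
    # Skip the reference player when there is no reference audio.
    print("Ref")
    ipd.display(ipd.Audio(wav_ref))
print("Gen")
ipd.display(ipd.Audio("output.wav"))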

Here is how it looks: [image]

Flux9665 commented 1 week ago

Yes, I got rid of torchvision as a requirement entirely. Maybe just try uninstalling torchvision from your Colab environment before installing the Toucan dependencies?

AlexSteveChungAlvarez commented 1 week ago

Still having the same error on both Windows and colab:

torchvision is not available - cannot save figures
running on cuda
/usr/local/lib/python3.10/dist-packages/torch/nn/utils/weight_norm.py:30: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
/usr/local/lib/python3.10/dist-packages/torch/functional.py:650: UserWarning: stft with return_complex=False is deprecated. In a future pytorch release, stft will return complex tensors for all inputs, and return_complex=False will raise an error.
Note: you can still call torch.view_as_real on the complex output to recover the old return format. (Triggered internally at ../aten/src/ATen/native/SpectralOps.cpp:863.)
  return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/speechbrain/lobes/models/ECAPA_TDNN.py", line 488, in forward
    x = layer(x, lengths=lengths)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
TypeError: TDNNBlock.forward() got an unexpected keyword argument 'lengths'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/content/IMS-Toucan/run_text_to_file_reader.py", line 97, in <module>
    read_texts(model_id="Meta",
  File "/content/IMS-Toucan/run_text_to_file_reader.py", line 12, in read_texts
    tts.set_utterance_embedding(speaker_reference)
  File "/content/IMS-Toucan/InferenceInterfaces/ToucanTTSInterface.py", line 108, in set_utterance_embedding
    speaker_embedding = self.speaker_embedding_func_ecapa.encode_batch(wavs=wave.to(self.device).unsqueeze(0)).squeeze()
  File "/usr/local/lib/python3.10/dist-packages/speechbrain/pretrained/interfaces.py", line 830, in encode_batch
    embeddings = self.mods.embedding_model(feats, wav_lens)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/speechbrain/lobes/models/ECAPA_TDNN.py", line 490, in forward
    x = layer(x)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/speechbrain/lobes/models/ECAPA_TDNN.py", line 81, in forward
    return self.norm(self.activation(self.conv(x)))
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/speechbrain/nnet/CNN.py", line 420, in forward
    x = self._manage_padding(
  File "/usr/local/lib/python3.10/dist-packages/speechbrain/nnet/CNN.py", line 472, in _manage_padding
    x = F.pad(x, padding, mode=self.padding_mode)
NotImplementedError: Only 2D, 3D, 4D, 5D padding with non-constant padding are supported for now

I've just tried running the CLI demo on Windows. It works without a reference speaker, so, as the issue title says, the problem only occurs when passing a reference_speaker.

AlexSteveChungAlvarez commented 1 week ago

> Maybe the problem is the audio you uploaded? Try using a WAV file and let me know.

@lpscr I have used both .wav and .mp3 files. I've used these same files with Toucan v1. I have even tried putting WAV files in the same directory as in the code ("merged_speaker_references") to try the "sound_of_silence_single_utt" function, but it just throws the same error... @Flux9665 could the problem be a function from a dependency, used in "set_utterance_embedding", that changed in an update?

Flux9665 commented 1 week ago

That's what I thought, so I changed the required version of speechbrain from ~= to ==, but it didn't fix the issue for you. I'm not sure where this issue is coming from, since I cannot reproduce it, which makes it very hard to figure out.

Can you double check which version of speechbrain you have installed? And if it is 0.5.13, as the requirements demand, can you check if an earlier or later version from this list fixes the problem? https://pypi.org/project/speechbrain/#history
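A quick way to answer the version question is to read the installed package metadata from inside the same environment that runs Toucan. The sketch below uses only the standard library; the package list is an assumption based on the libraries named in this thread, and `installed_versions` is a hypothetical helper name.

```python
from importlib import metadata

# Packages discussed in this thread; extend the tuple as needed.
PACKAGES = ("speechbrain", "torch", "torchaudio", "torchvision")


def installed_versions(packages=PACKAGES):
    """Map each package name to its installed version, or None if it is absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```

In a Colab notebook this can be pasted into a cell after the install cell has finished, so that it reports what is actually importable rather than what was requested.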

lpscr commented 1 week ago

It works fine for me in both Colab and Windows; I tested both with the LJ-Speech sample. I'm not sure why you get this error. Maybe the audio is too short or too long, or it needs resampling; I can't tell without seeing the audio file, because it works fine for me in both environments. Here are my tests.

Windows: [image]

With the code I gave you, you first need to create the missing folder IMS-Toucan/audios/speaker_references and put the reference file inside it; then it works fine.

Then make a new cell and run this: !python run_text_to_file_reader.py

Colab: [image]

fcrescio commented 1 week ago

I had the same error: the problem is the format of the reference WAV. In my case, the file wasn't mono but 5.1 surround. Once I converted it to mono, the error disappeared (Docker Linux under Windows 11).
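To check whether a reference file is affected, its channel count can be read from the WAV header before it is passed to the TTS interface. This is a minimal sketch using Python's standard `wave` module; `describe_wav` is a hypothetical helper name, and the module only reads uncompressed PCM files, so other encodings may raise `wave.Error` instead of returning a result.

```python
import wave


def describe_wav(path):
    """Return basic header information for an uncompressed PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        n_frames = wav_file.getnframes()
        sample_rate = wav_file.getframerate()
        return {
            "channels": wav_file.getnchannels(),
            "sample_rate": sample_rate,
            "sample_width_bytes": wav_file.getsampwidth(),
            "duration_seconds": n_frames / sample_rate,
        }


# Example (the path is a placeholder):
# info = describe_wav("audios/speaker_references/my_reference.wav")
# if info["channels"] != 1:
#     print("Reference is not mono:", info)
```

A channel count other than 1 means the file matches the case described above and needs to be downmixed.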

Flux9665 commented 1 week ago

That's an easy fix then, I just added librosa.to_mono after the audio is loaded. Thanks @fcrescio for figuring it out!

@AlexSteveChungAlvarez please test with the most recent commit and let me know if it works now.

lpscr commented 1 week ago

Just a quick question, @Flux9665: I had assumed the reference audio was converted automatically, which is why I didn't mention converting it to mono. Also, should the sample rate be 24000 or 16000 to get the best reference?

Also, can I use emotional audio? For example, if I use audio of someone speaking sadly or happily, will the output speak like that?

AlexSteveChungAlvarez commented 1 week ago

This is the WAV audio I'm using as a reference: https://drive.google.com/file/d/17_zvpvStcU3Yix2hDiPNTwRWqeuLRAfC/view?usp=sharing

Now this error is raised:

running on cuda
d:\IMS-Toucan\.toucan_multi\lib\site-packages\torch\nn\utils\weight_norm.py:30: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
d:\IMS-Toucan\.toucan_multi\lib\site-packages\torch\functional.py:650: UserWarning: stft with return_complex=False is deprecated. In a future pytorch release, stft will return complex tensors for all inputs, and return_complex=False will raise an error.
Note: you can still call torch.view_as_real on the complex output to recover the old return format. (Triggered internally at ..\aten\src\ATen\native\SpectralOps.cpp:868.)
  return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
Traceback (most recent call last):
  File "d:\IMS-Toucan\.toucan_multi\lib\site-packages\speechbrain\lobes\models\ECAPA_TDNN.py", line 488, in forward
    x = layer(x, lengths=lengths)
  File "d:\IMS-Toucan\.toucan_multi\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "d:\IMS-Toucan\.toucan_multi\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
TypeError: TDNNBlock.forward() got an unexpected keyword argument 'lengths'

During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "d:\IMS-Toucan\run_text_to_file_reader.py", line 100, in <module>
    sound_of_silence_single_utt(version="new_voc",
  File "d:\IMS-Toucan\run_text_to_file_reader.py", line 42, in sound_of_silence_single_utt
    read_texts(model_id=model_id,
  File "d:\IMS-Toucan\run_text_to_file_reader.py", line 12, in read_texts
    tts.set_utterance_embedding(speaker_reference)
  File "d:\IMS-Toucan\InferenceInterfaces\ToucanTTSInterface.py", line 110, in set_utterance_embedding
    speaker_embedding = self.speaker_embedding_func_ecapa.encode_batch(wavs=wave.to(self.device).squeeze().unsqueeze(0)).squeeze()
  File "d:\IMS-Toucan\.toucan_multi\lib\site-packages\speechbrain\pretrained\interfaces.py", line 830, in encode_batch     
    embeddings = self.mods.embedding_model(feats, wav_lens)
  File "d:\IMS-Toucan\.toucan_multi\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl        
    return self._call_impl(*args, **kwargs)
  File "d:\IMS-Toucan\.toucan_multi\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "d:\IMS-Toucan\.toucan_multi\lib\site-packages\speechbrain\lobes\models\ECAPA_TDNN.py", line 490, in forward        
    x = layer(x)
  File "d:\IMS-Toucan\.toucan_multi\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl        
    return self._call_impl(*args, **kwargs)
  File "d:\IMS-Toucan\.toucan_multi\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "d:\IMS-Toucan\.toucan_multi\lib\site-packages\speechbrain\lobes\models\ECAPA_TDNN.py", line 81, in forward
    return self.norm(self.activation(self.conv(x)))
  File "d:\IMS-Toucan\.toucan_multi\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl        
    return self._call_impl(*args, **kwargs)
  File "d:\IMS-Toucan\.toucan_multi\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "d:\IMS-Toucan\.toucan_multi\lib\site-packages\speechbrain\nnet\CNN.py", line 420, in forward
    x = self._manage_padding(
  File "d:\IMS-Toucan\.toucan_multi\lib\site-packages\speechbrain\nnet\CNN.py", line 472, in _manage_padding
    x = F.pad(x, padding, mode=self.padding_mode)
RuntimeError: Argument #4: Padding size should be less than the corresponding input dimension, but got: padding (2, 2) at dimension 2 of input [1, 80, 1]

> Can you double check which version of speechbrain you have installed? And if it is 0.5.13, as the requirements demand, can you check if an earlier or later version from this list fixes the problem? https://pypi.org/project/speechbrain/#history

Yes @Flux9665, speechbrain is 0.5.13. I've just tried the later versions up to the latest release, and they all throw the same error. If I downgrade speechbrain instead, then audioseal and CUDA would need to be downgraded too, to versions compatible with pytorch <1.13.

AlexSteveChungAlvarez commented 1 week ago

> Just a quick question, @Flux9665: I had assumed the reference audio was converted automatically, which is why I didn't mention converting it to mono. Also, should the sample rate be 24000 or 16000 to get the best reference?

There's a resampling step that automatically converts the audio to 16 kHz, so the input sample rate doesn't matter.

> Also, can I use emotional audio? For example, if I use audio of someone speaking sadly or happily, will the output speak like that?

Version 1 didn't have emotion controllability, and the version 2 demo doesn't either. You would have to fine-tune the model on emotion-specific datasets or add an "emotion predictor" network to the model, which would let you control this as you want, @lpscr. You could also play with the pitch, energy and speed predictors and perhaps map specific combinations of them to an emotion.
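The last suggestion could be sketched as a lookup table of prosody settings. The keyword names are the ones passed to `read_to_file` in the notebook cells in this thread; the style names and every numeric value other than the neutral 1.0 defaults are untested guesses for illustration, and `prosody_kwargs` is a hypothetical helper.

```python
# Hypothetical presets: only the "neutral" values come from the thread;
# the other numbers are illustrative guesses and have not been evaluated.
PROSODY_PRESETS = {
    "neutral": {
        "duration_scaling_factor": 1.0,
        "energy_variance_scale": 1.0,
        "pitch_variance_scale": 1.0,
    },
    "lively": {
        "duration_scaling_factor": 0.9,
        "energy_variance_scale": 1.3,
        "pitch_variance_scale": 1.4,
    },
    "subdued": {
        "duration_scaling_factor": 1.15,
        "energy_variance_scale": 0.7,
        "pitch_variance_scale": 0.6,
    },
}


def prosody_kwargs(style):
    """Return a copy of the keyword arguments for the requested style."""
    try:
        return dict(PROSODY_PRESETS[style])
    except KeyError:
        choices = ", ".join(sorted(PROSODY_PRESETS))
        raise ValueError(f"unknown style {style!r}; choose from: {choices}") from None


# Usage with the interface from the notebook cells (not executed here):
# tts.read_to_file([text], "output.wav", **prosody_kwargs("lively"))
```

Scaling the variances only changes how expressive the prosody is, so this is a rough approximation and not real emotion control.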

lpscr commented 1 week ago

Hi @Flux9665, it's still not working; you still need to convert the file manually for it to work. I use pydub for that, see cell 2.

@AlexSteveChungAlvarez here is an updated version that converts the audio to mono at 16000 Hz:

cell1

import glob
import IPython.display as ipd
import os

!git clone https://github.com/DigitalPhonetics/IMS-Toucan.git
%cd IMS-Toucan

!pip install dragonmapper pypinyin wandb dotwiz pyloudnorm einops speechbrain==0.5.13 torch_complex praat-parselmouth transphone jamo g2pk
!python run_model_downloader.py
!pip install gradio

!apt-get -y install python-espeak
!apt-get -y install espeak-ng
!pip install py-espeak-ng
!pip install phonemizer

!pip install sounddevice
!apt-get install libportaudio2
!pip install audioseal
!pip install pydub

filename = None

cell2

import os
import warnings
from InferenceInterfaces.ToucanTTSInterface import ToucanTTSInterface
from Utility.storage_config import MODELS_DIR

import numpy as np
import IPython.display as ipd
import soundfile as sf

from google.colab import files
from pydub import AudioSegment

upload_new_audio = True # @param {type:"boolean"}
text="hi how are you today" # @param {type:"string"}
lang_id="eng"  # @param {type:"string"}

if upload_new_audio:
   uploaded = files.upload()

   for filename in uploaded.keys():
       with open(filename, 'wb') as f:
            f.write(uploaded[filename])

       # convert the audio to mono at a 16000 Hz sample rate
       sound = AudioSegment.from_file(filename, format="wav")
       sound = sound.set_channels(1)
       sound = sound.set_frame_rate(16000)
       sound.export(filename, format="wav")

warnings.filterwarnings("ignore", category=UserWarning)
device = "cpu"

file_model = os.path.join(MODELS_DIR, "ToucanTTS_Meta", "best.pt")
tts = ToucanTTSInterface(device=device, tts_model_path=file_model)
tts.set_language(lang_id=lang_id)

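# Only condition on a reference speaker if a file was actually uploaded.
wav_ref = None
if filename is not None:
    wav_ref = filename
    tts.set_utterance_embedding(wav_ref)

tts.read_to_file([text], "output.wav", duration_scaling_factor=1.0, energy_variance_scale=1.0, pitch_variance_scale=1.0, glow_sampling_temperature=0.2)
if wav_ref is not None:
    # Skip the reference player when there is no reference audio.
    print("Ref")
    ipd.display(ipd.Audio(wav_ref))
print("Gen")
ipd.display(ipd.Audio("output.wav"))
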
AlexSteveChungAlvarez commented 3 days ago

> This is the WAV audio I'm using as a reference: https://drive.google.com/file/d/17_zvpvStcU3Yix2hDiPNTwRWqeuLRAfC/view?usp=sharing
>
> Now this error is raised:

@Flux9665 were you able to test with my audio sample?

Flux9665 commented 3 days ago

As mentioned above, it needs to be converted from stereo to mono, then it works.

The problem is that for some reason the time axis and the channel axis in this audio are switched. I wrote a check to detect this and switch the axes if that's the case, so to_mono gets the shape it expects.
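The axis check described here can be sketched roughly as follows. This is not the code from the repository: `to_mono_channels_first` is a hypothetical helper, and the heuristic assumes a clip always has more samples than channels. For context, `soundfile.read` returns multichannel audio as (frames, channels) while `librosa.to_mono` expects the channel axis first, so averaging over the wrong axis leaves only a handful of samples, which is consistent with the `[1, 80, 1]` input reported in the traceback earlier in the thread.

```python
import numpy as np


def to_mono_channels_first(audio):
    """Downmix audio to mono regardless of which axis holds the channels.

    Assumption: the longer axis of a 2-D array is the time axis.
    """
    audio = np.asarray(audio, dtype=np.float32)
    if audio.ndim == 1:
        return audio  # already mono
    if audio.ndim != 2:
        raise ValueError(f"expected 1-D or 2-D audio, got shape {audio.shape}")
    if audio.shape[0] > audio.shape[1]:
        # (samples, channels) -> (channels, samples)
        audio = audio.T
    # Average the channels, keeping the time axis.
    return audio.mean(axis=0)
```

The result is always a 1-D array whose length equals the number of samples, whichever layout the loader produced.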

AlexSteveChungAlvarez commented 3 days ago

Thank you @Flux9665, now it works well on Windows! Congratulations on the new version of Toucan. I saw that you ended up applying the AdaIn layer I talked to you about last year, and that you were able to include not only Quechua but also Aymara among the languages; as I mentioned to you by email, I was working with both of them last year too. It's impressive and ironic that you found an Aymara speaker while being far away from the language's region of origin, while I, being closer, haven't found one. What surprised me about this release was that your dataset is not of the best quality, since you once suggested LibriTTS-R to me as an example of high-quality data I could play with. But that is a topic for a later discussion; maybe you could open a Q&A section in the repository for it.

Flux9665 commented 3 days ago

Luckily, my collaborators on that paper have worked with the Aymara language before and are in contact with speakers, so we could get their help with the evaluation. And yes, the quality is still a big issue. In the next week I hope to release an updated version that sounds a bit better, alongside other features. Mismatches in the labels, incorrect alignments and other problems in this large cascade add up and cause major problems unless the data is of very high quality and the phonemizer performs near-perfectly on the given language.