DigitalPhonetics / IMS-Toucan

Controllable and fast Text-to-Speech for over 7000 languages!

How to read scorer output? #137

Ca-ressemble-a-du-fake closed this issue 3 months ago

Ca-ressemble-a-du-fake commented 1 year ago

Hi,

I want to check what the scorer has to say about my dataset and why it keeps only 77 samples out of 98 (which all sound fine to me).

But I don't know how to interpret its output. I listened to the flagged samples and nothing was wrong, except occasional repetitions (which are transcribed in the labels).

/home/caraduf/ToucanTTS/toucanenv/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
  warn(f"Failed to load image Python extension: {e}")
100%|█████████████████████████████████████████████████████████| 87/87 [00:04<00:00, 18.72it/s]
Loss: 5.527 - Path: /home/caraduf/Datasets/LISA_16kHz/wavs/00194.wav
Loss: 5.187 - Path: /home/caraduf/Datasets/LISA_16kHz/wavs/00255.wav
Loss: 4.923 - Path: /home/caraduf/Datasets/LISA_16kHz/wavs/00139.wav
Loss: 4.701 - Path: /home/caraduf/Datasets/LISA_16kHz/wavs/00236.wav
Loss: 4.435 - Path: /home/caraduf/Datasets/LISA_16kHz/wavs/00358.wav
Loss: 4.423 - Path: /home/caraduf/Datasets/LISA_16kHz/wavs/00364.wav
Loss: 4.252 - Path: /home/caraduf/Datasets/LISA_16kHz/wavs/00083.wav
Loss: 4.191 - Path: /home/caraduf/Datasets/LISA_16kHz/wavs/00172.wav
Loss: 4.127 - Path: /home/caraduf/Datasets/LISA_16kHz/wavs/00175.wav
Loss: 4.04 - Path: /home/caraduf/Datasets/LISA_16kHz/wavs/00176.wav
Loss: 3.999 - Path: /home/caraduf/Datasets/LISA_16kHz/wavs/00084.wav
Loss: 3.843 - Path: /home/caraduf/Datasets/LISA_16kHz/wavs/00362.wav
Loss: 3.836 - Path: /home/caraduf/Datasets/LISA_16kHz/wavs/00117.wav
Loss: 3.794 - Path: /home/caraduf/Datasets/LISA_16kHz/wavs/00145.wav
Loss: 3.782 - Path: /home/caraduf/Datasets/LISA_16kHz/wavs/00068.wav
Loss: 3.725 - Path: /home/caraduf/Datasets/LISA_16kHz/wavs/00121.wav
Loss: 3.722 - Path: /home/caraduf/Datasets/LISA_16kHz/wavs/00112.wav
Loss: 3.722 - Path: /home/caraduf/Datasets/LISA_16kHz/wavs/00218.wav
Loss: 3.67 - Path: /home/caraduf/Datasets/LISA_16kHz/wavs/00159.wav
Loss: 3.575 - Path: /home/caraduf/Datasets/LISA_16kHz/wavs/00197.wav
Prepared a FastSpeech dataset with 82 datapoints in Corpora/LISA_32.
100%|█████████████████████████████████████████████████████████| 82/82 [00:15<00:00,  5.44it/s]
Loss: 1.819 - Path: /home/caraduf/Datasets/LISA_16kHz/wavs/00193.wav
Loss: 1.798 - Path: /home/caraduf/Datasets/LISA_16kHz/wavs/00358.wav
Loss: 1.788 - Path: /home/caraduf/Datasets/LISA_16kHz/wavs/00363.wav
Loss: 1.787 - Path: /home/caraduf/Datasets/LISA_16kHz/wavs/00303.wav
Loss: 1.768 - Path: /home/caraduf/Datasets/LISA_16kHz/wavs/00139.wav
Loss: 1.758 - Path: /home/caraduf/Datasets/LISA_16kHz/wavs/00194.wav
Loss: 1.74 - Path: /home/caraduf/Datasets/LISA_16kHz/wavs/00204.wav
Loss: 1.66 - Path: /home/caraduf/Datasets/LISA_16kHz/wavs/00121.wav
Loss: 1.646 - Path: /home/caraduf/Datasets/LISA_16kHz/wavs/00360.wav
Loss: 1.576 - Path: /home/caraduf/Datasets/LISA_16kHz/wavs/00114.wav
Loss: 1.572 - Path: /home/caraduf/Datasets/LISA_16kHz/wavs/00201.wav
Loss: 1.554 - Path: /home/caraduf/Datasets/LISA_16kHz/wavs/00068.wav
Loss: 1.541 - Path: /home/caraduf/Datasets/LISA_16kHz/wavs/00362.wav
Loss: 1.514 - Path: /home/caraduf/Datasets/LISA_16kHz/wavs/00113.wav
Loss: 1.512 - Path: /home/caraduf/Datasets/LISA_16kHz/wavs/00142.wav
Loss: 1.506 - Path: /home/caraduf/Datasets/LISA_16kHz/wavs/00300.wav
Loss: 1.499 - Path: /home/caraduf/Datasets/LISA_16kHz/wavs/00070.wav
Loss: 1.479 - Path: /home/caraduf/Datasets/LISA_16kHz/wavs/00002.wav
Loss: 1.462 - Path: /home/caraduf/Datasets/LISA_16kHz/wavs/00178.wav
Loss: 1.435 - Path: /home/caraduf/Datasets/LISA_16kHz/wavs/00112.wav

Dataset updated!
Prepared a FastSpeech dataset with 77 datapoints in Corpora/LISA_32.
  0%|                                                                  | 0/77 [00:00<?, ?it/s]/home/caraduf/ToucanTTS/toucanenv/lib/python3.10/site-packages/torch/nn/modules/module.py:1501: UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters(). (Triggered internally at ../aten/src/ATen/native/cudnn/RNN.cpp:982.)
  return forward_call(*args, **kwargs)
100%|█████████████████████████████████████████████████████████| 77/77 [00:14<00:00,  5.21it/s]

Why is the number of datapoints decreasing (87, 82, 77)? Will I get better results by setting this heuristic removal to False (because I would get 20 more samples)?

Flux9665 commented 1 year ago

Yes, the heuristic removal is meant for large-scale datasets, which are usually either crowdsourced or automatically collected and split. There is usually a large portion of mistakes in them, and TTS heavily relies on clean data. Above a certain number of samples (maybe above 1000 I'd say, but that's just a very rough estimate), it's better to have cleaner data than more data.

For your tiny dataset, it's most likely better to turn off this heuristic selection. I added something to the code so that it is skipped for datasets that are too small. If you see a sample in the scorer output whose loss is much higher than the other samples in the top k, then there is usually something wrong with that sample (I often see cases where the loss is 3 times higher for one sample than for any other sample).
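To make that "much higher than the rest" rule concrete, here is a minimal sketch of such an outlier check. The function name and the `ratio` parameter are illustrative assumptions, not IMS-Toucan's actual API; `scores` just mirrors the (loss, path) pairs the scorer prints.

```python
import statistics

# Hypothetical sketch (not IMS-Toucan's actual API): flag samples whose loss
# is several times the median loss, i.e. the kind of "3x higher than anything
# else" outlier described above.
def flag_outliers(scores, ratio=3.0):
    """scores: list of (loss, wav_path) tuples. Returns the suspicious ones."""
    median_loss = statistics.median(loss for loss, _ in scores)
    return [(loss, path) for loss, path in scores if loss >= ratio * median_loss]

scores = [(5.527, "wavs/00194.wav"), (1.74, "wavs/00204.wav"), (1.66, "wavs/00121.wav")]
print(flag_outliers(scores))  # [(5.527, 'wavs/00194.wav')]
```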

You also have to consider proper nouns, technical terms, mixed language use, or code-switching in those cases: even though the speech and the text may look like they fit together, the phonemizer might not know how to phonemize some words, so there is a mismatch between the speech and the phonemes even though there wasn't one between the speech and the text. But if you are certain that all samples in your dataset are high quality (and for fewer than 100 you can definitely check every single one manually), then the heuristic selection should most likely be turned off.
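One way to spot such phonemization mismatches is to run suspicious lines through a phonemizer and eyeball the result. A minimal sketch using the standalone `phonemizer` package with the espeak backend (an assumption for illustration; IMS-Toucan uses its own text frontend internally):

```python
# Sanity-check how foreign terms get phonemized, using the standalone
# `phonemizer` package (espeak backend) -- an assumption, not the frontend
# IMS-Toucan actually uses internally.
from phonemizer import phonemize

line = "Le framework TensorFlow est rapide."  # hypothetical mixed-language transcript
print(phonemize(line, language='fr-fr', backend='espeak'))
# If an English term comes out as implausible French phones, the phoneme
# sequence no longer matches what the speaker actually said.
```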

Ca-ressemble-a-du-fake commented 1 year ago

Ok, thanks for your reply. There are some English terms sometimes, but I transcribed them as they are pronounced in French.

I'm not sure I understand how the scorer works. Does it transcribe the samples with STT, compare the result to the transcriptions, and finally output a Levenshtein distance? If so, it is not 100% reliable, since even OpenAI Whisper makes mistakes. In that case the heuristic removal should be disabled.

What do you think?

Flux9665 commented 1 year ago

No, the alignment scorer ranks samples by the loss of the aligner on that sample (CTC loss), and the TTS scorer ranks samples by the loss of the TTS itself (L1 distance). Selecting based on CTC loss is not particularly well grounded, because sequence length may have an impact, but if the loss is significantly larger than the average, there is usually some imperfection that goes beyond the variance expected from signal length alone. Samples whose loss is double or more the next highest loss in the ranking should prompt further manual investigation. In large datasets it's better to throw a thousand datapoints away if that means getting rid of just ten problematic ones. In small datasets, every single datapoint is important. So for small datasets, the selection should be disabled and the data should be checked manually. For large datasets, it should be enabled for most data sources.
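As a rough illustration of that ranking-and-trimming idea, here is a minimal sketch; the function name, the `keep_fraction` parameter, and the sample values are hypothetical, not the actual IMS-Toucan code.

```python
# Hypothetical sketch of loss-based ranking and trimming, as described above.
def rank_and_trim(per_sample_losses, keep_fraction=0.9):
    """per_sample_losses: dict mapping wav path -> scalar loss (e.g. CTC or L1).
    Drops the highest-loss fraction and returns the paths to keep."""
    ranked = sorted(per_sample_losses.items(), key=lambda kv: kv[1])
    cutoff = max(1, int(len(ranked) * keep_fraction))
    return [path for path, _ in ranked[:cutoff]]

losses = {"00194.wav": 5.527, "00204.wav": 1.74, "00121.wav": 1.66, "00068.wav": 1.554}
print(rank_and_trim(losses, keep_fraction=0.75))  # keeps the 3 lowest-loss files
```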

Ca-ressemble-a-du-fake commented 1 year ago

Ok, thank you. It's like using a sledgehammer to crack a nut! I will prepare a new dataset and try your new model when it's ready!

Ca-ressemble-a-du-fake commented 1 year ago

That's really weird. I restarted the computer for another reason, and now it keeps all 87 datapoints (without my touching the dataset at all). I may have messed up the virtual environment (I use two venvs simultaneously, one for training and one for debugging, and I may have updated Toucan in only one of them => a mess).