e-c-k-e-r / vall-e

An unofficial PyTorch implementation of VALL-E
GNU Affero General Public License v3.0

Emilia dataset #2

Open kunibald413 opened 2 months ago

kunibald413 commented 2 months ago

have you seen this dataset? maybe it's better suited for zero-shot task, more natural speech than audiobook

https://github.com/open-mmlab/Amphion/blob/main/preprocessors/Emilia/README.md

e-c-k-e-r commented 2 months ago

101k hours

Hot damn. I think my own collection caps out at ~14K hours between pieces of LibriSpeech and audiobooks, and a very, very small smattering from videogames. Even the smaller languages being >1K hours is a huge help, since the biggest hurdle for me was trying to even find a large enough corpus to piece together for my own dataset for just Japanese.

2.4TB

Daunting, but I'll see if I have some spare spinning rust I can store it on and pick at it. If anything I might just start with the smaller languages to squeeze out some of the multi-lingual-ness that I keep meaning to get around to.

Having transcriptions already helps a ton, as half of the pain with audio processing is the time spent waiting to transcribe. The other half, having to quantize it all through EnCodec, is still a bit of a pain, but I think my recent setup of being able to split the work across GPUs should help.
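A minimal sketch of the "split across GPUs" idea, not the repo's actual code: it only shows how a list of audio paths could be divided into one shard per worker. The function name and the file names are hypothetical, and the EnCodec call itself is left out; each worker process would quantize its own shard on its own device.

```python
def shard_round_robin(paths, num_workers):
    """Split a list of audio paths into num_workers roughly equal shards."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    # Worker i takes every num_workers-th path, starting at index i.
    return [paths[i::num_workers] for i in range(num_workers)]


# Hypothetical file names; one shard per GPU.
shards = shard_round_robin([f"utt_{i:04d}.wav" for i in range(10)], num_workers=4)
for gpu_index, shard in enumerate(shards):
    print(gpu_index, shard)
```

Round-robin keeps the shards within one item of each other in size; if utterance lengths vary a lot, balancing by total duration instead of by count may be the better choice.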


Having more audio in general, and especially non-monotonous utterances, should help a ton. I'm pretty sure I already hit immense diminishing returns after an epoch.

Appreciated. I'll see what I can pick at over the next few days while my spark hasn't waned yet once again.

kunibald413 commented 2 months ago

doing large preps and trains can be soul-crushing. passion and a little bit of compulsion can keep you going, but i hope you don't burn yourself out.

e-c-k-e-r commented 2 months ago

doing large preps and trains can be soul-crushing.

It's not so bad this go-around. It used to be agonizing with system instability (segfaults or hard reboots with anything under PyTorch) on my original training system with my 4070Ti. Swapping to my 7900XTX almost entirely resolved the problems for dataset preparation and non-important training.

The estimated week-and-a-half wait for the dataset to process is always a good time for any last-minute checks or ideas to get added; for example: something to solve my dilemma of subpar zero-shot performance that may stem from my naive prompt sampling (for months I've been entertaining the idea of using a vector store for input prompts to sample the closest utterance for a given training sample).
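To make the vector-store idea concrete, here is a rough sketch under stated assumptions: every utterance already has some fixed-length embedding (how it is computed is not specified in this thread), and "closest" means highest cosine similarity. The ids, vectors, and function names are made up for illustration, and a real store would use an indexed search rather than this linear scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def closest_prompt(target_id, embeddings):
    """Return the id of the utterance most similar to target_id, excluding itself."""
    target = embeddings[target_id]
    candidates = [uid for uid in embeddings if uid != target_id]
    if not candidates:
        return None
    return max(candidates, key=lambda uid: cosine_similarity(target, embeddings[uid]))


# Hypothetical embeddings for three utterances.
embeddings = {
    "utt_a": [1.0, 0.0, 0.2],
    "utt_b": [0.9, 0.1, 0.3],
    "utt_c": [0.0, 1.0, 0.0],
}
print(closest_prompt("utt_a", embeddings))
```

In practice the candidates would presumably be restricted to the same speaker group as the training sample, so the prompt stays a valid reference for that voice.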

passion and a little bit of compulsion can keep you going, but i hope you don't burn yourself out.

I think I'm beyond the hyperfixate-then-burnout-then-hyperfixate cycles I kept getting myself into; now there just seem to be lulls in progress until I get a wild hair and fiddle with a new idea (for example, the STT task looks quite promising for reducing the loss overall, so I hope putting emphasis on more languages will in turn help the model overall). The urgency and dread of trying to push out a decent model went away by the time I pushed out a quasi-decent ar+nar-llama-8 model with my 4xV100 system: it's rather quick to churn out a model with good results, and I don't need to keep bruteforcing additional epochs for very little gain like I did with the ar+nar-retnet-8.


Although again, that 50K hours of English audio is still daunting. I think the best approach is to download the N-smallest .tars to prioritize more speakers over speakers with a lot of utterances (unfortunately, a lot of datasets seem to prioritize the latter over the former).
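Picking the N-smallest archives could be sketched like this, assuming the archive sizes are already known (for example from a file listing). The archive names and sizes below are invented, and the helper name is hypothetical.

```python
import heapq


def n_smallest_archives(archive_sizes, n):
    """Return the names of the n smallest archives, smallest first."""
    smallest = heapq.nsmallest(n, archive_sizes.items(), key=lambda item: item[1])
    return [name for name, _ in smallest]


# Hypothetical archive names mapped to sizes in megabytes.
archive_sizes = {"a.tar": 5, "b.tar": 2, "c.tar": 9, "d.tar": 1}
print(n_smallest_archives(archive_sizes, 2))
```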

e-c-k-e-r commented 2 months ago

Should have everything prepared for the next training session. Detailing the below mostly for my sake to have it in writing somewhere:

As for the actual dataset to train against, I think I'm going to:

I'll just resume training from the existing ar+nar-tts+stt-llama-8 weights since I don't think I need to restart from scratch (as the model is still rather malleable from all the other tweaks I've glued on), but keep the same dataset sampling method of "sort by duration with a fixed batch size, and let it go from smallest to largest utterances" (as I still do not trust my batch-by-duration-size method to be stable).
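The "sort by duration with a fixed batch size" sampling can be illustrated with a short sketch. This is an assumption about the general shape of the method, not the trainer's actual sampler; the ids and durations are invented.

```python
def batches_by_duration(durations, batch_size):
    """Yield fixed-size batches of utterance ids, shortest utterances first."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # Order ids by their duration in seconds, ascending.
    ordered = sorted(durations, key=durations.get)
    for start in range(0, len(ordered), batch_size):
        yield ordered[start:start + batch_size]


# Hypothetical utterance ids mapped to durations in seconds.
durations = {"a": 3.0, "b": 1.0, "c": 2.0, "d": 9.5, "e": 0.5}
for batch in batches_by_duration(durations, batch_size=2):
    print(batch)
```

Because every batch holds the same number of utterances, memory use grows as training reaches the longer utterances, whereas a batch-by-total-duration scheme would shrink the item count per batch instead.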

My hope is that the Emilia dataset is the answer to my problems. As time goes on I'm more pilled on having a very, very good dataset rather than a large one, and I feel the big crux of these TTS systems is having a sloppy dataset. If results look promising then that'll be my push to deal with processing the rest of the dataset.


One huge oversight I made is that there are ~400k allegedly-unique speakers among the portion of the dataset I collected. A bit of a pain since I made the assumption that each group was its own speaker, so I have to work around juggling that many speakers.

e-c-k-e-r commented 2 months ago

I think it's promising? A few user-error hiccups:

My expectations are pretty good. I think my only regret is throwing in too many changes at once, again (a handful of different languages, the "use the most similar utterance" feature, more STT). It's hard to gauge what really helped, but I can't complain if it all helped together.

Tomorrow I should have some time to muck around with the model and see (hear) if it's as good as it looks.


I botched the duration "fix" post-training with an old copy of the tokenizer from July (which shouldn't affect things but a few missing phonemes might cause issues with it training those phonemes against ), but the few results I tested are very pleasing in actually following the prompted speaker, at least for the couple of voices I tested. I uploaded the "botched" model to https://huggingface.co/ecker/vall-e/. I should have it fixed by tomorrow (the 25th).


mmm... I had to go back to my 4xV100 system for the duration-post-fix training; ROCm is just being too much of a pill. I think I still need to bake it more since it only had a few hours, sadly (I only thought to use my 4xV100s towards the evening). My notes so far:

I pushed the weights to the HF repo, but I think I need to set aside a good day to let the post-fix training carry out, since I feel like 40% of outputs have extra junk at the end from the stop token taking longer to pop up. Hopefully that can also help fill the gaps for voices it's not so good at, if I elect to sample by speaker rather than by paths. It definitely has potential, but it falling apart on regular people's voices gives me doubts.
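For reference, the difference between sampling by speaker and sampling by paths can be sketched as follows. This is only an illustration of the two strategies under the assumption that utterances are grouped per speaker; the speaker names, paths, and function names are hypothetical.

```python
import random


def sample_by_speaker(utterances_by_speaker, rng):
    """Pick a speaker uniformly, then one of that speaker's utterances."""
    speaker = rng.choice(sorted(utterances_by_speaker))
    return speaker, rng.choice(utterances_by_speaker[speaker])


def sample_by_path(utterances_by_speaker, rng):
    """Pick uniformly over all utterances, so prolific speakers dominate."""
    pairs = [(s, p) for s, paths in utterances_by_speaker.items() for p in paths]
    return rng.choice(pairs)


# Hypothetical data: one speaker with 98 utterances, one with only 2.
data = {
    "big": [f"big/{i}.wav" for i in range(98)],
    "small": ["small/0.wav", "small/1.wav"],
}
rng = random.Random(0)
draws = 2000
by_speaker = sum(sample_by_speaker(data, rng)[0] == "small" for _ in range(draws)) / draws
by_path = sum(sample_by_path(data, rng)[0] == "small" for _ in range(draws)) / draws
print(f"share of 'small' when sampling by speaker: {by_speaker:.2f}")
print(f"share of 'small' when sampling by path: {by_path:.2f}")
```

Sampling by speaker gives rarely-seen voices far more training exposure, at the cost of repeating their few utterances more often.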

e-c-k-e-r commented 2 months ago

I guess I'll give some final thoughts.

For Emilia specifically:

Would definitely recommend using it for any speech dataset. For the size I used, it performed well.

Now, for the model specifically:

I'm pretty sure this won't be my final touches with the model, but until I get another breakthrough (be it another dataset or a training technique, like I did here), these should be my final thoughts on the model itself. The two core issues it seems to have now are reduced quality / artifacting from the NAR, and some voices not mapping accurately and precisely enough. The former requires more post-training and hoping I can prioritize the NAR more without lobotomizing the AR; for the latter I don't really have much of an idea for a fix beyond more post-training too.

That aside, I'll try and get the demo page updated with the currently performing outputs when I do my finishing touches. I tried doing it the other day and it seemed mostly fine, but it struggled with some speakers.