e-c-k-e-r / vall-e

An unofficial PyTorch implementation of VALL-E
GNU Affero General Public License v3.0

Emilia dataset #2

Open kunibald413 opened 1 week ago

kunibald413 commented 1 week ago

have you seen this dataset? maybe it's better suited for the zero-shot task, more natural speech than audiobooks

https://github.com/open-mmlab/Amphion/blob/main/preprocessors/Emilia/README.md

e-c-k-e-r commented 1 week ago

101k hours

Hot damn. I think my own collection caps out at ~14K hours between pieces of LibriSpeech and audiobooks, and a very, very small smattering from videogames. Even the smaller languages being >1K hours is a huge help, since the biggest hurdle for me was even finding a large enough corpus to piece together for just Japanese.

2.4TB

Daunting, but I'll see if I have some spare spinning rust I can store it on and pick at it. If anything, I might just start with the smaller languages to squeeze out some of the multi-lingual-ness I keep meaning to get around to.

There already being transcriptions helps a ton, as half of the pain with audio processing is the time spent waiting to transcribe. The other half, having to quantize it all through EnCodec, is still a bit of a burden, but I think my recent setup of being able to split the work across GPUs should help.
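
For reference, a rough sketch of what that multi-GPU quantization split could look like, assuming the standard `encodec` package; the file list, shard count, and output naming here are placeholders, not the repo's actual pipeline:

```python
# Rough sketch (not the repo's actual pipeline): shard the file list by worker
# index and pin each worker process to its own GPU for EnCodec quantization.
import sys
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

def quantize_shard(paths, rank, world_size):
    device = f"cuda:{rank}"
    model = EncodecModel.encodec_model_24khz().to(device)
    model.set_target_bandwidth(6.0)
    for path in paths[rank::world_size]:  # every world_size-th file
        wav, sr = torchaudio.load(path)
        wav = convert_audio(wav, sr, model.sample_rate, model.channels)
        with torch.no_grad():
            frames = model.encode(wav.unsqueeze(0).to(device))
        codes = torch.cat([c for c, _ in frames], dim=-1)  # [1, n_q, T]
        torch.save(codes.cpu(), path + ".qnt.pt")  # placeholder output name

if __name__ == "__main__":
    rank = int(sys.argv[1])  # one process per GPU, e.g. 0 or 1
    files = [line.strip() for line in open("filelist.txt")]  # placeholder list
    quantize_shard(files, rank, world_size=2)
```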


Having more audio in general, and especially non-monotonous utterances, should help a ton. I'm pretty sure I already hit immense diminishing returns after an epoch.

Appreciated. I'll see what I can pick at over the next few days while my spark hasn't waned yet again.

kunibald413 commented 1 week ago

doing large preps and trains can be soul-crushing. passion and a little bit of compulsion can keep you going, but i hope you don't burn yourself out.

e-c-k-e-r commented 1 week ago

doing large preps and trains can be soul-crushing.

It's not so bad this go-around. It used to be agonizing with system instability (segfaults or hard reboots with anything under PyTorch) on my original training system with my 4070Ti. Swapping to my 7900XTX almost entirely resolved the problems for dataset preparation and non-important training.

The estimated week-and-a-half wait for the dataset to process is always a good time for any last-minute checks or ideas to get added; for example, something to solve my dilemma of subpar zero-shot performance, which may stem from my naive prompt sampling (for months I've been entertaining the idea of using a vector store for input prompts to sample the closest utterance for a given training sample).
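
To sketch that vector-store idea (hypothetical names, not anything in this repo): embed every utterance per speaker, then at sample time pick the prompt whose embedding is closest to the target utterance instead of a random one.

```python
# Minimal sketch of a per-speaker prompt store; class/method names are
# hypothetical, and embeddings are assumed to come from some speaker/utterance
# embedding model.
import torch

class PromptStore:
    def __init__(self):
        # speaker_id -> (tensor of embeddings [N, D], list of utterance paths)
        self.store = {}

    def add_speaker(self, speaker_id, embeddings, paths):
        self.store[speaker_id] = (torch.stack(embeddings), paths)

    def closest_prompt(self, speaker_id, target_embedding, exclude=None):
        embeddings, paths = self.store[speaker_id]
        # cosine similarity against every utterance from this speaker
        sims = torch.nn.functional.cosine_similarity(
            embeddings, target_embedding.unsqueeze(0), dim=-1
        )
        if exclude is not None and exclude in paths:
            sims[paths.index(exclude)] = -1.0  # don't return the target itself
        return paths[int(sims.argmax())]
```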

passion and a little bit of compulsion can keep you going, but i hope you don't burn yourself out.

I think I'm beyond the hyperfixate-then-burnout-then-hyperfixate cycles I kept getting myself into; now it just seems to be lulls in progress until I grow a wild hair and fiddle with a new idea (for example, the STT task has been quite promising for reducing the loss overall, so I hope putting emphasis on more languages will in turn help the model overall). The urgency and dread of trying to push out a decent model went away by the time I pushed out a quasi-decent ar+nar-llama-8 model with my 4xV100 system: it's rather quick to churn out a model with good results, and I don't need to keep brute-forcing additional epochs for very little gain like I did with the ar+nar-retnet-8.


Although, again, that 50K hours of English audio is still daunting. I think the best approach is to download the N smallest .tars to prioritize more speakers over speakers with a lot of utterances (unfortunately, a lot of datasets seem to prioritize the latter over the former).
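
As a rough sketch of that "N smallest .tars" selection, assuming a local mirror (or partial listing) of the shards; the path and N below are placeholders:

```python
# Hypothetical helper: sort the dataset's .tar shards by size and keep the N
# smallest, trading per-speaker volume for speaker variety.
from pathlib import Path

def n_smallest_tars(root: str, n: int) -> list[Path]:
    tars = sorted(Path(root).glob("**/*.tar"), key=lambda p: p.stat().st_size)
    return tars[:n]

# e.g. shards = n_smallest_tars("/mnt/emilia/EN", 32)
```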

e-c-k-e-r commented 3 hours ago

Should have everything prepared for the next training session. Detailing it below mostly for my own sake, to have it in writing somewhere:

As for the actual dataset to train against, I think I'm going to:

I'll just resume training from the existing ar+nar-tts+stt-llama-8 weights, since I don't think I need to restart from scratch (the model is still rather malleable from all the other tweaks I've glued on), but keep the same dataset sampling method of "sort by duration with a fixed batch size, and let it go from smallest to largest utterances" (as I still don't trust my batch-by-duration-size method to be stable).
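
For illustration, a minimal sketch of that "sort by duration, fixed batch size" scheme as a PyTorch batch sampler; this is not the repo's actual sampler, and it assumes per-index durations are already known:

```python
# Indices are ordered by utterance duration and batched in that order,
# smallest to largest, with a fixed batch size.
from torch.utils.data import Sampler

class DurationSortedBatchSampler(Sampler):
    def __init__(self, durations, batch_size):
        # durations: list of utterance lengths (seconds), one per dataset index
        self.order = sorted(range(len(durations)), key=lambda i: durations[i])
        self.batch_size = batch_size

    def __iter__(self):
        for start in range(0, len(self.order), self.batch_size):
            yield self.order[start:start + self.batch_size]

    def __len__(self):
        return (len(self.order) + self.batch_size - 1) // self.batch_size
```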

My hope is that the Emilia dataset is the answer to my problems. As time goes on I'm more pilled on having a very, very good dataset rather than a large one, and I feel the big crux of these TTS systems is having a sloppy dataset. If results look promising, that'll be my push to deal with processing the rest of the dataset.