Emilia dataset - Githubissues

have you seen this dataset? maybe it's better suited for zero-shot task, more natural speech than audiobook

https://github.com/open-mmlab/Amphion/blob/main/preprocessors/Emilia/README.md

101k hours

Hot damn. I think my own collection caps out at ~14K hours between pieces of LibriSpeech and audiobooks, and a very, very small smattering from videogames. Even the smaller languages being >1K hours is a huge help, since the biggest hurdle for me was trying to even find a large enough corpus to piece together for my own dataset for just Japanese.

2.4TB

Daunting, but I'll see if I have some spare spinning rust I can store it and pick at it. If anything I might just start with the smaller languages to squeeze out some multi-lingual-ness that I keep meaning to go about doing.

There being transcriptions already help a ton, as a half of the pain with audio processing is the time waiting to transcribe. The other half having to quantize it all through EnCodec is a bit of a boon still, but I think my recent setup of being able to split across GPUs should help.

Having more audio in general, and especially non-monotonous utterances, should help a ton. I'm pretty sure I already hit immense diminishing returns after an epoch.

Appreciated. I'll see what I can pick at over the next few days while my spark hasn't waned yet once again.

doing large preps and trains can be soul-crushing. passion and a little bit of compulsion can keep you going, but i hope you don't burn yourself out.

doing large preps and trains can be soul-crushing.

It's not so bad this go-around. It used to be agonizing with system instability (segfaults or hard reboots with anything under PyTorch) on my original training system with my 4070Ti. Swapping to my 7900XTX almost entirely resolved the problems for dataset preparation and non-important training.

The estimated week-and-a-half wait for the dataset to process is always a good time for any last-minute checks or ideas to get added; for example: something to solve my dillema of subpar zero-shot performance that may stem from my naive prompt sampling (for months I've been entertaining the idea of using a vector store for input prompts to sample the closest utterance for a given training sample).

passion and a little bit of compulsion can keep you going, but i hope you don't burn yourself out.

I think I'm beyond my hyperfixate-then-burnout-then-hyperfixate cycles I've kept getting myself under, just seems to be lulls in progress until I grow a wild hair and fiddle with a new idea (for example, the STT task being quite promising for reducing the loss overall, so I hope putting emphasis on more languages would in turn help the model overall). The urgency and dread of trying to push out a decent model went away by the time I pushed out a quasi-decent ar+nar-llama-8 model with my 4xV100 system: it's rather quick to churn out a model with good results, and I don't need to keep bruteforcing additional epochs for very little gains like I did with the ar+nar-retnet-8.

Although again that 50K hours of English audio is still daunting. I think the best approach is to download the N-smallest .tars to prioritize more speakers over speakers with a lot of utterances (which unfortunately a lot of datasets seem to prioritize the latter over the former).

Should have everything prepared for the next training session. Detailing the below mostly for my sake to have it in writing somewhere:

I added both the French and a portion of the English (tar files under I think 300MiB for 312 speakers) datasets, just to make sure I have enough data.
- I would have added the Korean one and some of the Chinese dataset too, but I didn't want to have to wrestle with the phonemizer to get it to work like I've had to with Japanese (although the more I think about it, Korean shouldn't have issues, and Chinese I can probably find a better phonemizer/procedure for). There's also the issue of updating the tokenizer, since Japanese gave some unique IPAs that I had to remap the last time I tried working with it.
(Theoretically) I have my "use the most similar utterance for a given sample" system added to hopefully better help zero-shot voice cloning. Theoretically, since I need to:
- Verify it works when integrated. It passes my mental model, and separate testing shows that it does "work".
- Verify that what it thinks is similar is adequate enough, since it's primarily just relying on cosine similarities of audio passed through the resp embedding (similar to inputs for STT).
- Verify that my tokenizer does contain all the phonemes (and not have a repeat of missing phonemes with the small Japanese portion of my dataset).
- Actually compute the similarity metadata, since I have yet to actually integrate it into the metadata generation procedure (I'm not looking forward to seeing how long it would take to process).
- Do the final processed data => HDF5 and shove it to my 4xV100 system (but there's been some weird Linux instability after a while and the entire file gets tainted when it's not closed properly, so I'll see what I can do about it).

As for the actual dataset to train against, I think I'm going to:

drop the donated audiobooks I've had to supplement my dataset, and LibriSpeech (small+medium+duplicate)
- The model was already trained a majority of the time against this, and I don't want a majority of this training session dedicated to a control dataset.
- I feel it would be a huge pain to compute the similarity metadata on speakers with a xboxhueg amount of utterances.
- Both of these pieces of my dataset are concerningly transcribed and trimmed from whole audio files, so I don't really trust how whisperX decides to slice it at.
- They're the most audiobook-y.
retain LibriTTS-R, as it's not sliced, already transcribed, and I feel is the cleanest besides Emilia.
retain all my other handpicked vidya-derived audio, since they're still rather small compared to the rest of the dataset, and they're also not audiobook-y.

I'll just resume training from the existing ar+nar-tts+stt-llama-8 weights since I don't think I need to restart from scratch (as the model is still rather malleable from all the other tweaks I've glued on), but have the same dataset sampling method of "sort by duration with a fix batch size, and let it go from smallest to largest utterances" (as I still do not trust my batch-by-duration-size method to be stable).

My hope is that the Emilia dataset is the answer to my problems. As time goes on I'm more pilled on having a very, very good dataset rather than a large one, and I feel the big crux of these TTS systems are having a sloppy dataset. If results looks promising then that'll be my push to deal with processing the rest of the dataset.

One huge oversight I made is that there's ~400k allegedly-unique speakers among the portion of the dataset I collected. A bit of a pain since I made the assumption each group was it's own speaker, so I have to work around having to juggle that many speakers.

I think it's promising? A few user-error hiccups:

initially started training on my 7900XTX, but had to scrap those weights:
- VRAM usage was sporadic, I assume just ROCm things, so training would cut off randomly during SDPA.
- some phonemes were being <unk>'d.
- the "pick the most similar utterance" code was wrong for a number of silly reasons, so it didn't benefit from that.
- despite this, the training loss seemed to be teetering towards the low 2.xs.
ironed out the kinks in the code, threw everything onto my 4xV100 machine:
- ...despite the first 3 second utterancess => 12 second utterances session used only one V100 for the entire 25 hours, so the model trained under 16 batch 4 gradaccum instead of 16 batch 4 gradaccum 4 GPUs (for smoother gradients). I don't think it's that* big of a deal.
- 12 second utterances onward trained properly, although I had to adjust from 16 batch 4 gradaccum to 8 batch 8 gradaccum.
the metrics are interesting:
- I'm not too sure what causes those little restarts (they show even without EMA smoothing), but up to 52k is the initial 3 seconds => 12 seconds batch (after that is the 12 seconds onwards).
- I wouldn't be surprised if it's just luck with the batch having the "right" amount of RVQ 0s (despite being 50% likely), stt tasks (despite being 25% likely), and LibriTTS-R utterances picked (despite making up ~7% of the dataset).
- the loss/accuracy seems promising at least.
- the evaluation/validation samples seem promising, although there seems to be something wrong with the evaluation dataloader giving me nothing but German utterances so far.
currently doing final training touches on my 7900XTX with sample_max_duration_batch + sample_shuffle (and 16 batch * 16 gradaccum) to "fix" the model (again) to work on a number of durations (a drawback from sorting by duration), and probably some benefit of training under bfloat16 instead of float16 + loss scaling (even though I feel the latter has "better" models to the former).

My expectations are pretty good. I think my only regret is throwing too many changes at once again (a handful of different languages, the "use the most similar utterance" feature, more STT). It's hard to gauge what really helped, but I can't complain if it all helped together.

~~Tomorrow I should have some time mucking with the model and seeing (hearing) if it's as good as it looks.~~

I botched the duration "fix" post-training with an old copy of the tokenizer from July (which shouldn't affect things but a few missing phonemes might cause issues with it training those phonemes against ), but the few results I tested are very pleasing with actually following the prompted speaker, at least the couple of voices I test. I uploaded the "botched" model to https://huggingface.co/ecker/vall-e/. I should have it fixed for tomorrow (the 25th).

mmm... I had to go back to my 4xV100 system for the duration-post-fix training, ROCm is just being too much of a pill. I think I still need to bake it more since it only had a few hours sadly (I only realized to use my 4xV100s towards the evening). My notes so far:

when it works, it really works. It's too eerie on how similar it can get to the prompt, but I'm probably biased from expecting it to be removed far enough from the original speaker.
however, voices that don't belong to a normal, typical corpus (instead of being from Libri*/Emilia, being from my vidya-derived corpus), kind of break apart with the same kind of crusty male voice.
- I do notice it does preserve the acoustic part of the input prompt eerily well sometimes (if there's a reverb or if it has a through-a-speaker filter effect).
- I had to double check that the prior weights had the same issue where it just falls apart for these voices, and I suppose it did. I could have sworn some voices still somewhat tried to be similar.
I feel like it does a good job at Japanese, at least from my cursory tests, even with non-Japanese speakers. I need a concrete test for the other way around, and for German and French.

I pushed the weights to the HF repo, but I think I need to set aside a good day to let the post-fix training carry out, since I feel like 40% of outputs have extra junk at the end from the stop token taking longer to pop up. And hopefully that can help fill the gaps of voices it's not so good at if I elect to pick sampling by speaker rather than paths. It definitely has potential, but it falling apart on regular people voices has my doubts.

I guess I'll give some final thoughts.

For Emilia specifically:

for what it's worth, it's a decent dataset, as it offers something both LibriSpeech and audiobook readings don't really offer: a ton of speakers and some variety between those speakers.
the additional languages seem just enough to provide a good baseline to do cross-lingual inferences, and a strong baseline to finetune to other speakers (although it's conjecture based on how some light tests, but I know my problem with not being able to finetune to a Japanese voice with my previous portion of the dataset was because of how small that was).
it already being trimmed and transcribed is a huge plus. The agony of dealing with LibriSpeech is the sheer amount of time transcribing with WhisperX and hoping it's accurate enough.

Will definitely recommend to use for any speech dataset. For the size I used, it performed well.

Now, for the model specifically:

I think the "pick the most similar utterance" prompt sampling approach is a net good. Prompt adherence is really strong when it works.
However, voices it's not so confident in seems to vary between having artifacts in the audio (a NAR issue), or the latent space it maps the input prompt into being in a close ball park, but varying a little too much between inferences.
- I think the former problem is simply just from there being little utterances for a speaker compared to the big players in the dataset. The caveat of training a unified AR/NAR model is that there's not a guarantee a speaker will receive proper training that covers all eight RVQ levels unless there's a ton of utterances for that speaker (such as with the audiobook-derived corpus-es); Emilia has a lot of speakers that have a small handful of utterances.
A lot of my previous issues with output quality/performance actually stemmed from a bug with the input audio not resampling properly, which would explain why things I thought the old model being able to do "decently" ended up being shitty.
- Despite this, overall output quality seems to have improved between model weights.
Multi-lingual quality is adequate, but is much more likely to suffer from the former artifacting issues rather than the latter "latent mapping" problems.
- I will say I'm rather surprised it doesn't have the latter "latent mapping" problem, given the speakers I tested against having a very small presence in the dataset.
Outputs are almost consistent, but sometimes there'll be some issues in the output. I suppose samplers could fix this, since my tests stopped bothering with anything beyond AR temp 1.0 / NAR temp 0.0.

I'm pretty sure this won't be my final touches with the model, but until I get a breakthrough again (between another dataset or training technique like I did here), these should be my final thoughts on the model itself. The two core issues it seems to have now is between reduced quality / artifacting from the NAR and some voices not mapping accurately and precisely enough. The former requires more post-training and hoping I can try and prioritize the NAR more without lobotomizing the AR, and the latter I don't really have much of an idea on fixing without more post-training too.

That aside, I'll try and get the demo page updated with the current performing outputs when I do my finishing touches. I tried doing it the other day and it seemed mostly fine, but struggling for some speakers.

e-c-k-e-r / vall-e

Emilia dataset #2