kunibald413 opened 2 months ago
> 101k hours
Hot damn. I think my own collection caps out at ~14K hours between pieces of LibriSpeech and audiobooks, and a very, very small smattering from videogames. Even the smaller languages being >1K hours is a huge help, since the biggest hurdle for me was trying to even find a large enough corpus to piece together for my own dataset for just Japanese.
> 2.4TB
Daunting, but I'll see if I have some spare spinning rust I can store it on and pick at it. If anything, I might just start with the smaller languages to squeeze out some of the multi-lingual-ness I keep meaning to get around to.
There already being transcriptions helps a ton, as half of the pain with audio processing is the time spent waiting to transcribe. The other half, having to quantize it all through EnCodec, is still a bit of a chore, but I think my recent setup of being able to split the work across GPUs should help.
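Roughly what I mean by splitting the quantization across GPUs; a rough sketch (not my actual script), assuming the reference `encodec` package and one worker process per device, with the paths and the `.qnt.pt` output naming as placeholders:

```python
# Sketch: split EnCodec quantization across GPUs, one worker process per device.
# Paths, globbing, and output naming are placeholders.
import glob
import torch
import torchaudio
import torch.multiprocessing as mp
from encodec import EncodecModel
from encodec.utils import convert_audio

def worker(device_idx: int, num_devices: int, files: list[str]):
    device = f"cuda:{device_idx}"
    model = EncodecModel.encodec_model_24khz()
    model.set_target_bandwidth(6.0)
    model.to(device).eval()

    # Each worker takes every Nth file so the load is split evenly.
    for path in files[device_idx::num_devices]:
        wav, sr = torchaudio.load(path)
        wav = convert_audio(wav, sr, model.sample_rate, model.channels)
        with torch.no_grad():
            frames = model.encode(wav.unsqueeze(0).to(device))
        codes = torch.cat([code for code, _ in frames], dim=-1).cpu()
        torch.save(codes, path + ".qnt.pt")

if __name__ == "__main__":
    files = sorted(glob.glob("dataset/**/*.wav", recursive=True))
    num_devices = torch.cuda.device_count()
    mp.spawn(worker, args=(num_devices, files), nprocs=num_devices)
```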
Having more audio in general, and especially non-monotonous utterances, should help a ton. I'm pretty sure I already hit immense diminishing returns after an epoch.
Appreciated. I'll see what I can pick at over the next few days while my spark hasn't waned yet once again.
doing large preps and trains can be soul-crushing. passion and a little bit of compulsion can keep you going, but i hope you don't burn yourself out.
> doing large preps and trains can be soul-crushing.
It's not so bad this go-around. It used to be agonizing with system instability (segfaults or hard reboots with anything under PyTorch) on my original training system with my 4070Ti. Swapping to my 7900XTX almost entirely resolved the problems for dataset preparation and non-important training.
The estimated week-and-a-half wait for the dataset to process is always a good time for any last-minute checks or ideas to get added; for example, something to solve my dilemma of subpar zero-shot performance that may stem from my naive prompt sampling (for months I've been entertaining the idea of using a vector store of input prompts to sample the closest utterance for a given training sample).
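Something like the following is what I keep picturing for that idea; only a rough sketch, with the names (`PromptStore`, the mean-pooled utterance embeddings) made up for illustration:

```python
# Sketch: pick the input prompt whose embedding is closest to the target utterance.
# Assumes per-utterance embeddings (e.g. mean-pooled resp embeddings) were
# precomputed; all names here are hypothetical.
import torch
import torch.nn.functional as F

class PromptStore:
    def __init__(self):
        # speaker -> (N, D) matrix of unit-normalized utterance embeddings + paths
        self.embeddings: dict[str, torch.Tensor] = {}
        self.paths: dict[str, list[str]] = {}

    def add(self, speaker: str, path: str, embedding: torch.Tensor):
        emb = F.normalize(embedding.flatten(), dim=0).unsqueeze(0)
        if speaker not in self.embeddings:
            self.embeddings[speaker] = emb
            self.paths[speaker] = [path]
        else:
            self.embeddings[speaker] = torch.cat([self.embeddings[speaker], emb])
            self.paths[speaker].append(path)

    def closest(self, speaker: str, query: torch.Tensor, exclude: str) -> str:
        # Cosine similarity against every stored utterance for this speaker,
        # skipping the training sample itself so it can't be its own prompt.
        sims = self.embeddings[speaker] @ F.normalize(query.flatten(), dim=0)
        for idx in sims.argsort(descending=True).tolist():
            if self.paths[speaker][idx] != exclude:
                return self.paths[speaker][idx]
        return exclude  # degenerate case: speaker has a single utterance
```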
> passion and a little bit of compulsion can keep you going, but i hope you don't burn yourself out.
I think I'm beyond the hyperfixate-then-burnout-then-hyperfixate cycles I've kept getting myself into; it just seems to be lulls in progress until I grow a wild hair and fiddle with a new idea (for example, the STT task being quite promising for reducing the loss overall, so I hope putting emphasis on more languages will in turn help the model overall). The urgency and dread of trying to push out a decent model went away by the time I pushed out a quasi-decent `ar+nar-llama-8` model with my 4xV100 system: it's rather quick to churn out a model with good results, and I don't need to keep bruteforcing additional epochs for very little gain like I did with the `ar+nar-retnet-8`.

Although, again, that 50K hours of English audio is still daunting. I think the best approach is to download the N-smallest `.tar`s to prioritize more speakers over speakers with a lot of utterances (unfortunately a lot of datasets prioritize the latter over the former).
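Something along these lines for grabbing the shards; the manifest format (filename to size) is hypothetical:

```python
# Sketch: prioritize speaker variety by grabbing the N smallest .tar shards first.
import os

def pick_smallest_tars(manifest: dict[str, int], n: int) -> list[str]:
    """Return the n smallest .tar files by reported size (bytes)."""
    tars = [(size, name) for name, size in manifest.items() if name.endswith(".tar")]
    return [name for _, name in sorted(tars)[:n]]

def pick_smallest_local(root: str, n: int) -> list[str]:
    """Same idea, but for shards already on disk."""
    tars = [os.path.join(root, f) for f in os.listdir(root) if f.endswith(".tar")]
    return sorted(tars, key=os.path.getsize)[:n]
```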
Should have everything prepared for the next training session. Detailing the below mostly for my sake to have it in writing somewhere:
Grabbed the N-smallest (`tar` files under, I think, 300MiB, for 312 speakers) datasets, just to make sure I have enough data.

The input prompt for a given training sample is now picked as the most similar utterance from that speaker, compared through the `resp` embedding (similar to inputs for STT).

As for the actual dataset to train against, I think I'm going to keep my existing data (`small`+`medium`+`duplicate`) in the mix alongside the Emilia portion.
I'll just resume training from the existing `ar+nar-tts+stt-llama-8` weights, since I don't think I need to restart from scratch (the model is still rather malleable from all the other tweaks I've glued on), but I'll keep the same dataset sampling method of "sort by duration with a fixed batch size, and let it go from smallest to largest utterances" (as I still do not trust my batch-by-duration-size method to be stable).
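For reference, that "sort by duration with a fixed batch size" sampling amounts to roughly this (a sketch, not the repo's actual sampler; it assumes per-utterance durations are precomputed and that it's plugged in as a `batch_sampler`):

```python
# Sketch: fixed batch size, utterances visited from shortest to longest.
from torch.utils.data import Sampler

class SortedByDurationSampler(Sampler):
    def __init__(self, durations: list[float], batch_size: int):
        # Dataset indices ordered by utterance duration, shortest first.
        self.order = sorted(range(len(durations)), key=lambda i: durations[i])
        self.batch_size = batch_size

    def __iter__(self):
        # Yield fixed-size batches so sequence lengths inside a batch stay close.
        for start in range(0, len(self.order), self.batch_size):
            yield self.order[start:start + self.batch_size]

    def __len__(self):
        return (len(self.order) + self.batch_size - 1) // self.batch_size
```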
My hope is that the Emilia dataset is the answer to my problems. As time goes on I'm more pilled on having a very, very good dataset rather than a large one, and I feel the big crux of these TTS systems is having a sloppy dataset. If results look promising, then that'll be my push to deal with processing the rest of the dataset.
One huge oversight I made is that there are ~400k allegedly-unique speakers among the portion of the dataset I collected. A bit of a pain, since I made the assumption that each group was its own speaker, so I have to work around juggling that many speakers.
I think it's promising? A few user-error hiccups:
- Some phonemes got `<unk>`'d.
- Fewer `stt` tasks got picked than expected (despite being 25% likely), and few LibriTTS-R utterances got picked (despite making up ~7% of the dataset).

I'm doing a pass with `sample_max_duration_batch` + `sample_shuffle` (and 16 batch * 16 gradient accumulation) to "fix" the model (again) to work across a range of durations (a drawback from sorting by duration), and there's probably some benefit from training under bfloat16 instead of float16 + loss scaling (even though I feel the latter yields "better" models than the former).

My expectations are pretty good. I think my only regret is throwing too many changes at once again (a handful of different languages, the "use the most similar utterance" feature, more STT). It's hard to gauge what really helped, but I can't complain if it all helped together.
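For my own notes, the bfloat16 vs. float16 + loss scaling difference boils down to roughly this (a generic PyTorch sketch, not the trainer's actual code; the model is assumed to return a scalar loss):

```python
# Sketch: bfloat16 autocast needs no loss scaling; float16 autocast pairs with GradScaler.
import torch

def train_step(model, batch, optimizer, use_bf16: bool, scaler=None):
    dtype = torch.bfloat16 if use_bf16 else torch.float16
    with torch.autocast(device_type="cuda", dtype=dtype):
        loss = model(**batch)  # assumed to return a scalar loss

    optimizer.zero_grad()
    if use_bf16:
        loss.backward()
        optimizer.step()
    else:
        # float16 needs loss scaling to keep small gradients from underflowing.
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

# usage for the float16 path: scaler = torch.cuda.amp.GradScaler()
```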
Tomorrow I should have some time mucking with the model and seeing (hearing) if it's as good as it looks.
I botched the duration "fix" post-training with an old copy of the tokenizer from July (which shouldn't affect things, but a few missing phonemes might cause issues from it training those phonemes against `<unk>`).
mmm... I had to go back to my 4xV100 system for the duration-post-fix training; ROCm is just being too much of a pill. I think I still need to bake it more, since it only had a few hours sadly (I only thought to use my 4xV100s towards the evening). My notes so far:
I pushed the weights to the HF repo, but I think I need to set aside a good day to let the post-fix training carry out, since I feel like 40% of outputs have extra junk at the end from the stop token taking longer to pop up. Hopefully that can also help fill the gaps for voices it's not so good at, if I elect to sample by speaker rather than by paths. It definitely has potential, but it falling apart on regular people's voices gives me doubts.
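The kind of band-aid I could throw at the trailing junk in the meantime, sketched out; the `STOP_TOKEN` id and the logit shapes are placeholders, not the repo's actual names:

```python
# Sketch: cut the AR output at the first stop token, and optionally nudge the
# stop token's logit upward once generation runs past the expected duration.
import torch

STOP_TOKEN = 1024  # hypothetical id

def trim_at_stop(tokens: torch.Tensor) -> torch.Tensor:
    hits = (tokens == STOP_TOKEN).nonzero()
    return tokens if hits.numel() == 0 else tokens[: hits[0, 0]]

def bias_stop(logits: torch.Tensor, step: int, expected_len: int,
              strength: float = 0.05) -> torch.Tensor:
    # Gently raise the stop token's logit the further past the expected
    # length we get, so it "pops up" sooner.
    if step > expected_len:
        logits = logits.clone()
        logits[STOP_TOKEN] += strength * (step - expected_len)
    return logits
```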
I guess I'll give some final thoughts.
For Emilia specifically:
I'd definitely recommend using it for any speech dataset. For the size I used, it performed well.
Now, for the model specifically:
I'm pretty sure these won't be my final touches on the model, but until I get another breakthrough (from another dataset or a training technique like the one here), these should be my final thoughts on the model itself. The two core issues it seems to have now are reduced quality / artifacting from the NAR, and some voices not mapping accurately and precisely enough. The former requires more post-training and hoping I can prioritize the NAR more without lobotomizing the AR; the latter I don't really have much of an idea on fixing without more post-training, too.
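The only concrete idea I have for prioritizing the NAR without lobotomizing the AR is reweighting the per-level losses during post-training; a hypothetical sketch, with the loss split and the weights made up:

```python
# Sketch: tilt the combined loss toward the NAR levels during post-training.
# The ar_loss / nar_losses split and the weights are hypothetical.
import torch

def combined_loss(ar_loss: torch.Tensor, nar_losses: list[torch.Tensor],
                  ar_weight: float = 0.5, nar_weight: float = 1.5) -> torch.Tensor:
    # Down-weighting the AR (rather than zeroing it) keeps it from degrading
    # while the NAR levels get the larger share of the gradient.
    nar_loss = torch.stack(nar_losses).mean()
    return ar_weight * ar_loss + nar_weight * nar_loss
```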
That aside, I'll try and get the demo page updated with the current best-performing outputs when I do my finishing touches. I tried doing it the other day and it seemed mostly fine, but it struggled for some speakers.
have you seen this dataset? maybe it's better suited for the zero-shot task, more natural speech than audiobooks
https://github.com/open-mmlab/Amphion/blob/main/preprocessors/Emilia/README.md