Closed · MethanJess closed this issue 12 months ago
Hey, thank you for making this project! I ran it on WSL2 and everything went decently well (though it took a LOT of trial and error).
But then came the nightmare of using transcriber.py. When I ran it, it put 24 minutes' worth of audio (251 segments) into the "badaudio" folder, and the whole audio file is 3:57:31. I don't understand why these segments were "bad audio"; the words were reasonably clear in them...
Then came the transcribing. There is no progress bar, so I just watched train_data.txt grow in size until my file explorer stopped working (everything went blank) and I had to restart my PC (train_data.txt had reached 588 KB by then). I ran the transcriber again, and this time, after about 30 minutes of waiting, train_data.txt just stopped at 555 KB, which is smaller than it was when I restarted my computer. The terminal is still blank, I don't see a val_list.txt file anywhere, and train_data.txt seems to be missing 773 segments. Also, the ordering in the txt is weird: it goes "output0.wav|... output1.wav|... output10.wav|... output100.wav|..." and then goes back to output200 at the end...
So, could you PLEASE use WhisperX? It's much faster. Here is an example project that uses WhisperX as the segmenter and transcriber: https://github.com/JarodMica/audiosplitter_whisper. The segmenter in audiosplitter_whisper is actually kind of bad (so keep the current one that looks for silences), but the Whisper transcriber is really good.
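For reference, the basic WhisperX transcription call looks roughly like this (a minimal sketch based on the WhisperX README; the file name, model size, and batch size are placeholders):

```python
# Minimal WhisperX transcription sketch; see the WhisperX README for details.
import whisperx

device = "cuda"
model = whisperx.load_model("large-v2", device, compute_type="float16")

audio = whisperx.load_audio("audio.wav")         # placeholder path
result = model.transcribe(audio, batch_size=16)  # batched, hence the speed
for segment in result["segments"]:
    print(segment["start"], segment["end"], segment["text"])
```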
So I have tried whisper/whisperx and a few other repos. I am using what I am using because it was accurate for my dataset. This very likely isn't going to work for everyone; my fine-tuning dataset was only an hour of audio. So even if I did swap the instructions for a different technique, I'd still run into people having issues.
Some things you can check: make sure the audio is 24,000 Hz, and shorten that long wav file. Try something like a 20-minute wav file, segment it, then transcribe it and see how it does. The output order is because of string sorting; as long as the audio file names match the data in the txt, it shouldn't matter. But I can put a numerical sort on it.
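A quick sketch of both checks (assumes the soundfile package and a "segments" folder of outputN.wav files, both placeholders):

```python
# Verify each segment is 24 kHz, listing files in numerical rather than
# string order, so output2.wav sorts before output10.wav.
import re
import soundfile as sf
from pathlib import Path

def numeric_key(path: Path) -> int:
    # Pull the integer out of names like "output123.wav"; 0 if none found.
    match = re.search(r"(\d+)", path.stem)
    return int(match.group(1)) if match else 0

for wav in sorted(Path("segments").glob("*.wav"), key=numeric_key):
    info = sf.info(wav)
    if info.samplerate != 24000:
        print(f"{wav.name}: {info.samplerate} Hz (expected 24000)")
```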
I just changed it to a numerical sort and fixed the missing val_list.txt output issue. Also curious how well auditok handled your segmentation and whether you changed the length of the segments. Ultimately, if my transcriber doesn't work for you, you can use whatever you like; just get the data into val_list.txt, train_list.txt, and OOD_list.txt and you can continue on with the process.
Thanks a lot for fixing the val_list.txt output issue.
I did get an error from the 4-hour file I was using:
self._handle = _dlopen(self._name, mode)
OSError: /tmp/tmp_zznlbdx/libespeak-ng.so.1.1.49: cannot map zero-fill pages
But I fixed this by cutting the audio down to only 1 hour.
> Also curious how well auditok handled your segmentation and whether you changed the length of the segments.
The segmenter was pretty decent and only took a second to run, but some of the audio segments had a cut from another sentence (sometimes at the start, sometimes at the end), and some segments were just a 0.1-second file containing a 'click' sound. I found this project to be way better at segmenting audio by silences: https://github.com/flutydeer/audio-slicer. But it is much slower (takes like 10 minutes). And no, I kept the segment length at the default.
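For what it's worth, segment length and the tiny "click" regions are both tunable in auditok itself; a minimal sketch (assumes the auditok package; "input.wav" and the parameter values are placeholders, not this repo's defaults):

```python
# Silence-based splitting with auditok; every parameter below is a knob.
import auditok

regions = auditok.split(
    "input.wav",
    min_dur=1.0,          # drop sub-second "click" regions
    max_dur=4.0,          # cap segment length in seconds
    max_silence=0.3,      # silence tolerated inside one segment
    energy_threshold=50,  # raise to be stricter about what counts as speech
)
for i, region in enumerate(regions):
    region.save(f"output{i}.wav")
```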
I am now getting this error when I run "python train_finetune.py --config_path ./Configs/config_ft.yml":
File "/home/mastaldal/anaconda3/envs/StyleTTS2/lib/python3.10/site-packages/transformers/models/albert/modeling_albert.py", line 719, in forward
buffered_token_type_ids_expanded = buffered_token_type_ids.expand(batch_size, seq_length)
RuntimeError: The expanded size of the tensor (1042) must match the existing size (512) at non-singleton dimension 1. Target sizes: [7, 1042]. Tensor sizes: [1, 512]
I couldn't find anything about what this error means. I played with batch_size and max_len, but nothing seems to fix it...
Also, what do you mean by "So until further guidance, you can get a new dataset and transcribe and segment it and just label the txt OOD_list.txt."? Do I just copy my train_list.txt file and rename it to OOD_list.txt, or should I use the OOD_list included in StyleTTS2?
That error I go into addressing here: https://github.com/yl4579/StyleTTS2/issues/72. You have to swap a couple of functions around, essentially.
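For context on the numbers in that traceback: 512 is the ALBERT text encoder's maximum sequence length, and 1042 is the tokenized length of one of the training lines, so at least one transcript is simply too long. A rough sanity check (assumes the pipe-delimited "path|text|speaker" line format of the list files; character count is only a crude proxy for token count):

```python
# Flag training lines whose text is likely over the 512-token window.
MAX_TOKENS = 512

with open("train_list.txt", encoding="utf-8") as f:
    for lineno, line in enumerate(f, start=1):
        parts = line.rstrip("\n").split("|")
        if len(parts) < 2:
            continue  # skip malformed lines
        text = parts[1]
        if len(text) > MAX_TOKENS:
            print(f"line {lineno}: {len(text)} chars, probably too long")
```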
So I did some digging, and the OOD_list.txt is just a lot of text with no reference audio; in theory it's supposed to expose the model to a wider range of speech. So it can be just about anything as long as it's not training data. You can just use the OOD_list.txt that comes in the repo, yes. You need three text files: train_list.txt, val_list.txt, and OOD_list.txt. The script makes val_list.txt and train_list.txt, and OOD_list.txt will already be there, so don't worry about it.
Okay, so it turns out the audio segments and text files were too long for StyleTTS2; I had to keep them to about 4 seconds long. But now I'm running into another problem:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 50.00 MiB. GPU 0 has a total capacty of 8.00 GiB of which 0 bytes is free. Including non-PyTorch memory, this process has 17179869184.00 GiB memory in use. Of the allocated memory 14.21 GiB is allocated by PyTorch, and 419.64 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
I played a lot with the batch size, but nothing seemed to fix it; this issue could be caused by literally anything...
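One thing worth trying first is the allocator setting the traceback itself suggests; a minimal sketch (the 128 MB value is just a starting point to experiment with, not a recommendation from this repo):

```python
# Must be set before CUDA initializes; equivalently, export it in the shell:
#   PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 python train_finetune.py --config_path ./Configs/config_ft.yml
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"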
What are the specs of your system? StyleTTS2 is extremely RAM-heavy. Also, I looked into Whisper a bit more and wrote a script that uses it along with segmentation and phonemization, so I'll push that to the repo soon. There is a known issue where the transcribed text doesn't have enough punctuation, so I tried to solve that. There is also a very recent update to the StyleTTS2 repo that adds accelerated fine-tuning, which I have included as well.
Edit: For reference, I am using an Nvidia A6000, which has 48 GB of VRAM, and I am forced to a batch size of 2 and a max_len of 500.
Ouch, that's really demanding. May I ask how much VRAM this setup takes (out of those 48 GB)? I am thinking it might still be possible to fine-tune on 2x T4 (32 GB total) on Kaggle...
And do you know whether a batch size of 1 would really produce bad results? I have seen the original repo, and the author suggests a batch size of at least 2...
I haven't tried batch size 1; that would take at least three days with a 45-minute dataset. You may be able to get away with a batch size of 2 by dropping max_len to maybe 300? I'm not sure, though. It uses up all of my VRAM at batch size 2 and max_len 500.
Is it possible to fine-tune with an RTX 4090, with only 24 GB of VRAM?...