Open dutchsing009 opened 7 months ago
First was a missing file, I added cleaner.py to the repo. Second - hm, looks somehow like a wrong stable-whisper version, can I get more info? On which line does this occur? Third: you need the environment variable COQUI_MODEL_PATH pointing to the folder that contains the v2.0.2 XTTS model. Sorry, this is all still very, very raw code.
No problem, take your time! Thanks for the new ideas. I can keep debugging this till it is complete :)
Now split_dataset.py gives this error
Traceback (most recent call last):
File "/content/WhoSpeaks/split_dataset.py", line 12, in
This was another missing file, I added it (num_to_words.py)
Ok, now split_dataset.py works perfectly, even the word_timestamps thingy disappeared. But auto_diarize still gets this error:
Loading TTS model
Traceback (most recent call last):
File "/content/WhoSpeaks/auto_diarize.py", line 25, in
local_models_path = os.environ.get("/content/WhoSpeaks/model")
Try local_models_path = "/content/WhoSpeaks/model". The XTTS model should then be in "/content/WhoSpeaks/model/v2.0.2".
local_models_path should be set to the path of your XTTS models. With os.environ.get it tries to read the folder name from the environment variable "COQUI_MODEL_PATH". So either you directly set the folder in the local_models_path variable in the code, or you create an environment variable "COQUI_MODEL_PATH" and put the folder into that env variable.
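To make both options concrete, here is a minimal sketch (the /content/WhoSpeaks/model path is just an example):

```python
import os

# Option 1: read the folder from the COQUI_MODEL_PATH environment variable,
# falling back to a hard-coded default when the variable is not set.
local_models_path = os.environ.get("COQUI_MODEL_PATH", "/content/WhoSpeaks/model")

# Option 2: set the folder directly in the code instead.
# local_models_path = "/content/WhoSpeaks/model"

# Either way, the v2.0.2 XTTS model is expected in a subfolder:
checkpoint = os.path.join(local_models_path, "v2.0.2")
```

The original error happened because os.environ.get was given the folder path itself instead of the variable name, so it returned None and os.path.join failed later.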
Oh ok, I moved away from that actually but forgot to mention it here. Now I get this:
Loading TTS model
Using model: xtts
TTS model loaded
Traceback (most recent call last):
File "/content/WhoSpeaks/auto_diarize.py", line 57, in
Hm, strange. Are there sentences in the output_sentences_wav folder? If not, please use convert_wav.py first.
Super sorry, totally forgot about it since I can't see the mp3/wav extension of the file because the name is so long. Yeah, all working good now.
In auto_diarize.py it asks me: Enter the number of speakers (clusters) you have identified.
In case of the default CoinToss scene you provided, I entered 2 and it worked. But what if I have a podcast or an episode where I don't know the actual number of speakers? I thought "auto" just auto-identifies, unlike speaker_diarize.py.
I still have to work out something reliable for automatic speaker number detection; the filename may be misleading, sorry. It was the original intention, but I didn't overcome all the challenges so far and then forgot to rename it. Must be somehow possible algorithmically though. It is not too hard for a human looking at the dendrogram.
Take your time man, actually this method is illegally good. One funny thing: since it is Whisper-based, sometimes Whisper doesn't transcribe certain sounds or unhearable words, so they won't appear in the diarization process, which is a plus - it is diarizing and cleaning at the same time. Diarizing only the important words helps in dataset making for TTS, but it's not that good for video captioning or live performance, if you know what I mean.
Also yeah, worst case scenario: add an independent clustering method to count speakers? https://github.com/tango4j/Auto-Tuning-Spectral-Clustering or any other simple repo.
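Spectral approaches like that repo commonly estimate the count with the eigengap heuristic: look for the largest jump in the Laplacian's eigenvalue spectrum. A rough numpy sketch, assuming L2-normalized speaker embeddings (not the repo's actual code):

```python
import numpy as np

def estimate_num_speakers(embeddings, max_speakers=10):
    """Eigengap heuristic: build an affinity matrix from speaker embeddings,
    take the eigenvalues of its normalized Laplacian, and pick the count
    where the gap between consecutive eigenvalues is largest."""
    # Cosine affinity between all pairs of (assumed L2-normalized) embeddings
    affinity = embeddings @ embeddings.T
    np.fill_diagonal(affinity, 0.0)
    # Symmetric normalized Laplacian: L = I - D^-1/2 A D^-1/2
    degree = affinity.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(degree, 1e-10))
    laplacian = np.eye(len(affinity)) - d_inv_sqrt[:, None] * affinity * d_inv_sqrt[None, :]
    eigvals = np.sort(np.linalg.eigvalsh(laplacian))
    # The largest gap among the first max_speakers eigenvalues suggests the count
    gaps = np.diff(eigvals[: max_speakers + 1])
    return int(np.argmax(gaps)) + 1
```

The number of near-zero eigenvalues roughly equals the number of well-separated groups, so the first big gap marks the speaker count - at least when the embeddings separate cleanly.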
I'll look into that repo, thank you for that one. Not that sure this method works for movies; I've only tried podcasts and scenes without much background noise so far. I think the speaker embedding will probably not be that tolerant towards that. For movies you'd probably want those harder-to-hear voices too. Hard to draw the line towards background noises, I guess. Another thing I have to think about.
Implemented an automatic speaker number detection algorithm. Manual overwrite remains an option, with a suggested number provided beforehand.
Wonderful, I have been trying it for the last 2 hours on different stuff, and it is something special, but I don't think it gets the speaker count right at all. I had like 12 speakers and it says 4? I'm using large-v3 btw, it is good with it.
Ok, thanks a lot for the feedback. Seems this automatic speaker count thing does not work reliably yet. At least the other speaker diarization engines often fail at that too, to my knowledge. I'd love to see the dendrogram for this scenario; I'm wondering if a human could sort it out or if the problem lies in the data.
Haven't tried that many voices so far, guess I have to put more work into that then. Some example YouTube videos where it fails that hard would really be helpful. Added a realtime diarization file too, but since it uses the same "automatic speaker count" algo it might not be that useful currently.
No problem, take your time. I will try the latest commit "updated auto speaker count guess" and tell you what happens. I will also try realtime diarization, although I saw it yesterday on your channel and it is amazing. I might upload some of my experiments if you want, so you can take a dig at what happens when there are many speakers, 10+.
Would appreciate it, saves me some testing time.
Mmm, interesting. So with the latest commit it now gives me 2 speakers; before that commit it gave me 4.
Automatical speaker count suggestion: 2 speakers.
It was: Automatical speaker count suggestion: 4 speakers.
The first image counted this as 4; now it gives this image and counts it as 2.
I also got an idea for another .py file that might be added to the scope of this project. I wonder if it could take a voice reference for a certain speaker, for example 1, 2 or 5 minutes of his voice as reference, then iterate over, let's say, 6-8 hours of audio and extract his voice only. That would be fun, and would be extremely, extremely useful for dataset making.
Lol, I already have that as extract_speaker_sentences.py, I can upload it if you want.
Just added it. If we both had the same idea, others will find it useful too.
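The core idea of such a script can be sketched with plain cosine similarity over speaker embeddings. The inputs and the threshold below are placeholders, not the actual extract_speaker_sentences.py implementation:

```python
import numpy as np

def select_speaker_sentences(reference_emb, sentence_embs, threshold=0.75):
    """Return indices of sentences whose speaker embedding is close enough
    (cosine similarity) to the reference speaker's embedding.
    reference_emb: 1-D array; sentence_embs: 2-D array, one row per sentence.
    The 0.75 threshold is made up; real embeddings need tuning."""
    ref = reference_emb / np.linalg.norm(reference_emb)
    sents = sentence_embs / np.linalg.norm(sentence_embs, axis=1, keepdims=True)
    sims = sents @ ref  # cosine similarity of every sentence to the reference
    return [i for i, s in enumerate(sims) if s >= threshold]
```

Embed the reference clip once, embed every sentence chunk from the long audio, and keep only the chunks above the threshold - everything else is the same sentence-splitting pipeline as before.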
Amazing. I'm downloading something big to test extract_speaker_sentences.py's actual ability, so it will take time. But how long should the reference audio be, do you have any idea? Normally the more the better, but idk with this one.
6-10 seconds is fine. More than 10s is cut off by the Coqui TTS max_ref_len parameter. It's possible to use more but you don't get a much better result.
Ok, so basically I can't even get through split_dataset.py. My audio file is 7 hours, 990 MB; it splits the audio into 50 chunks and every chunk takes around 6-7 minutes (transcribe, refine) on a P100. It has been running for 8 hours.
That's why I'm taking so long to respond, and I can't even test extract_speaker_sentences.py yet :( And I thought I was going to test it with 70 hours of audio, lol, can't even get through 7 hours.
Ok, I wasn't aware it's so bad with large files. Stable whisper does the transcription because it yields the most precise timestamps. I can add support for faster-whisper timestamps, which are a bit less precise but way faster. That may be a better option for large files. Or I implement something only VAD-based without timestamps, like in realtime_diarize.
Currently you can only switch refinement off, but that only saves like half of the time.
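The VAD-only route could be as simple as energy thresholding per frame. A toy sketch (frame length and threshold are made up; real VADs like Silero VAD or webrtcvad are far more robust):

```python
import numpy as np

def energy_vad(samples, frame_len=480, threshold=0.02):
    """Toy energy-based VAD: mark each frame as speech when its RMS energy
    exceeds a threshold. Only illustrates the idea of splitting audio into
    speech regions without any per-word timestamps."""
    n_frames = len(samples) // frame_len
    flags = []
    for i in range(n_frames):
        frame = samples[i * frame_len : (i + 1) * frame_len]
        rms = float(np.sqrt(np.mean(frame ** 2)))
        flags.append(rms > threshold)
    return flags
```

Consecutive True frames would then be merged into speech segments and fed to the speaker embedding directly, skipping transcription entirely, which is why it scales so much better on multi-hour files.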
Ok, so I switched refinement off and used tiny instead of large-v3. For the same 7 hours / 990 MB it is taking almost 5 hours lol.
It is stuck on this (edit: not stuck, it is moving of course, but really slowly; it is like a loop that never ends)
And above, in yellow, it says: Streaming output truncated to the last 5000 lines
Hey, I tried to run this on Colab and I get 3 kinds of errors. When split_dataset.py is run: ImportError: cannot import name 'multilingual_cleaners' from 'cleaner' (/usr/local/lib/python3.10/dist-packages/cleaner/__init__.py)
So I remove it, then I run it again and I get this: TypeError: DecodingOptions.__init__() got an unexpected keyword argument 'word_timestamps'
When auto_diarize is run:
Loading TTS model
Traceback (most recent call last):
File "/content/WhoSpeaks/auto_diarize.py", line 25, in
checkpoint = os.path.join(local_models_path, "v2.0.2")
File "/usr/lib/python3.10/posixpath.py", line 76, in join
a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not NoneType
Here is the Colab, you can edit it, run it with a GPU: https://colab.research.google.com/drive/1Odmp1RCTvoWw25R8Du8Nl5n6J-jW3FXk?usp=sharing