KoljaB / WhoSpeaks

Efficient approach to speaker diarization using voice characteristics extraction

3 errors #2

Open dutchsing009 opened 2 months ago

dutchsing009 commented 2 months ago

Hey, I tried to run this on Colab and I get 3 kinds of errors. When split_dataset.py is run:

ImportError: cannot import name 'multilingual_cleaners' from 'cleaner' (/usr/local/lib/python3.10/dist-packages/cleaner/__init__.py)

So I removed that import and ran it again, and then I got:

TypeError: DecodingOptions.__init__() got an unexpected keyword argument 'word_timestamps'

When auto_diarize is run:

Loading TTS model
Traceback (most recent call last):
  File "/content/WhoSpeaks/auto_diarize.py", line 25, in <module>
    checkpoint = os.path.join(local_models_path, "v2.0.2")
  File "/usr/lib/python3.10/posixpath.py", line 76, in join
    a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not NoneType

Here is the Colab; you can edit it and run it on a GPU: https://colab.research.google.com/drive/1Odmp1RCTvoWw25R8Du8Nl5n6J-jW3FXk?usp=sharing

KoljaB commented 2 months ago

The first one was a missing file; I added cleaner.py to the repo. Second - hm, that looks like a wrong stable-whisper version. Can I get more info? Which line does this occur on? Third: you need the environment variable COQUI_MODEL_PATH pointing to the folder that contains the v2.0.2 XTTS model. Sorry, this is all still very, very raw code.
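For the third one, in the Colab you can set that environment variable at the top of the notebook before running the scripts, for example (the folder below is just the example path used later in this thread):

```python
import os

# Set this before running auto_diarize.py; the folder must CONTAIN the
# v2.0.2 XTTS model directory (example path from this thread).
os.environ["COQUI_MODEL_PATH"] = "/content/WhoSpeaks/model"
```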

dutchsing009 commented 2 months ago

No problem, take your time! Thanks for the new ideas; I can keep debugging this till it is complete :) Now split_dataset.py gives this error:

Traceback (most recent call last):
  File "/content/WhoSpeaks/split_dataset.py", line 12, in <module>
    from cleaner import multilingual_cleaners
  File "/content/WhoSpeaks/cleaner.py", line 11, in <module>
    from num_to_words import TextNorm as zh_num2words
ModuleNotFoundError: No module named 'num_to_words'

And when I change line 11 in cleaner.py from num_to_words to num2words:

ImportError: cannot import name 'TextNorm' from 'num2words' (/usr/local/lib/python3.10/dist-packages/num2words/__init__.py)

KoljaB commented 2 months ago

That was another missing file; I added it (num_to_words.py).

dutchsing009 commented 2 months ago

OK, now split_dataset.py works perfectly; even the word_timestamps thing disappeared. But auto_diarize still gets this error:

Loading TTS model
Traceback (most recent call last):
  File "/content/WhoSpeaks/auto_diarize.py", line 25, in <module>
    checkpoint = os.path.join(local_models_path, "v2.0.2")
  File "/usr/lib/python3.10/posixpath.py", line 76, in join
    a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not NoneType

I have put the model path in line 24: local_models_path = os.environ.get("/content/WhoSpeaks/model")

KoljaB commented 2 months ago

Try local_models_path = "/content/WhoSpeaks/model". The XTTS model should then be in "/content/WhoSpeaks/model/v2.0.2".

local_models_path should be set to the path of your XTTS models. With os.environ.get it tries to read the folder name from the environment variable "COQUI_MODEL_PATH". So either set the folder directly in the local_models_path variable in the code, or create an environment variable "COQUI_MODEL_PATH" and put the folder into that variable.
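Roughly, the resolution logic looks like this (a simplified sketch, not the exact lines in auto_diarize.py), with a fallback so an unset variable no longer crashes os.path.join:

```python
import os

# Read the model folder from the environment variable; fall back to a
# hard-coded path if it is unset. Without the fallback, os.environ.get
# returns None, which is what caused the NoneType error above.
local_models_path = os.environ.get("COQUI_MODEL_PATH", "/content/WhoSpeaks/model")
checkpoint = os.path.join(local_models_path, "v2.0.2")
print("Loading XTTS checkpoint from", checkpoint)
```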

dutchsing009 commented 2 months ago

Oh, OK, I actually moved past that already but forgot to mention it here. Now I get this:

Loading TTS model
Using model: xtts
TTS model loaded
Traceback (most recent call last):
  File "/content/WhoSpeaks/auto_diarize.py", line 57, in <module>
    embeddings_scaled = scaler.fit_transform(embeddings_array)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/utils/_set_output.py", line 295, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/base.py", line 1098, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/preprocessing/_data.py", line 876, in fit
    return self.partial_fit(X, y, sample_weight)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/preprocessing/_data.py", line 912, in partial_fit
    X = self._validate_data(
  File "/usr/local/lib/python3.10/dist-packages/sklearn/base.py", line 633, in _validate_data
    out = check_array(X, input_name="X", **check_params)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py", line 1035, in check_array
    raise ValueError(msg)
ValueError: Expected 2D array, got 1D array instead:
array=[].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

KoljaB commented 2 months ago

Hm, strange. Are there sentences in the output_sentences_wav folder? If not, please use convert_wav.py first.
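A minimal guard one could add before the scaler call, assuming the sentence wavs live in output_sentences_wav as mentioned above; it just fails early with a readable message instead of sklearn's empty-array error:

```python
import os
import sys

wav_dir = "output_sentences_wav"  # folder name taken from the comment above
wav_files = [f for f in os.listdir(wav_dir) if f.lower().endswith(".wav")]

if not wav_files:
    # Without any sentence wavs the embedding array ends up empty and
    # StandardScaler raises "Expected 2D array, got 1D array instead".
    sys.exit(f"No .wav files in '{wav_dir}' - run convert_wav.py first.")

print(f"Found {len(wav_files)} sentence wavs.")
```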

dutchsing009 commented 2 months ago

Super sorry, I totally forgot about it since I can't see the mp3/wav extension of the file because the name is so long. Yeah, everything works now.

dutchsing009 commented 2 months ago

auto_diarize.py asks me to "Enter the number of speakers (clusters) you have identified". In the case of the default CoinToss scene you provided I entered 2 and it worked, but what if I have a podcast or an episode where I don't know the actual number of speakers? I thought "auto" would identify it automatically, unlike speaker_diarize.py.

KoljaB commented 2 months ago

I still have to work out something reliable for automatic speaker number detection; the filename may be misleading, sorry. That was the original intention, but I haven't overcome all the challenges yet and then forgot to rename it. It must be possible algorithmically somehow, though. It is not too hard for a human looking at the dendrogram.
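One common heuristic for guessing the count from the embeddings (just a sketch, not what's in the repo right now): try a range of cluster counts with agglomerative clustering and keep the one with the best silhouette score.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

def suggest_speaker_count(embeddings: np.ndarray, max_speakers: int = 10) -> int:
    """Return the cluster count whose clustering has the best silhouette score."""
    best_k, best_score = 2, -1.0
    for k in range(2, min(max_speakers, len(embeddings) - 1) + 1):
        labels = AgglomerativeClustering(n_clusters=k).fit_predict(embeddings)
        score = silhouette_score(embeddings, labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```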

dutchsing009 commented 2 months ago

Take your time, man. Actually this method is illegally good. One funny thing is that since it is Whisper based, sometimes Whisper doesn't transcribe certain sounds or inaudible words, so they won't appear in the diarization process, which is a plus: it's as if it is diarizing and cleaning at the same time. Diarizing only the important words helps with dataset making for TTS, but it's not that good for video captioning or live use, if you know what I mean.

dutchsing009 commented 2 months ago

Also, yeah, worst case scenario: could we add an independent clustering method to count speakers? https://github.com/tango4j/Auto-Tuning-Spectral-Clustering or any other simple repo.
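If I understand that repo right, the core idea is an eigengap-style estimate: build an affinity matrix from the embeddings and pick the cluster count at the largest gap in the Laplacian's smallest eigenvalues. A rough sketch of that heuristic (not their actual auto-tuning code):

```python
import numpy as np

def eigengap_speaker_count(embeddings: np.ndarray, max_speakers: int = 10) -> int:
    """Estimate the speaker count from the eigengap of a cosine affinity matrix."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    affinity = normed @ normed.T                      # cosine similarities
    laplacian = np.diag(affinity.sum(axis=1)) - affinity
    eigvals = np.sort(np.linalg.eigvalsh(laplacian))  # ascending eigenvalues
    gaps = np.diff(eigvals[: max_speakers + 1])       # gaps between the smallest ones
    return int(np.argmax(gaps)) + 1                   # largest gap -> cluster count
```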

KoljaB commented 2 months ago

I'll look into that repo, thank you for that one. I'm not so sure this method works for movies; I've only tried podcasts and scenes without much background noise so far. I think the speaker embedding will probably not be that tolerant towards it. For movies you'd probably want the less audible voice noises too, and it's hard to draw the line towards background noise, I guess. Another thing I have to think about.

KoljaB commented 2 months ago

Implemented an automatic speaker number detection algorithm. Manual overwrite remains an option, with a suggested number provided beforehand.

dutchsing009 commented 2 months ago

Wonderful. I have been trying it for the last 2 hours on different material, and it is something special, but I don't think it gets the speaker count right at all: I had about 12 speakers and it says 4? I'm using large-v3 btw, it works well with it.

KoljaB commented 2 months ago

OK, thanks a lot for the feedback. It seems the automatic speaker count does not work reliably yet. At least the other speaker diarization engines often fail at that too, to my knowledge. I'd love to see the dendrogram for this scenario; I'm wondering whether a human could sort it out or whether the problem lies in the data.

I haven't tried that many voices so far, so I guess I have to put more work into it. Some example YouTube videos where it fails that badly would really be helpful. I added a realtime diarization file too, but since it uses the same "automatic speaker count" algorithm it may not be that useful currently.

dutchsing009 commented 2 months ago

No problem, take your time. I will try the latest commit, "updated auto speaker count guess", and tell you what happens. I will also try realtime diarization; I saw it yesterday on your channel and it is amazing. I might upload some of my experiments if you want, so you can take a dig at what happens when there are 10+ speakers.

KoljaB commented 2 months ago

Would appreciate that, it saves me some testing time.

dutchsing009 commented 2 months ago

Mmm, interesting. With the latest commit it now gives me 2 speakers; before that commit it gave me 4. It now prints "Automatical speaker count suggestion: 2 speakers." where it used to print "Automatical speaker count suggestion: 4 speakers."

dutchsing009 commented 2 months ago

The first image counted this as 4 (attached image d256f0ad-9722-4b1e-9918-a66bb5aa10d3); now it gives this image and counts it as 2 (attached image f0054917-5df2-41b0-b8ed-0b5de633e625).

dutchsing009 commented 2 months ago

I also got an idea for another .py file that might fit the scope of this project. I wonder if it could take a voice reference for a certain speaker, for example 1, 2, or 5 minutes of their voice, then iterate over, say, 6-8 hours of audio and extract only that speaker's voice. That would be fun, and extremely useful for dataset making. A rough sketch of what I have in mind is below.
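Roughly, something like comparing speaker embeddings against the reference with cosine similarity; get_speaker_embedding here is just a placeholder for whatever embedding model the other scripts use:

```python
import os
import numpy as np

def get_speaker_embedding(wav_path: str) -> np.ndarray:
    """Hypothetical placeholder: return a speaker embedding for the wav,
    e.g. from the same embedding model the other scripts use."""
    raise NotImplementedError

def extract_matching_sentences(reference_wav: str, sentences_dir: str,
                               threshold: float = 0.75) -> list[str]:
    """Return the sentence wavs whose voice is close (cosine similarity) to the reference."""
    ref = get_speaker_embedding(reference_wav)
    ref = ref / np.linalg.norm(ref)
    matches = []
    for name in os.listdir(sentences_dir):
        if not name.lower().endswith(".wav"):
            continue
        emb = get_speaker_embedding(os.path.join(sentences_dir, name))
        emb = emb / np.linalg.norm(emb)
        if float(ref @ emb) >= threshold:
            matches.append(name)
    return matches
```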

KoljaB commented 2 months ago

Lol, I already have that as extract_speaker_sentences.py; I can upload it if you want.

KoljaB commented 2 months ago

Just added it. If we both had the same idea, others will find it useful too.

dutchsing009 commented 2 months ago

Amazing. I'm downloading something big to test extract_speaker_sentences.py's actual ability, so it will take some time. But how long should the reference audio be, do you have any idea? Normally the more the better, but idk with this one.

KoljaB commented 2 months ago

6-10 seconds is fine. Anything over 10 s is cut off by the Coqui TTS max_ref_len parameter. It's possible to train with more, but you don't get a much better result.

dutchsing009 commented 2 months ago

OK, so basically I can't even get through split_dataset.py. My audio file is 7 hours, 990 MB; it splits the audio into 50 chunks and every chunk takes around 6-7 minutes (transcribe, refine) on a P100. It has been running for 8 hours.

dutchsing009 commented 2 months ago

That's why I'm taking so long to respond, and I can't even test extract_speaker_sentences.py yet :( And I thought I was going to test it with 70 hours of audio lol, can't even get through 7 hours.

KoljaB commented 2 months ago

OK, I wasn't aware it's so bad with large files. Stable-whisper does the transcription because it yields the most precise timestamps. I can add support for faster-whisper timestamps, which are a bit less precise but way faster; that may be a better option for large files. Or I could implement something purely VAD based without timestamps, like in realtime_diarize.
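In case it helps, faster-whisper exposes word-level timestamps directly through its transcribe call, roughly like this (model size and file name are just examples; compute_type="float16" assumes a GPU):

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.wav", word_timestamps=True, vad_filter=True)

# Each segment carries per-word timing when word_timestamps=True.
for segment in segments:
    for word in segment.words:
        print(f"{word.start:.2f}-{word.end:.2f}: {word.word}")
```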

KoljaB commented 2 months ago

Currently you can only switch refinement off, but that only saves like half of the time.

dutchsing009 commented 2 months ago

OK, so I switched refinement off and used tiny instead of large-v3 for the same 7 hours / 990 MB, and it is taking almost 5 hours lol. It is stuck on this (edit: not stuck, it is moving of course, but really slowly; it is like a loop that never ends), and above, in yellow, it says "Streaming output truncated to the last 5000 lines."

[Screenshot of the Colab output attached]