NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Question: what is the best approach to train speaker diarizer #6815

Closed mabajec closed 1 year ago

mabajec commented 1 year ago

Hi,

I need to use speaker diarization in a pipeline with ASR for the Slovenian language. For speech recognition I use a highly accurate CTC Conformer model.

At first, I tried to employ the pre-trained titanet-large and diar_msdd_telephonic models for clustering and neural diarization, respectively, but the results are not satisfactory, particularly for audio files with many speakers, e.g. more than 10. For longer files I would often get only one label, and very often, in speech segments belonging to one speaker, single words would be misclassified. I experimented extensively with all sorts of parameter combinations and preset config files (diar_infer_meeting, diar_infer_general, ...) but had no success.

Then I trained a new titanet-large model from scratch, following the instructions on the NeMo GitHub, and also finetuned one with my own dataset. The dataset includes 2500 unique speakers and approx. 700 hours of speech. Audio files are 1 to 20 seconds long. In terms of the loss function, both models converged well, with the one trained from scratch ending at val_loss=0.13 and the finetuned one at 0.00006. When I use these two models for clustering diarization, however, the results are still quite disappointing: short audio recordings with few speakers are diarized well, but as soon as there are more than a few speakers, the DER increases significantly. Results are sometimes even worse if I preset the number of speakers.

I also trained my own MSDD model by following the "Speaker Diarization Training" tutorial. I tried both training with the speaker embedding extractor model frozen and end-to-end training of both together. Still, I don't get any better results; in fact, there is almost no difference between the results I get from clustering diarization alone and from using neural diarization as the final step.

I guess I must be doing something wrong. Could somebody please let me know what the best approach is to get this done? I need a diarizer that will perform well for recordings with just a few or many speakers, who will very often be unknown (I don't have them in my training dataset). Specifically:

  1. Should I train, or rather finetune, a speaker embedding extractor model (e.g. titanet-large) as suggested in this tutorial, and then employ clustering diarization without MSDD? If so, is there anything else, apart from what is instructed in the Speaker Verification tutorial, that I must change or pay attention to when training a speaker embedding extractor model for diarization purposes (the tutorial is on verification, not diarization)? Is the dataset I have (700 hours, 2500 unique speakers) good enough, or do I need more data? What is the suggested duration for recordings?

  2. Or should I rather train MSDD with a frozen speaker embedding extractor, or even end-to-end as suggested here, and then use neural diarization on top of the clustering results? Are there any specifics I need to know apart from what is explained in the Speaker Diarization Training tutorial?

If required, I can provide details on config settings that I was using for my previous experiments.

Thanks in advance.

tango4j commented 1 year ago

Hi, I'd like to share my thoughts on the issues you're encountering:

(1) Speaker Counting Error: In my experience, diarizing more than 8 speakers is quite challenging. This is also the case with other speaker diarization APIs from cloud service providers. The accuracy of speaker counting is unlikely to improve significantly by re-training the speaker embedding model. Having said that, I suggest tuning the diarizer.clustering.parameters in this yaml file. The recommended range for max_rp_threshold is between 0.03 and 0.2, and for sparse_search_volume, it's between 5 and 30. I assume you've already adjusted max_num_speakers to 10 or more as needed.
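For illustration, here is a minimal sketch (not from the comment above) of overriding those clustering parameters via NeMo's ClusteringDiarizer; the config file name, manifest path, and concrete values are placeholders to tune per dataset:

```python
# Sketch: tuning clustering parameters on a standard NeMo inference config.
# "diar_infer_telephonic.yaml" and all paths/values below are placeholders.
from omegaconf import OmegaConf
from nemo.collections.asr.models import ClusteringDiarizer

cfg = OmegaConf.load("diar_infer_telephonic.yaml")
cfg.diarizer.manifest_filepath = "input_manifest.json"  # hypothetical manifest
cfg.diarizer.out_dir = "diar_outputs"
# VAD and speaker embedding model paths must also be set in cfg.diarizer,
# as in the standard inference config.

# Ranges suggested above: max_rp_threshold 0.03-0.2, sparse_search_volume 5-30.
cfg.diarizer.clustering.parameters.max_rp_threshold = 0.1
cfg.diarizer.clustering.parameters.sparse_search_volume = 20
cfg.diarizer.clustering.parameters.max_num_speakers = 12  # raise for many speakers

diarizer = ClusteringDiarizer(cfg=cfg)
diarizer.diarize()
```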

(2) Short Segment Error: If you're frequently encountering short segment errors, consider adjusting the multiscale weights in speaker_embeddings.parameters.multiscale_weights. Giving more weight to longer scales, for instance, [1.4,1.3,1.2,1.1,1.0], could potentially reduce these errors.
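Continuing the hypothetical cfg object from the sketch above, the weights could be set like this; the list needs one weight per scale, ordered to match the scales in the config (longest first in the default setup):

```python
# Sketch: favor longer scales, per the suggestion above. One weight per
# entry in window_length_in_sec / shift_length_in_sec, longest scale first.
cfg.diarizer.speaker_embeddings.parameters.multiscale_weights = [1.4, 1.3, 1.2, 1.1, 1.0]
```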

(3) Fine-tuning Strategy: Please note that my suggestions do not guarantee a performance improvement. However, I would recommend the following step-by-step methods while monitoring for improvements:

1. Adjust the multiscale weights and clustering parameters to see if the DER improves.
2. Train MSDD with TitaNet frozen, and adjust msdd_model.parameters.sigmoid_threshold to achieve the lowest DER (see the sketch after this list).
3. Fine-tune or resume training on TitaNet, but with a large enough dataset (such as VoxCeleb/Fisher/Switchboard) combined with your custom data. In my experience, fine-tuning on a small speaker-recognition dataset (fewer than 1000 speakers) negatively impacted speaker diarization performance. Since you have a pretty big dataset (2500 speakers), mixing in additional datasets might help.
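A hedged sketch of step 2, assuming NeMo's NeuralDiarizer and the same placeholder config as above; the threshold grid is purely illustrative:

```python
# Sketch: sweep the MSDD sigmoid threshold and keep the value with lowest DER.
from omegaconf import OmegaConf
from nemo.collections.asr.models.msdd_models import NeuralDiarizer

cfg = OmegaConf.load("diar_infer_telephonic.yaml")  # placeholder config
for threshold in [0.7, 0.8, 0.9, 1.0]:
    cfg.diarizer.msdd_model.parameters.sigmoid_threshold = [threshold]
    diarizer = NeuralDiarizer(cfg=cfg)
    diarizer.diarize()
    # Score each run against reference RTTMs and pick the threshold
    # that yields the lowest DER.
```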

Given that I haven't worked with your specific dataset, my suggestions might have limited effectiveness. However, providing more information about your DER, error tendencies, and configuration yaml file would be helpful.

mabajec commented 1 year ago

Dear tango4j,

Thanks for your insights and recommendations. I will experiment as you suggested and let you know if there are any improvements.

Marko

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 1 year ago

This issue was closed because it has been inactive for 7 days since being marked as stale.

gabitza-tech commented 1 year ago

Hello @mabajec,

Sorry for the off-topic question, but I happened to notice you are using a CTC Conformer for the Slovenian language. I am trying to finetune the English Conformer Transducer Medium model for Romanian.

Could you please provide some information about your training process? (e.g. what model size you used, how many epochs you trained, the optimizer, scheduler, batch size)

I am training on a pretty low-resource setup and I am tight on time, so any insight is greatly appreciated.

Thanks in advance!!

mabajec commented 1 year ago

Hi @gabitza-tech,

We made many training attempts with different amounts of training material and NN architectures. At the beginning we had less than 100 hours available, and the WER we achieved with CTC was higher than what we could get with two-phase training (where you train the acoustic and language models separately, as in the Kaldi framework, for instance). With 1000 hours, however, CTC was already more accurate for generic ASR models. What we figured out, and what I suggest to you too, is to first follow the recipes that are available on the NeMo GitHub and then start changing parameters and settings.

We used CTC BPE large as the starting point, but today it might be better to start from Fast Conformer, as it seems capable of learning faster. We always used bf16 precision. We usually let training run for as long as the loss kept dropping, but this was never more than a few hundred epochs. What is also important (and language specific) is how you do the tokenisation (if you start from BPE or similar models that use tokenisers). You can use different tokenisers and different numbers of tokens. We started with "sentencepiece unigram".
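For the tokenisation step, here is a minimal sketch (not our exact setup; the input file and vocab size are placeholders) of training a SentencePiece unigram tokeniser:

```python
# Sketch: build a "sentencepiece unigram" tokenizer for a new language.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="train_transcripts.txt",  # hypothetical: one transcript per line
    model_prefix="tokenizer_ro",    # writes tokenizer_ro.model / .vocab
    vocab_size=1024,                # experiment with the number of tokens
    model_type="unigram",
)
# NeMo's scripts/tokenizers/process_asr_text_tokenizer.py wraps this step
# and writes the tokenizer directory in the layout NeMo ASR models expect.
```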

I am sure you already studied the sources below. If not, I suggest you read them carefully as they include useful information.

Some useful sources:

https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/ASR_CTC_Language_Finetuning.ipynb
https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/examples/kinyarwanda_asr.html

Regards, Marko

gabitza-tech commented 1 year ago

@mabajec Thank you a lot for your insightful response and time!!

Best regards, Gabi