Closed: Klassikcat closed this issue 2 years ago
Did you use VAD from ASR or native nemo VAD?
@nithinraok I've used VAD from ASR(Conformer-CTC-BPE).
Can you switch it to use the native VAD, vad_multilingual_marblenet?
Or if you would like to use VAD from ASR ... you could try changing asr_based_vad_threshold to 1.0.
@nithinraok Thanks for your help, but neither of them worked.
However, I found something interesting in the speaker_outputs folder (the output directory of the speaker embeddings):
in every subsegments_scale.json file, all labels were "UNK" and all uniq_id values were null.
Could this be related to the problem I'm facing?
import os
from omegaconf import OmegaConf
pretrained_vad = 'vad_multilingual_marblenet'
pretrained_speaker_model = os.path.join(os.getcwd(), 'nemo_experiments', 'TitaNet', '2022-10-17_19-48-25', 'checkpoints', 'TitaNet.nemo')
cfg = OmegaConf.load(os.path.join(os.getcwd(), 'model_cards', 'diarization', 'diar_infer_telephonic.yaml'))
cfg.num_workers = 1
cfg.diarizer.manifest_filepath = os.path.join(os.getcwd(), 'input_manifest.json')
cfg.diarizer.out_dir = 'data/' # Directory to store intermediate files and prediction outputs
cfg.diarizer.speaker_embeddings.model_path = pretrained_speaker_model
cfg.diarizer.oracle_vad = False # compute VAD provided with model_path to vad config
cfg.diarizer.clustering.parameters.oracle_num_speakers=False
# Here we use our in-house pretrained NeMo VAD
cfg.diarizer.vad.model_path = pretrained_vad
cfg.diarizer.vad.parameters.onset = 0.8
cfg.diarizer.vad.parameters.offset = 0.6
cfg.diarizer.vad.parameters.pad_offset = -0.05
from nemo.collections.asr.models import ClusteringDiarizer
sd_model = ClusteringDiarizer(cfg=cfg)
sd_model.diarize()
import os
from omegaconf import OmegaConf
cfg = OmegaConf.load(os.path.join(os.getcwd(), 'model_cards', 'diarization', 'diar_infer_telephonic.yaml'))
cfg.diarizer.manifest_filepath = os.path.join(os.getcwd(), 'input_manifest.json')
cfg.diarizer.speaker_embeddings.model_path = os.path.join(os.getcwd(), 'nemo_experiments', 'TitaNet', '2022-10-17_19-48-25', 'checkpoints', 'TitaNet.nemo')
cfg.diarizer.clustering.parameters.max_num_speakers = 8
cfg.diarizer.asr.model_path = os.path.join(os.getcwd(), 'checkpoints', 'conformer', 'Conformer-CTC-BPE.nemo')
cfg.diarizer.out_dir = os.getcwd()
cfg.diarizer.asr.asr_based_vad_threshold = 1.0
{"audio_filepath": "/home/insutil/codes/NeMo/data/sample.wav", "offset": 239.87, "duration": 0.5, "label": "UNK", "uniq_id": null}
{"audio_filepath": "/home/insutil/codes/NeMo/data/sample.wav", "offset": 240.12, "duration": 0.5, "label": "UNK", "uniq_id": null}
{"audio_filepath": "/home/insutil/codes/NeMo/data/sample.wav", "offset": 240.37, "duration": 0.5, "label": "UNK", "uniq_id": null}
{"audio_filepath": "/home/insutil/codes/NeMo/data/sample.wav", "offset": 240.62, "duration": 0.5, "label": "UNK", "uniq_id": null}
{"audio_filepath": "/home/insutil/codes/NeMo/data/sample.wav", "offset": 240.87, "duration": 0.28999999999999204, "label": "UNK", "uniq_id": null}
{"audio_filepath": "/home/insutil/codes/NeMo/data/sample.wav", "offset": 242.14, "duration": 0.37999999999999545, "label": "UNK", "uniq_id": null}
{"audio_filepath": "/home/insutil/codes/NeMo/data/sample.wav", "offset": 243.34, "duration": 0.30000000000001137, "label": "UNK", "uniq_id": null}
Train config :
manifest_filepath: train_speakers.json
sample_rate: 16000
labels:
- 8
- 191
- 236
- 366
- 423
- 432
- 575
- 624
- 776
- 922
- 1053
- 1057
- 1123
- 1136
- 1221
- 1330
- 1347
- 1379
- 1462
- 1553
- 1625
- 1667
- 1718
- 1964
- 1989
- 2062
- 2104
- 2122
- 2131
- 2177
- 2254
- 2357
- 2556
- 2615
- 2662
- 2674
- 2704
- 2758
- 2805
- 2987
- 3007
- 3070
- 3093
- 3159
- 3499
- 3544
- 3548
- 3562
- 3573
- 3580
- 3604
- 3664
- 3715
- 3740
- 3775
- 3791
- 3858
- 3988
- 4055
- 9036
batch_size: 32
shuffle: true
is_tarred: false
tarred_audio_filepaths: null
tarred_shard_strategy: scatter
augmentor:
  speed:
    prob: 0.3
    sr: 16000
    resample_type: kaiser_fast
    min_speed_rate: 0.95
    max_speed_rate: 1.05
Validation config :
manifest_filepath: eval_speakers.json
sample_rate: 16000
labels: null
batch_size: 32
shuffle: false
It turns out that there were some problems with the weights or settings in my YAML. Using the NGC TitaNet with the fine-tuned Conformer works fine when cfg.diarizer.asr.parameters.asr_based_vad_threshold = 1.0, even for Korean. Thanks for your help, @nithinraok.
Hello @Klassikcat @nithinraok, could you give me an update on how this issue was solved? I am facing the same issue with my speaker recognition model. I fine-tuned the TitaNet model, and with batch inference on multiple audio files in the test set the accuracy is around 95%, but the model returns a single label for every audio file when each file is tested separately using the get_label() function.
@Sreeni1204
Judging from the issue I faced before, yours looks like a YAML configuration problem. I think it is related to the loss and the VAD threshold. For the threshold, you should set it to 1.0. For the loss, there is a comment in the speaker embedding config on how to set it; the default setting is for speaker verification.
If that doesn't work, try the titanet-large-en weights from NGC without fine-tuning. They worked for Korean speech with a fine-tuned Conformer, so they should work for Japanese, English, and other languages as well.
Describe the bug
I fine-tuned the TitaNet-Large model for 10 epochs on a Korean dataset of 1,000,000 samples from 60 speakers, because the NGC TitaNet (trained for English) predicts only one token. I checked that the loss decreased not only in the training step but also in the validation step (the minimum loss was 0.0079).
However, the fine-tuned model also predicts only one token, <speaker 0>:
Since the label in the manifest file is an integer, this seems to be a label-related issue, but there are a few possible causes I can think of.
Additional information
The diarization score is None in the code below (from ASR_with_SpeakerDiarization.ipynb).
Steps/Code to reproduce bug
Example of manifest file for training:
{"audio_filepath":"\/home\/me\/datas\/speaker_tasks\/datas\/1.Training\/original\/call\/2021-12-16\/3544\/A0210-3544M2010-11020010-06897809.wav","label":3544,"offset":0,"duration":1.38}
YAML Configuration for training
YAML Configuration for prediction
Prediction: uses the ASR_with_SpeakerDiarization.ipynb notebook in tutorials/speaker_tasks.