MahmoudAshraf97 / whisper-diarization

Automatic Speech Recognition with Speaker Diarization based on OpenAI Whisper
BSD 2-Clause "Simplified" License

change alignment library from `whisperx` to `ctc-forced-aligner` #184

Closed. MahmoudAshraf97 closed this 1 month ago

MahmoudAshraf97 commented 2 months ago

Pros:

Cons:

transcriptionstream commented 2 months ago

Wondering where the license info for the universal multilingual model can be found.

MahmoudAshraf97 commented 2 months ago

Hi @transcriptionstream, it's here: https://huggingface.co/MahmoudAshraf/mms-300m-1130-forced-aligner
I was about to request your review btw :)

transcriptionstream commented 2 months ago

I'll work on getting a build going and test it out. Intrigued by the performance increase. I've got over 30k diarizations done (and counting) for a recent client using the old model; the increase in speed with this model sounds wild and game changing!

transcriptionstream commented 2 months ago

Getting the following errors when trying to build this branch:

Some weights of the model checkpoint at MahmoudAshraf/mms-300m-1130-forced-aligner were not used when initializing Wav2Vec2ForCTC: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at MahmoudAshraf/mms-300m-1130-forced-aligner and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.weight_g', 'wav2vec2.encoder.pos_conv_embed.conv.weight_v']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

diarize.py:134 in <module>
    emissions, stride = generate_emissions(
alignment_utils.py:129 in generate_emissions
    emissions_ = model(input_batch).logits
module.py:1501 in _call_impl
    return forward_call(*args, **kwargs)
modeling_wav2vec2.py:1969 in forward
    outputs = self.wav2vec2(
module.py:1501 in _call_impl
    return forward_call(*args, **kwargs)
modeling_wav2vec2.py:1554 in forward
    extract_features = self.feature_extractor(input_values)
module.py:1501 in _call_impl
    return forward_call(*args, **kwargs)
modeling_wav2vec2.py:461 in forward
    hidden_states = conv_layer(hidden_states)
module.py:1501 in _call_impl
    return forward_call(*args, **kwargs)
modeling_wav2vec2.py:336 in forward
    hidden_states = self.conv(hidden_states)
module.py:1501 in _call_impl
    return forward_call(*args, **kwargs)
conv.py:313 in forward
    return self._conv_forward(input, self.weight, self.bias)
conv.py:309 in _conv_forward
    return F.conv1d(input, weight, bias, self.stride,

RuntimeError: "slow_conv2d_cpu" not implemented for 'Half'

MahmoudAshraf97 commented 2 months ago

You can ignore the first warning, see https://github.com/huggingface/transformers/issues/30628. The second error is fixed: the model was set to load in float16, which isn't supported on CPU, so I added a device check first.
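For reference, a minimal sketch of that kind of device/dtype guard, written against the plain transformers API (the exact wiring inside the aligner code may differ):

```python
import torch
from transformers import Wav2Vec2ForCTC

# Half-precision convolutions are not implemented on CPU, so only request
# float16 when a CUDA device is actually available.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

model = Wav2Vec2ForCTC.from_pretrained(
    "MahmoudAshraf/mms-300m-1130-forced-aligner",
    torch_dtype=dtype,
).to(device)
```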

transcriptionstream commented 2 months ago

Thanks! Got it built and am running it through its paces. So far so good. Trying to get some good benchmarks on the speed improvement. Quick tests show it's definitely faster and the output is consistent with whisperx. Would love to try it in a prod env if the license can be modified.

MahmoudAshraf97 commented 2 months ago

Unfortunately the license is the decision of the model owners; I just reuploaded it to HF. But you can mitigate that by using another English model that has a suitable license, which also works for all languages other than English because the idea is the same (romanize and normalize all languages to match the model vocab). I suggest using https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-english. Btw, commercial usage != production usage, so you might want to review whether your usage is actually considered commercial.
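As a rough illustration, swapping the alignment model largely comes down to loading a different CTC checkpoint and computing frame-level log-probabilities from it. A hedged sketch with the plain transformers API (the repo's own aligner code and any romanization/normalization step are not shown):

```python
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Permissively licensed English CTC checkpoint suggested above.
checkpoint = "jonatasgrosman/wav2vec2-large-xlsr-53-english"
processor = Wav2Vec2Processor.from_pretrained(checkpoint)
model = Wav2Vec2ForCTC.from_pretrained(checkpoint).eval()

def emissions_for(waveform, sample_rate=16_000):
    """Per-frame log-probabilities over the model's character vocabulary."""
    inputs = processor(waveform, sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    return torch.log_softmax(logits, dim=-1)
```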

transcriptionstream commented 1 month ago

> Unfortunately the license is the decision of the model owners; I just reuploaded it to HF. But you can mitigate that by using another English model that has a suitable license […]

Any chance you can put me in contact with the model owners? I'd love to ask some questions and see what they'd need to license it for commercial use.

MahmoudAshraf97 commented 1 month ago

> Any chance you can put me in contact with the model owners? I'd love to ask some questions and see what they'd need to license it for commercial use.

I don't have direct contact information unfortunately, but these are the relevant links:
https://llama.meta.com/faq/#legal
https://arxiv.org/abs/2305.13516