collabora / WhisperSpeech

An Open Source text-to-speech system built by inverting Whisper.
https://collabora.github.io/WhisperSpeech/
MIT License

possibly use MLX for MacOS users with WhisperSpeech #111

Open BBC-Esq opened 8 months ago

BBC-Esq commented 8 months ago

The purpose is to discuss possibly implementing MLX support for macOS users. For example, PyTorch currently doesn't support the FFT operation on MPS, whereas MLX does. This means that WhisperSpeech must put certain models and/or tensors on the CPU for macOS users...whereas CUDA users get the full speedup.

Possibly compatible with the operators WhisperSpeech uses: ![image](https://github.com/collabora/WhisperSpeech/assets/108230321/d4edba46-d906-4814-926c-04112a823b85)

- Option 1 - implement MLX just where MPS can't be used
- Option 2 - completely replace MPS with MLX
- Option 3 - replace MPS with MLX as much as possible, based on the multiple models involved in WhisperSpeech and whether each specifically can be run with MLX
- Option 4 - offer MLX IN ADDITION to MPS for all macOS users

In most cases, MLX provides a 2-3x speedup compared to MPS.

Here are some snippets from the Medium article:

Benchmark Setup ![image](https://github.com/collabora/WhisperSpeech/assets/108230321/0d08be9b-75a4-4761-b8ba-9b7dbf5e9415)
Linear Layer ![image](https://github.com/collabora/WhisperSpeech/assets/108230321/47c49246-beeb-40e3-8f29-cd4828ff24d4)
Softmax ![image](https://github.com/collabora/WhisperSpeech/assets/108230321/291ee76f-2ba3-4a9f-ab27-a0d24eb884d6)
Sigmoid ![image](https://github.com/collabora/WhisperSpeech/assets/108230321/d9d15802-1dcc-4792-b014-41d72b1c54aa)
Concatenation ![image](https://github.com/collabora/WhisperSpeech/assets/108230321/f357425f-1bd0-4571-931b-5a9bc5da27de)
Binary Cross Entropy ![image](https://github.com/collabora/WhisperSpeech/assets/108230321/7871ad9a-ce86-4981-bdc1-b5a2770808d5)
Sort ![image](https://github.com/collabora/WhisperSpeech/assets/108230321/86e8bd16-da86-412c-8fb8-500a76503728)
Conv2D ![image](https://github.com/collabora/WhisperSpeech/assets/108230321/2d3e1f06-a500-4e4b-bcde-ec13ec37ccd8)
Unified Memory Gamechanger ![image](https://github.com/collabora/WhisperSpeech/assets/108230321/8712e6ae-0588-4a8e-91e4-c3b48b77bc1d)

MLX:

https://github.com/ml-explore/mlx

MLX Examples:

https://github.com/ml-explore/mlx-examples/tree/main/llms/llama

MLX Community:

https://huggingface.co/mlx-community

MLX Bark:

https://huggingface.co/mlx-community/mlx_bark (would currently beat all of WhisperSpeech's MPS implementations as far as speed goes)

Sample MLX Whisper Script:

https://github.com/ml-explore/mlx-examples/blob/main/whisper/whisper/transcribe.py

Example MLX Whisper model:

https://huggingface.co/mlx-community/whisper-large-v2-mlx

signalprime commented 8 months ago

Great initiative @BBC-Esq! I'll definitely circle back to this one as soon as possible.

BBC-Esq commented 8 months ago

@signalprime It would take someone with more programming experience than me to implement, especially since I don't own a Mac, but I thought I'd start the discussion anyway. Interested as always in what you find out.

BBC-Esq commented 8 months ago

UPDATE: It looks like PyTorch might be getting support sooner rather than later...

https://github.com/pytorch/pytorch/commit/53bfae2c066fcd06784dfa051cd7e2eb5ba5c8fa

signalprime commented 8 months ago

I'm definitely looking into it. Reviewing the Vocos model today

BBC-Esq commented 8 months ago

> I'm definitely looking into it. Reviewing the Vocos model today

I'd love to learn if you want to keep me posted and teach me along the way, just FYI. This is not my profession but a hobby.

signalprime commented 8 months ago

Absolutely @BBC-Esq, I will keep you in the loop about it. MLX mimics the PyTorch API in most ways. I've been building models since before we had frameworks like TF and Torch, and in this case I'll be rebuilding the Vocos model using the MLX library. It just depends on time constraints.

I recently finished a long project with ML/RL in the finance domain and put in an application with Collabora last week. Would you put in a nice word for me @jpc?

signalprime commented 8 months ago

I'm getting closer... almost reached the end of the rabbit hole. We have a standard whisper model for MLX already established.

I was able to convert the Vocos model and weights to MLX, but ran into many issues with its feature extractor. MLX doesn't have weight_norm implemented yet. I've dug into the code, and am debating, when I have time, adding the _weight_norm primitive to the C++ MLX library:

https://github.com/pytorch/pytorch/blob/834c7a1d3ea07878ad87d127ee28606fc140b552/aten/src/ATen/native/WeightNorm.cpp#L50

I'd like to do a little more research before trying that because it could perhaps be handled another way, or not be needed at all, kinda like a quick initial pass-through. I removed those references and there are some other issues; kinda out of energy for this today.
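
For what it's worth, weight normalization is just the reparameterization w = g * v / ||v||, with the norm taken over every axis except one, so a missing fused `_weight_norm` primitive can in principle be rebuilt from multiply, sum, and sqrt, which MLX already has. A minimal sketch of the math in NumPy (standing in for `mx` ops, since MLX only runs on Apple silicon; not tested against torch's implementation):

```python
import numpy as np

# w = g * v / ||v||, where ||v|| is computed over all axes except `dim`
# (PyTorch's default dim=0 keeps one norm per output channel).
def weight_norm(v, g, dim=0):
    axes = tuple(i for i in range(v.ndim) if i != dim)
    norm = np.sqrt(np.sum(v * v, axis=axes, keepdims=True))
    return g * v / norm

v = np.array([[3.0, 4.0], [6.0, 8.0]])  # rows have norms 5 and 10
g = np.array([[1.0], [2.0]])            # one gain per output row
w = weight_norm(v, g)                   # rows rescaled to norms 1 and 2
```

In MLX the same three ops would be `mx.sum`, `mx.sqrt`, and broadcasting multiply/divide, which is presumably what "possible using existing ops" means.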

BBC-Esq commented 8 months ago

Interesting...

signalprime commented 8 months ago

Good thing I waited. I got a response that it should be possible using existing ops.

Here is the whisper model in MLX format, which is used during voice cloning.

I was working with MLX conversions for all the parts of the Vocos model. Transferring weights wasn't an issue, but components used in the functions also likely need to be updated. I'm still becoming familiar, but it seems parts can be mixed and matched, as in a tensor can be converted to an MLX array, passed to an MLX component, and back to a tensor later. That would appear necessary since I wouldn't want to keep going further and further into torchaudio, for example. Ideally we just put in a replacement for the components where torch doesn't yet support the ops.
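
That mix-and-match flow might look roughly like this, using the FFT (the op that motivated this issue) as the example. This is only a sketch: NumPy stands in for both frameworks so it runs anywhere, `fft_via_mlx` is a hypothetical helper name, and the real bridges would be `mx.array(t.detach().cpu().numpy())` going in and `torch.from_numpy(np.array(a))` coming back.

```python
import numpy as np

# Hop out of PyTorch/MPS only for an op MPS lacks, then hop back.
def fft_via_mlx(signal):
    a = np.asarray(signal)   # torch tensor -> host array (.detach().cpu().numpy())
    spec = np.fft.rfft(a)    # would be mx.fft.rfft(mx.array(a)) on the GPU
    return spec              # -> torch.from_numpy(np.array(spec)) in real code

x = np.ones(8)
spec = fft_via_mlx(x)        # all energy lands in the DC bin: spec[0] == 8
```

The host round-trip costs a copy each way, so it only pays off when the alternative is falling back to the CPU for the whole op anyway.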

BBC-Esq commented 8 months ago

> Good thing I waited. I got a response that it should be possible using existing ops.
>
> Here is the whisper model in MLX format, which is used during voice cloning.
>
> I was working with MLX conversions for all the parts of the Vocos model. Transferring weights wasn't an issue, but components used in the functions also likely need to be updated. I'm still becoming familiar, but it seems parts can be mixed and matched, as in a tensor can be converted to an MLX array, passed to an MLX component, and back to a tensor later. That would appear necessary since I wouldn't want to keep going further and further into torchaudio, for example. Ideally we just put in a replacement for the components where torch doesn't yet support the ops.

That's what my intuition was telling me based on what I read about MLX, but I am far from an expert and would have no way to verify it. My initial hypothesis was that it might be possible to use MLX for some (but not all) of the necessary operations, kind of mixing and matching like you were saying. Math is math...but again, this is totally a novice-intuition kind of thing.

Let me know if I can help out any...

BBC-Esq commented 8 months ago

Not sure if it's relevant, but apparently aten::upsample_linear1d has been implemented on PyTorch's development branch (not included in a release yet though):

https://github.com/pytorch/pytorch/pull/116630#issuecomment-1965380887

BBC-Esq commented 8 months ago

@signalprime how's it going? Any updates?

signalprime commented 8 months ago

Hi @BBC-Esq I haven't had an opportunity to resume work on this unfortunately, my friend

BBC-Esq commented 8 months ago

Hey @signalprime I hope you don't stop working on this kind of stuff even if you don't get the job with Collabora. I enjoy working with ya and look forward to improving this all-around kick ass library. Just throwing that out there!

signalprime commented 8 months ago

Likewise @BBC-Esq, I'll keep it on my mind and make time to return to the effort. It's definitely not related to Collabora, rather the launch of another project, meetings, and the occasional things that pull us away from our desks. At the next go, I'll try the mixed approach where, rather than converting everything to MLX, we just use MLX ops where coverage is still missing in torch. If that works it should keep things simpler. I've been spending a lot of time working with autonomous agents, and giving them a good voice, in whatever style we prefer, is an important feature.

jpc commented 8 months ago

@signalprime Sure, I'll see what I can do :)

jpc commented 8 months ago

@signalprime Btw. do you have a Discord? Maybe we could have a chat there?

signalprime commented 8 months ago

@jpc yes absolutely, I sent you an email with details. Looking forward to it!

touhi99 commented 6 months ago

Is it still working with MPS? I couldn't make it run on the current main branch; it uses CPU only.