BBC-Esq opened this issue 8 months ago
Great initiative @BBC-Esq ! I'll definitely circle back to this one as soon as possible
@signalprime It would take someone with more programming experience than me to implement, especially since I don't own a Mac, but thought I'd start the discussion anyways. Interested as always in what you find out.
UPDATE: Looks like PyTorch might be getting support sooner rather than later...
https://github.com/pytorch/pytorch/commit/53bfae2c066fcd06784dfa051cd7e2eb5ba5c8fa
I'm definitely looking into it. Reviewing the Vocos model today
I'd love to learn if you want to keep me posted and teach me along the way, just FYI. This is not my profession but a hobby.
Absolutely @BBC-Esq, I will keep you in the loop about it. MLX mimics the pytorch API in most ways. I've been building models since before we had frameworks like TF and Torch, and in this case I'll be rebuilding the Vocos model using the MLX library. It just depends on time constraints.
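For anyone following along, here's a rough sketch of what that API similarity looks like; the `TinyMLP` module is purely illustrative, not part of Vocos:

```python
import mlx.core as mx
import mlx.nn as nn

# mlx.nn mirrors torch.nn's module style: subclass nn.Module, define
# layers in __init__, and implement the forward pass (__call__ in MLX).
class TinyMLP(nn.Module):
    def __init__(self, dims: int):
        super().__init__()
        self.fc1 = nn.Linear(dims, dims)
        self.fc2 = nn.Linear(dims, dims)

    def __call__(self, x: mx.array) -> mx.array:
        return self.fc2(nn.relu(self.fc1(x)))

model = TinyMLP(64)
out = model(mx.random.normal((1, 64)))
```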
I recently finished a long project with ML/RL in the finance domain and put in an application with Collabora last week. Would you put in a nice word for me @jpc?
I'm getting closer... almost at the end of the tunnel. We have a standard whisper model for MLX already established.
I was able to convert the Vocos model and weights to MLX, but I ran into many issues with its feature extractor. MLX doesn't have `weight_norm` established yet. I've dug into the code and am debating whether, when I have time, to add the `_weight_norm` primitive to the C++ MLX library.
I'd like to do a little more research before trying that, because it could perhaps be handled another way, or not be needed at all; this was just a quick initial pass-through. I removed those references and there are some other issues, but I'm kinda out of energy for this today.
Interesting...
Good thing I waited. I got a response that it should be possible using existing ops.
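For reference, weight normalization is just the reparameterization w = g · v / ‖v‖, so a sketch of it with existing MLX ops could look like the following (untested against the Vocos weights; note PyTorch normalizes over every dim except `dim`, so the axis handling would need to match):

```python
import mlx.core as mx

def weight_norm(v: mx.array, g: mx.array, axis: int = 0) -> mx.array:
    # w = g * v / ||v||, built from existing MLX ops -- no dedicated
    # weight_norm primitive needed. Simplified: normalizes along one
    # axis, while torch normalizes over all dims except its `dim`.
    norm = mx.sqrt(mx.sum(v * v, axis=axis, keepdims=True))
    return g * v / norm
```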
Here is the whisper model in MLX format, which is used during voice cloning.
I was working with MLX conversions for all the parts of the Vocos model. Transferring weights wasn't an issue, but components used in the functions also likely need to be updated. I'm still becoming familiar, but it seems parts can be mixed and matched... as in, a tensor can be converted to an MLX array, passed to an MLX component, and converted back to a tensor later. That would appear necessary, since I wouldn't want to keep going further and further into torchaudio, for example. Ideally we just put in a replacement for the components where torch doesn't yet support the ops.
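The round trip itself is easy to write down; a sketch of the conversion helpers I have in mind (going through NumPy, and assuming the tensors are on the CPU):

```python
import mlx.core as mx
import numpy as np
import torch

def torch_to_mlx(t: torch.Tensor) -> mx.array:
    # detach, move to CPU, and hand the NumPy buffer to MLX
    return mx.array(t.detach().cpu().numpy())

def mlx_to_torch(a: mx.array) -> torch.Tensor:
    # MLX arrays convert cleanly through NumPy back to torch
    return torch.from_numpy(np.array(a))
```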
That's what my intuition was telling me based on what I read about MLX, but I am far from an expert and would have no way to verify it. My initial hypothesis was that it might be possible to use MLX for some (but not all) of the necessary operations, mixing and matching like you were saying. Math is math... but again, this is totally a novice intuition kind of thing.
Let me know if I can help out any...
Not sure if it's relevant, but apparently `aten::upsample_linear1d` has been implemented on PyTorch's development branch (not included in a release yet, though):
https://github.com/pytorch/pytorch/pull/116630#issuecomment-1965380887
@signalprime how's it going? Any updates?
Hi @BBC-Esq I haven't had an opportunity to resume work on this unfortunately, my friend
Hey @signalprime I hope you don't stop working on this kind of stuff even if you don't get the job with Collabora. I enjoy working with ya and look forward to improving this all-around kick ass library. Just throwing that out there!
Likewise @BBC-Esq, I'll keep it on my mind and make time to return to the effort. It's definitely not related to Collabora, but rather the launch of another project, meetings, and the occasional things that pull us away from our desks. On the next go, I'll try the mixed approach: rather than converting everything to MLX, we just use MLX ops where coverage is still missing in torch. If that works, it should keep things simpler. I've been spending a lot of time working with autonomous agents, and giving them a good voice, in whatever style we prefer, is an important feature.
@signalprime Sure, I'll see what I can do :)
@signalprime Btw. do you have a Discord? Maybe we could have a chat there?
@jpc yes absolutely, I sent you an email with details. Looking forward to it!
Is it still working with MPS? I couldn't make the current main branch run with it; it uses CPU only.
The purpose is to discuss possibly implementing MLX support for macOS users. For example, PyTorch currently doesn't support the FFT operation on MPS, whereas MLX does. This means that WhisperSpeech must put certain models and/or tensors on the CPU for macOS users, whereas CUDA users get the full speedup.
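As a concrete illustration (a hypothetical fallback, not code that exists in WhisperSpeech today): instead of computing the FFT with torch on the CPU, the transform could be routed through MLX:

```python
import mlx.core as mx
import numpy as np
import torch

def rfft_via_mlx(x: torch.Tensor) -> torch.Tensor:
    # hypothetical fallback: compute the real FFT with MLX (which
    # supports FFT ops) rather than with torch on the CPU; the data
    # still round-trips through NumPy in this sketch
    y = mx.fft.rfft(mx.array(x.detach().cpu().numpy()))
    return torch.from_numpy(np.array(y))
```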
Possibly compatible with the operators WhisperSpeech uses:
![image](https://github.com/collabora/WhisperSpeech/assets/108230321/d4edba46-d906-4814-926c-04112a823b85)

Option 1 - implement MLX only where MPS can't be used.
Option 2 - completely replace MPS with MLX.
Option 3 - replace MPS with MLX as much as possible, based on the multiple models involved in WhisperSpeech and whether each one specifically can be run with MLX.
Option 4 - offer MLX IN ADDITION to MPS for all macOS users.
In general, MLX provides a 2-3x speedup over MPS in most cases.
Here are some snippets from the Medium article:
Benchmark Setup
![image](https://github.com/collabora/WhisperSpeech/assets/108230321/0d08be9b-75a4-4761-b8ba-9b7dbf5e9415)

Linear Layer
![image](https://github.com/collabora/WhisperSpeech/assets/108230321/47c49246-beeb-40e3-8f29-cd4828ff24d4)

Softmax
![image](https://github.com/collabora/WhisperSpeech/assets/108230321/291ee76f-2ba3-4a9f-ab27-a0d24eb884d6)

Sigmoid
![image](https://github.com/collabora/WhisperSpeech/assets/108230321/d9d15802-1dcc-4792-b014-41d72b1c54aa)

Concatenation
![image](https://github.com/collabora/WhisperSpeech/assets/108230321/f357425f-1bd0-4571-931b-5a9bc5da27de)

Binary Cross Entropy
![image](https://github.com/collabora/WhisperSpeech/assets/108230321/7871ad9a-ce86-4981-bdc1-b5a2770808d5)

Sort
![image](https://github.com/collabora/WhisperSpeech/assets/108230321/86e8bd16-da86-412c-8fb8-500a76503728)

Conv2D
![image](https://github.com/collabora/WhisperSpeech/assets/108230321/2d3e1f06-a500-4e4b-bcde-ec13ec37ccd8)

Unified Memory Gamechanger
![image](https://github.com/collabora/WhisperSpeech/assets/108230321/8712e6ae-0588-4a8e-91e4-c3b48b77bc1d)

MLX:
https://github.com/ml-explore/mlx
MLX Examples:
https://github.com/ml-explore/mlx-examples/tree/main/llms/llama
MLX Community:
https://huggingface.co/mlx-community
MLX Bark:
https://huggingface.co/mlx-community/mlx_bark (would currently beat all of WhisperSpeech's MPS implementations in terms of speed)
Sample MLX Whisper Script:
https://github.com/ml-explore/mlx-examples/blob/main/whisper/whisper/transcribe.py
Example MLX Whisper model:
https://huggingface.co/mlx-community/whisper-large-v2-mlx