alxndrTL / mamba.py

A simple and efficient Mamba implementation in pure PyTorch and MLX.
MIT License

Possible SSM-Transformers implementation? #18

Open · severian42 opened this issue 6 months ago

severian42 commented 6 months ago

Hey! Awesome work on this project! I know it's not technically vanilla Mamba, but I've been trying to convert the new SSM-Transformer Jamba into MLX for more efficient training and usability, and I'm having a difficult time. My specialty is in the training/datasets world, and I'm not the strongest in the core math behind model architectures beyond the basic implementations.

Would somebody know of an easier way to get Jamba converted into MLX? I truly think Jamba has A LOT to offer and could do some awesome stuff in the MLX format and for local model training on a Mac.

I've provided the modeling script released by AI21 for quick reference. Is this feasible or just way too complicated at the moment?

modeling_jamba.txt

alxndrTL commented 6 months ago

Hello, I think running Jamba with MLX would be possible and not too hard with mamba.py. It is already possible to load and run a pre-trained Mamba model in MLX with mamba.py; adding attention layers is just another step! There are two things to point out:

- at the moment, the MLX version of mamba.py uses a lot of memory (at least at inference), possibly because depthwise 1D convolution is not available in MLX as of now, so it has to be done manually (sketched below);
- training is slow compared to the torch version due to the way MLX operates on arrays (as of now).
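To illustrate the first point: since MLX currently lacks a built-in depthwise 1D convolution, it can be emulated with left-padding, slicing, and broadcasting. A minimal sketch of that idea (not the exact code used in mamba.py):

```python
import mlx.core as mx

def depthwise_conv1d_causal(x, weight, bias=None):
    """Causal depthwise 1D convolution built from basic MLX ops.

    x:      (B, L, D) input sequence
    weight: (D, K)    one length-K filter per channel
    bias:   (D,)      optional per-channel bias
    """
    B, L, D = x.shape
    K = weight.shape[1]
    # Left-pad the time axis so position t only sees positions <= t.
    x_pad = mx.pad(x, [(0, 0), (K - 1, 0), (0, 0)])
    out = mx.zeros((B, L, D))
    for k in range(K):
        # Shifted slice of the input, weighted by tap k of each channel's filter.
        out = out + x_pad[:, k:k + L, :] * weight[:, k]
    return out if bias is None else out + bias
```

Materializing these shifted slices is one plausible source of the extra memory mentioned above.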

But I think this would still be worth it! I'll start thinking about it and see what I can do.

EDIT: there is also the MoE part of Jamba, which is new compared to mamba.py.
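For readers unfamiliar with that part: Jamba replaces some of the MLP layers with a routed mixture-of-experts, where a small router picks the top-k experts per token and mixes their outputs. A minimal, illustrative top-k MoE layer in PyTorch (a hypothetical class, not AI21's modeling code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoE(nn.Module):
    """Minimal top-k routed mixture-of-experts MLP (illustrative only)."""

    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):
        B, L, D = x.shape
        flat = x.reshape(-1, D)                                   # route per token
        weights, idx = torch.topk(self.router(flat), self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)                      # mix only the chosen experts
        out = torch.zeros_like(flat)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (idx == e).nonzero(as_tuple=True)   # tokens routed to expert e
            if token_ids.numel() > 0:
                out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(flat[token_ids])
        return out.reshape(B, L, D)
```

Real MoE training setups usually also add a load-balancing auxiliary loss; this sketch only shows the routing idea.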

severian42 commented 6 months ago

Thank you so much for your input! I truly appreciate it, as this area is out of my wheelhouse, but I'm trying to learn as much as possible.

I thought it might be hard with the current MLX capabilities, but it seemed like most of it could be implemented from Jamba's version of the MambaBlock.

The MoE does muddy the water a bit; it also threw me for a loop compared to the normal MoE implementation (unless I was overthinking it).

Thank you for being willing to take a look and see what's possible. You seem to have such a great grasp on Mamba. I'll keep messing around on my end and see if I can get any further.

severian42 commented 6 months ago

Just wanted to say THANK YOU so much for tackling Jamba. I've been trying on my own with horrible results, haha. I really appreciate the hard work you are putting in to get it working. I have a lot of faith in this model and its potential to harness MLX like no other.

Let me know if I can 'Buy you a Coffee' or something!

alxndrTL commented 6 months ago

Thank you for your encouraging message! FYI, I'm almost done with a simple implementation of Jamba in PyTorch (just like in the mamba.py file). Then I will tackle the PyTorch -> MLX conversion, which shouldn't be very hard.

That is really nice of you, but I'm OK for now! (You can follow the progress in the jamba branch.)
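Not the actual code in the jamba branch, but to give a rough idea of the overall shape: a Jamba-style model interleaves Mamba layers with occasional attention layers, and swaps some of the MLPs for MoE layers. A placeholder sketch of that stacking pattern (the mixer/MLP factories are hypothetical, and the ratios are just examples):

```python
import torch.nn as nn

class HybridLayer(nn.Module):
    """One pre-norm residual layer: a mixer (Mamba or attention) followed by an MLP or MoE."""

    def __init__(self, d_model, mixer, mlp):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.mixer, self.mlp = mixer, mlp

    def forward(self, x):
        x = x + self.mixer(self.norm1(x))
        return x + self.mlp(self.norm2(x))

def build_hybrid_stack(d_model, n_layers, mamba_layer, attn_layer, moe_layer,
                       attn_every=8, moe_every=2):
    """mamba_layer / attn_layer / moe_layer are factories mapping d_model -> nn.Module."""
    layers = []
    for i in range(n_layers):
        # An attention mixer every `attn_every` layers, Mamba everywhere else.
        mixer = attn_layer(d_model) if i % attn_every == attn_every - 1 else mamba_layer(d_model)
        # An MoE feed-forward every `moe_every` layers, a plain MLP otherwise.
        mlp = moe_layer(d_model) if i % moe_every == 1 else nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.SiLU(), nn.Linear(4 * d_model, d_model))
        layers.append(HybridLayer(d_model, mixer, mlp))
    return nn.Sequential(*layers)
```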