Are you planning to add support for Mistral?
Closed philschmid closed 10 months ago
We'd love to, but it requires a slight change (sliding window attention). We can have a look.
As it's a rather small model, I'm not sure whether we should prioritize Mistral or Falcon 180B first. What do you think?
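For context, the "slight change" is that Mistral restricts each token to attend only to the previous W positions instead of the full causal prefix. A toy sketch of the resulting attention mask (function and parameter names are illustrative, not from Megatron-LLM):

```python
def sliding_window_mask(seq_len, window):
    """Build a boolean mask where mask[i][j] is True iff token i
    may attend to token j: causal (j <= i) and within the last
    `window` positions (i - j < window)."""
    return [[j <= i and i - j < window for j in range(seq_len)]
            for i in range(seq_len)]

# With window=3, token 4 attends only to tokens 2, 3, and 4.
mask = sliding_window_mask(6, 3)
```

With window >= seq_len this reduces to the ordinary causal mask, which is why short sequences are unaffected by the change.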
We are seeing a lot of interest from the HF community in training Mistral, even though it is 7B. The question, I guess, is whether epfLLM would improve fine-tuning/continual pretraining of the model, e.g. make it faster and more efficient.
If not, Falcon 180B is probably the right priority.
+1
A potential starting point could be this Mistral implementation: https://github.com/PygmalionAI/aphrodite-engine/blob/12e296b55675d5784acb69d736189ae0a9ca40a8/aphrodite/modeling/models/mistral.py
I tried to add a preliminary Mistral implementation here (https://github.com/epfLLM/Megatron-LLM/pull/88#issue-1988719134). It currently relies on the latest version of FlashAttention for windowed attention, although window attention will only be used when seq len > 4096, which I currently don't have enough memory to test. Feel free to give it a try / test it!
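One cheap sanity check that doesn't need long sequences or much memory: for seq_len <= window, windowed attention must reduce exactly to full causal attention, so the two code paths can be compared on tiny inputs. A toy pure-Python version of that equivalence (scalar head dim; this is a sketch of the check, not the PR's code):

```python
import math

def attn(q, k, v, window=None):
    """Toy causal attention with scalar queries/keys/values.
    If `window` is given, token i only attends to the last
    `window` positions (sliding window attention)."""
    out = []
    for i in range(len(q)):
        lo = 0 if window is None else max(0, i - window + 1)
        scores = [q[i] * k[j] for j in range(lo, i + 1)]
        m = max(scores)                       # stabilize softmax
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        out.append(sum(w * v[lo + j] for j, w in enumerate(exps)) / z)
    return out

q = [0.1, 0.5, -0.2, 0.3]
k = [0.2, -0.1, 0.4, 0.0]
v = [1.0, 2.0, 3.0, 4.0]

# Window larger than the sequence: identical to full causal attention.
assert attn(q, k, v, window=8) == attn(q, k, v)
# Window smaller than the sequence: outputs diverge for later tokens.
assert attn(q, k, v, window=2)[3] != attn(q, k, v)[3]
```

The same idea applies at full scale: run the model with and without the windowed FlashAttention path on sequences shorter than 4096 and check the outputs match bit-for-bit (or within fp16 tolerance).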
thank you so much, this looks great.
@AleHD , @kylematoba , @mkrima , @mpagli could one of you have a look at the PR #88 ?
taking a look
Closed by #88 and #90.