EricLBuehler / mistral.rs

Blazingly fast LLM inference.

Phi-3.5-MoE-instruct unsupported #760

Closed: erikpro008 closed this issue 1 week ago

erikpro008 commented 1 week ago

Describe the bug

Running this in the terminal: "./mistralrs-server --isq Q4K -i plain -m microsoft/Phi-3.5-MoE-instruct -a phi3.5moe"

I am unable to run this because it runs out of memory: it uses more than the 64 GB of RAM on my machine, which makes it crash while processing. I thought the ISQ feature would mean I don't need that much memory, but that seems to be misleading. Why does it require so much RAM?

Is there a way to fix this and make it work? Otherwise I might have to wait for llama.cpp support.

I am using a MacBook Pro M1 Max with 64 GB of RAM.

Latest commit or version


Latest version that is currently available.

[screenshots attached]
erikpro008 commented 1 week ago

It seems that if you have enough free disk space (for swap) it will work, but always requiring this much memory makes it harder to use.
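For reference, one way to confirm that the load is spilling into swap on macOS is to watch the built-in memory counters while the server starts up. This is just a sketch using standard macOS commands, not something from the mistral.rs docs:

```bash
# Watch swap and memory pressure while mistralrs-server loads the model.
sysctl vm.swapusage   # total / used / free swap
vm_stat               # page-level memory statistics
```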

EricLBuehler commented 1 week ago

> I am unable to run this because it runs out of memory: it uses more than the 64 GB of RAM on my machine, which makes it crash while processing. I thought the ISQ feature would mean I don't need that much memory, but that seems to be misleading. Why does it require so much RAM?

@erikpro008 it loads the model onto the CPU first, and then applies ISQ directly onto the GPU (or CPU). It seems like you are using swap space now, which should work. Please let me know if you have any further questions!
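A rough back-of-envelope calculation illustrates this. The figures below are assumptions (roughly 42B total parameters for Phi-3.5-MoE-instruct, 2 bytes per weight for the initial f16 load, about 4.5 bits per weight after Q4K ISQ), not exact numbers from the repository:

```bash
# Assumed figures: ~42e9 parameters, f16 = 2 bytes/weight, Q4K ≈ 4.5 bits/weight.
# The pre-ISQ f16 load is what exceeds 64 GB of RAM and spills into swap.
python3 -c 'p = 42e9; print(f"f16 load ≈ {p*2/1e9:.0f} GB, after Q4K ISQ ≈ {p*4.5/8/1e9:.0f} GB")'
```

Under those assumptions the initial load is on the order of 84 GB, which does not fit in 64 GB of RAM and therefore spills into swap, while the roughly 24 GB quantized model fits comfortably once ISQ has run. That also matches the earlier observation that having enough free disk space makes it work.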