Alpha-VLLM / LLaMA2-Accessory

An Open-source Toolkit for LLM Development
https://llama2-accessory.readthedocs.io/

Can a fully fine-tuned Mixtral-8x7B model be converted to Hugging Face format for inference with the transformers API? #139

Open hegang1-tal opened 6 months ago

ChrisLiu6 commented 6 months ago

Theoretically, the answer is yes, but we have yet to write the format conversion scripts. Contributions are welcome.

On the other hand, the MetaModel class in LLaMA2-Accessory already implements most of the functionality needed for inference and evaluation, e.g. the generate and evaluate_examples methods. If your concern is that LLaMA2-Accessory launches multiple processes for distributed inference while your original inference code was written for a single-process, multi-GPU setting, you may also consider the MultiGpuWrapper class, which supports exactly that. Overall, it should be easy to modify code that currently works with transformers.AutoModelForCausalLM so that it works with LLaMA2-Accessory instead.
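As a rough sketch of the replacement ChrisLiu6 describes, the snippet below swaps transformers.AutoModelForCausalLM loading for LLaMA2-Accessory's MetaModel. Only the names MetaModel, generate, evaluate_examples, and MultiGpuWrapper come from the comment above; the import path, the from_pretrained usage, and every argument name are assumptions that should be checked against the current LLaMA2-Accessory source and docs.

```python
# Minimal sketch, not verbatim from the repo: the import path and all
# argument names below are assumptions to verify against LLaMA2-Accessory.
from accessory.model.meta import MetaModel  # assumed module path

# Assumed loading call; in the transformers version this would be
# AutoModelForCausalLM.from_pretrained(...).
model = MetaModel.from_pretrained(
    "/path/to/finetuned/mixtral-8x7b",  # assumed: directory of the fine-tuned checkpoint
    max_seq_len=2048,                   # assumed argument name
)

# `generate` exists per the comment above; this particular signature is
# illustrative only.
outputs = model.generate(
    ["Explain mixture-of-experts routing in one sentence."],
    max_gen_len=128,    # assumed argument name
    temperature=0.7,    # assumed argument name
)
print(outputs[0])
```

For code that was written for a single process driving multiple GPUs, the same calls would go through the MultiGpuWrapper class mentioned above instead of launching one process per GPU.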

kumagai6 commented 3 months ago

I would like to try creating a conversion script. Are there any points I should be aware of? Will the conversion code for a model created with mixtral_sparse differ from one created with mixtral?

ChrisLiu6 commented 3 months ago

> I would like to try creating a conversion script. Are there any points I should be aware of? Will the conversion code for a model created with mixtral_sparse differ from one created with mixtral?

They will be different. Say you use 8-way model parallelism: the base implementation has each rank hold one full expert, while the sparse implementation has each rank hold 1/8 of each of the 8 experts. The parameters are therefore named and organized differently in the two implementations, so the conversion logic has to differ as well.
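To make the difference concrete, here is a minimal sketch (assumptions only, not the repo's actual conversion script or its real parameter names) of how merging the 8 model-parallel checkpoint shards into a single state dict would differ between the two layouts described above:

```python
# Sketch under assumed checkpoint layouts; key names, split dimensions,
# and the shard structure are all hypothetical.
import torch

def merge_base_impl(shards):
    """Base (non-sparse) layout: with 8-way model parallelism, rank i holds
    the full weights of expert i, so merging is mostly a renaming exercise."""
    merged = {}
    for rank, shard in enumerate(shards):
        for name, tensor in shard.items():
            # hypothetical key scheme: rank i's local expert becomes expert i
            merged[name.replace("expert.", f"experts.{rank}.")] = tensor
    return merged

def merge_sparse_impl(shards, dim=0):
    """Sparse layout: every rank holds a 1/8 slice of every expert, so each
    expert weight must be concatenated across all ranks along its
    tensor-parallel dimension (`dim` depends on the layer; 0 is a placeholder)."""
    merged = {}
    for name in shards[0]:
        merged[name] = torch.cat([shard[name] for shard in shards], dim=dim)
    return merged
```

In other words, the base layout needs per-rank renaming while the sparse layout needs per-tensor concatenation, which is why a single conversion script cannot cover both without branching on the source implementation.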