Alpha-VLLM / LLaMA2-Accessory

An Open-source Toolkit for LLM Development
https://llama2-accessory.readthedocs.io/

Can a fully fine-tuned Mixtral-8x7B model be converted to Hugging Face format for inference with the transformers API? #139

Open hegang1-tal opened 6 months ago

ChrisLiu6 commented 6 months ago

Theoretically, the answer is yes, but we have yet to write the format conversion scripts. Contributions are welcome.

On the other hand, the MetaModel class in LLaMA2-Accessory already implements most of the functionality needed for inference and evaluation, e.g. the generate and evaluate_examples methods. If your concern is that LLaMA2-Accessory launches multiple processes for distributed inference while your original inference code was written for a single-process, multi-GPU setting, you may also consider the MultiGpuWrapper class, which supports exactly that. Overall, it should be easy to modify code that currently works with transformers.AutoModelForCausalLM so that it works with LLaMA2-Accessory instead.
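As a rough sketch of the replacement ChrisLiu6 describes, the snippet below swaps transformers.AutoModelForCausalLM loading for LLaMA2-Accessory's MetaModel. Only the names MetaModel, generate, evaluate_examples, and MultiGpuWrapper come from the comment above; the import path, the from_pretrained usage, and every argument name are assumptions that should be checked against the current LLaMA2-Accessory source and docs.

```python
# Minimal sketch, not verbatim from the repo: the import path and all
# argument names below are assumptions to verify against LLaMA2-Accessory.
from accessory.model.meta import MetaModel  # assumed module path

# Assumed loading call; in the transformers version this would be
# AutoModelForCausalLM.from_pretrained(...).
model = MetaModel.from_pretrained(
    "/path/to/finetuned/mixtral-8x7b",  # assumed: directory of the fine-tuned checkpoint
    max_seq_len=2048,                   # assumed argument name
)

# `generate` exists per the comment above; this particular signature is
# illustrative only.
outputs = model.generate(
    ["Explain mixture-of-experts routing in one sentence."],
    max_gen_len=128,    # assumed argument name
    temperature=0.7,    # assumed argument name
)
print(outputs[0])
```

For code that was written for a single process driving multiple GPUs, the same calls would go through the MultiGpuWrapper class mentioned above instead of launching one process per GPU.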

kumagai6 commented 3 months ago

I would like to try creating a conversion script. Are there any points I should be aware of? Will the conversion code for a model created with mixtral_sparse differ from one created with mixtral?

ChrisLiu6 commented 3 months ago

> I would like to try creating a conversion script. Are there any points I should be aware of? Will the conversion code for a model created with mixtral_sparse differ from one created with mixtral?

They will be different. Say you use 8-way model parallelism: the base implementation has each rank hold one full expert, while the sparse implementation has each rank hold 1/8 of each of the 8 experts. The parameters are therefore named and organized differently in the two implementations, so the conversion logic has to differ as well.
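To make the difference concrete, here is a minimal sketch (assumptions only, not the repo's actual conversion script or its real parameter names) of how merging the 8 model-parallel checkpoint shards into a single state dict would differ between the two layouts described above:

```python
# Sketch under assumed checkpoint layouts; key names, split dimensions,
# and the shard structure are all hypothetical.
import torch

def merge_base_impl(shards):
    """Base (non-sparse) layout: with 8-way model parallelism, rank i holds
    the full weights of expert i, so merging is mostly a renaming exercise."""
    merged = {}
    for rank, shard in enumerate(shards):
        for name, tensor in shard.items():
            # hypothetical key scheme: rank i's local expert becomes expert i
            merged[name.replace("expert.", f"experts.{rank}.")] = tensor
    return merged

def merge_sparse_impl(shards, dim=0):
    """Sparse layout: every rank holds a 1/8 slice of every expert, so each
    expert weight must be concatenated across all ranks along its
    tensor-parallel dimension (`dim` depends on the layer; 0 is a placeholder)."""
    merged = {}
    for name in shards[0]:
        merged[name] = torch.cat([shard[name] for shard in shards], dim=dim)
    return merged
```

In other words, the base layout needs per-rank renaming while the sparse layout needs per-tensor concatenation, which is why a single conversion script cannot cover both without branching on the source implementation.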