hegang1-tal opened this issue 10 months ago
I would like to try creating a conversion script. Are there any points I should be aware of? Will the conversion code for something created with mixtral_sparse be different from something created with mixtral?
> I would like to try creating a conversion script. Are there any points I should be aware of? Will the conversion code for something created with mixtral_sparse be different from something created with mixtral?
They will be different. Say you use 8-way model parallelism: the base implementation has each rank hold one full expert, while the sparse implementation has each rank hold a 1/8 slice of each of the 8 experts. The parameters are therefore named and organized differently in the two implementations, so the conversion logic has to differ as well.
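To make the difference concrete, here is a hedged sketch of what regrouping sparse shards back into per-expert (base-layout) checkpoints could look like. The parameter key format, the slicing dimension, and the helper name are all illustrative assumptions, not the actual names used by LLaMA2-Accessory's checkpoints:

```python
# Hypothetical sketch only -- the real parameter names and slicing scheme
# in LLaMA2-Accessory may differ; treat every key below as an assumption.
import torch

NUM_EXPERTS = 8
MP_SIZE = 8  # 8-way model parallelism, as in the example above

def sparse_shards_to_base(shards: list[dict], key_fmt: str) -> list[dict]:
    """Regroup MP_SIZE sparse shards into per-expert base-layout checkpoints.

    `shards[r]` is rank r's state dict from the sparse implementation, where
    each rank holds a 1/MP_SIZE slice of every expert. `key_fmt` is a format
    string such as "layers.0.feed_forward.experts.{e}.w1.weight" (a made-up
    name for illustration). Concatenating the slices reconstructs the full
    weight of each expert, which then goes into the checkpoint of the single
    rank that owns that expert in the base layout.
    """
    base_ckpts = [dict() for _ in range(MP_SIZE)]
    for e in range(NUM_EXPERTS):
        key = key_fmt.format(e=e)
        # Assumption: the sparse implementation slices along dim 0.
        full = torch.cat([shards[r][key] for r in range(MP_SIZE)], dim=0)
        base_ckpts[e][key] = full  # in the base layout, expert e lives on rank e
    return base_ckpts
```

A real conversion script would also need to handle the non-expert parameters (attention, norms, embeddings), which are typically sharded the same way in both layouts.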
Theoretically, the answer is yes, but we have yet to write the format conversion scripts. Contributions are welcome.
On the other hand, the `MetaModel` class in LLaMA2-Accessory already implements most of the functionality needed for inference and evaluation, e.g. the `generate` and `evaluate_examples` methods. If you are worried that LLaMA2-Accessory requires launching multiple processes for distributed inference while your original inference code was written for a single-process multi-GPU setting, you may also consider the `MultiGpuWrapper` class, which supports exactly that. Overall, it should be easy to adapt code that currently works with `transformers.AutoModelForCausalLM` to work with LLaMA2-Accessory.
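For illustration, a minimal sketch of such an adaptation might look like the following. The constructor arguments and generation keywords here are assumptions and should be checked against the LLaMA2-Accessory source; only the class and method names come from the description above:

```python
# Minimal sketch, not tested against the current LLaMA2-Accessory API;
# argument names below are assumptions, not the confirmed signature.
from accessory.model.meta import MetaModel

# Before: model = transformers.AutoModelForCausalLM.from_pretrained(path)
model = MetaModel.from_pretrained(
    pretrained_path="/path/to/checkpoint",  # hypothetical path
    max_seq_len=2048,                       # assumed keyword
    with_visual=False,                      # assumed keyword
)

# MetaModel.generate plays the role of transformers' model.generate;
# the exact keyword names here are assumptions.
prompts = ["Tell me about Mixtral."]
outputs = model.generate(prompts, max_gen_len=256, temperature=0.0)
print(outputs[0])
```

If your original code must stay single-process, the `MultiGpuWrapper` mentioned above would wrap the model instead, so the surrounding inference loop can remain unchanged.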