Note on multi-instance inference: in vLLM inference, the number of attention heads must be divisible by the vLLM tensor parallel size. For an LLM with 14 attention heads, the viable options for tp are 1 and 2 (7 causes another division issue, though I forget exactly what it is). Say we have 8 GPUs; to utilize all of these devices, multi-instance vLLM inference is necessary (tp=1 -> 8 instances, tp=2 -> 4 instances). The same applies to reward model inference and any other inference pipelines.
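For concreteness, here is a minimal sketch of that arithmetic (`tp_plan` is a hypothetical helper, not part of LMFlow or vLLM):

```python
def tp_plan(num_attn_heads: int, num_gpus: int) -> dict[int, int]:
    """Map each viable tensor parallel size to the number of vLLM
    instances needed to occupy all GPUs.

    A tp size is taken as viable here only if it divides both the number
    of attention heads (vLLM requirement) and the number of GPUs, so the
    instances fill the devices evenly.
    """
    return {
        tp: num_gpus // tp
        for tp in range(1, num_gpus + 1)
        if num_attn_heads % tp == 0 and num_gpus % tp == 0
    }

# For a 14-head model on 8 GPUs this prints {1: 8, 2: 4}.
# (tp=7 divides the head count but not the GPU count, and it also hits a
# separate division issue in practice, as noted above.)
print(tp_plan(num_attn_heads=14, num_gpus=8))
```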
This document lists the features on LMFlow's roadmap. We welcome discussion of, and contributions to, specific features in the related Issues/PRs. 🤗
Main Features
Usability
- `vllm` package optional (one possible import-guard pattern is sketched below this list)
- `hf_model_mixin`
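A common way to make a dependency optional is an import guard that only raises when the backend is actually requested. A minimal sketch, with illustrative names rather than LMFlow's actual module layout:

```python
# Sketch of an optional-dependency guard; names are illustrative,
# not LMFlow's actual code.
try:
    from vllm import LLM, SamplingParams
    _VLLM_AVAILABLE = True
except ImportError:
    _VLLM_AVAILABLE = False


def vllm_generate(model_name: str, prompts: list[str]) -> list[str]:
    """Generate with vLLM, erroring only if vLLM is actually requested."""
    if not _VLLM_AVAILABLE:
        raise ImportError(
            "vllm is not installed; install it with `pip install vllm` "
            "or use the HF backend instead."
        )
    llm = LLM(model=model_name)
    outputs = llm.generate(prompts, SamplingParams(max_tokens=64))
    return [out.outputs[0].text for out in outputs]
```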
Bug fixes
- `model.generate()` with dsz3 (DeepSpeed ZeRO-3) #861
- `merge_lora`: LoRA merging with an absolute adapter path (a PEFT-based sketch follows this list)
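For reference, merging a LoRA adapter given by an absolute path can be sketched with PEFT's public API; the paths and model name below are placeholders, and this is not necessarily how LMFlow's `merge_lora` script works:

```python
# Sketch of merging LoRA weights into a base model with PEFT; paths and
# model names are placeholders.
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("base-model-name")
# An absolute adapter path like this is the case the bullet above refers to.
model = PeftModel.from_pretrained(base, "/abs/path/to/lora-adapter")
merged = model.merge_and_unload()  # folds LoRA deltas into the base weights
merged.save_pretrained("/abs/path/to/merged-model")
```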
Issues left over from history
- `use_accelerator` -> `use_accelerate` typo fix (with Accelerate support PR)
- `model_args.use_lora` leads to truncation of the sequence, mentioned in #867

Documentation