-
**Is your feature request related to a problem? Please describe.**
Activation prefetch features to enlarge the batch size on mid-sized (100B~1T) models
- From the DeepSpeedExamples repo, GPU throughput…
-
Scaling models requires that they be trained in data-parallel, pipeline-parallel, or tensor-parallel regimes. The last two, both forms of "model parallel", require a single model to be shared across GPUs. Thi…
-
**Describe the bug**
I am trying to do batch inference, so the inputs need padding. When using `replace_with_kernel_inject=True`, the engine output is incorrect. Setting `replace_with_kernel_inject…
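
For context, here is a minimal sketch of the kind of setup being described, assuming a Hugging Face causal LM; the model name, padding configuration, and prompts are illustrative, not taken from the original report:

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-1.3b"  # placeholder model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# Kernel-injection path that reportedly gives wrong outputs on padded batches;
# the workaround implied above is replace_with_kernel_inject=False.
engine = deepspeed.init_inference(
    model,
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)

prompts = ["a short prompt", "a much longer prompt that forces padding in the batch"]
batch = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")
outputs = engine.module.generate(**batch, max_new_tokens=32)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```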
-
### Motivation.
As a continuation of #5367 - since that merge request was rejected and I have to maintain my own fork to support this scenario, I suggest we add support in vLLM for model architec…
-
The default DeepSpeed config for config_block_10B.json is ZeRO-2; when I change it to ZeRO-3, I get a mismatch error. Is there a way to use ZeRO-3 (loading params with CPU offload)?
In addition, if I only have…
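
For reference, a minimal sketch of a ZeRO-3 configuration with CPU parameter offload, expressed as a Python dict; only the `zero_optimization` block is the part relevant to the question, and the other values are placeholders rather than the contents of config_block_10B.json:

```python
import deepspeed  # assumed available in the training environment

# Hypothetical ZeRO-3 config: stage 3 with parameters and optimizer state
# offloaded to CPU memory.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "bf16": {"enabled": True},
}

# `model` is assumed to be constructed by the surrounding training script:
# model_engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config
# )
```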
-
Hi, recently I'd like to fine-tune bloom-7b1 with ds-chat using full model parameters, but I find it does not have any support for pipeline parallelism. Do we have any plans for supporting pipeli…
-
Hi, I want to run one LLM model across multiple machines.
Within a node, I want to use tensor parallelism for speedup.
Across nodes, I want to use pipeline parallelism.
Is this supported? If s…
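
One way this layout is commonly expressed, assuming vLLM as the serving framework (the framework choice, model name, parallel sizes, and Ray backend are illustrative assumptions, not from the question):

```python
from vllm import LLM, SamplingParams

# Hypothetical 2-node setup: 8-way tensor parallelism inside each node,
# 2-way pipeline parallelism across the nodes, coordinated via a Ray cluster.
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",   # placeholder model
    tensor_parallel_size=8,
    pipeline_parallel_size=2,
    distributed_executor_backend="ray",
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```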
-
DeepSpeed Chat uses tensor parallelism via the hybrid engine to generate sequences in stage-3 training.
I wonder if just using ZeRO-3 inference for generation is OK, so that we don't need to transform model pa…
-
I want to use TE's comm-gemm-overlap module for multi-node training; however, the README says this module only supports a single node. Does TE have any plans for multi-node support? And what effort…
-
`llama.onnx` is primarily intended for understanding LLMs and converting them to run on NPUs.
If you are looking for inference on NVIDIA GPUs, we have released lmdeploy at https://github.com/InternLM/lmdeploy.
…