-
Hi, I'd like to fine-tune bloom-7b1 with ds-chat using full model parameters, but I find it does not have any support for pipeline parallelism. Do we have any plans on supporting pipeli…
-
Hi, I want to run one LLM model across multiple machines.
Within a single node, I want to use tensor parallelism for speedup.
Across nodes, I want to use pipeline parallelism.
Is this supported? If s…
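(For illustration only, since the target framework isn't visible above: a minimal sketch of how this layout is commonly expressed, assuming a vLLM-style API where `tensor_parallel_size` covers the GPUs inside one node and `pipeline_parallel_size` spans the nodes over a Ray cluster. Model name and sizes are placeholders.)

```python
# Hypothetical sketch: 2 nodes with 4 GPUs each.
# tensor_parallel_size shards each layer across the 4 GPUs of one node;
# pipeline_parallel_size splits the layer stack across the 2 nodes.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-13b-hf",   # placeholder model
    tensor_parallel_size=4,               # intra-node tensor parallelism
    pipeline_parallel_size=2,             # inter-node pipeline parallelism
    distributed_executor_backend="ray",   # multi-node execution via a Ray cluster
)

outputs = llm.generate(["Hello, world"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```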
-
Hello!
I'm currently studying FlashAttention v2 and noticed that when copying from global memory to shared memory, the entire HeadDim (the K dimension in MNK tiling) needs to be copied to shared m…
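To make the data layout concrete, here is a small NumPy sketch of the tiled online-softmax computation (not the real CUDA kernel, and with a simplified loop order compared to FA2): the blocking is only along the sequence length, so every Q/K/V tile spans the full HeadDim, which is why a whole (block_size, HeadDim) slice ends up in shared memory per tile.

```python
import numpy as np

def tiled_attention(Q, K, V, Br=16, Bc=16):
    """FlashAttention-style tiled attention for one head (no masking/dropout).

    Tiles are blocked only along the sequence length; every tile keeps the
    full HeadDim, i.e. K/V tiles are (Bc, HeadDim) and Q tiles are (Br, HeadDim).
    """
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(Q, dtype=np.float64)
    m = np.full(N, -np.inf)   # running row-wise max of the scores
    l = np.zeros(N)           # running row-wise softmax denominator

    for j in range(0, N, Bc):
        Kj = K[j:j + Bc]      # (Bc, d): the whole HeadDim is loaded for this tile
        Vj = V[j:j + Bc]      # (Bc, d)
        for i in range(0, N, Br):
            Qi = Q[i:i + Br]                       # (Br, d)
            S = Qi @ Kj.T * scale                  # (Br, Bc) score tile
            m_new = np.maximum(m[i:i + Br], S.max(axis=1))
            P = np.exp(S - m_new[:, None])
            corr = np.exp(m[i:i + Br] - m_new)     # rescale old accumulators
            l[i:i + Br] = l[i:i + Br] * corr + P.sum(axis=1)
            O[i:i + Br] = O[i:i + Br] * corr[:, None] + P @ Vj
            m[i:i + Br] = m_new

    return O / l[:, None]

# Quick check against the dense reference implementation:
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((64, 128)) for _ in range(3))
S = Q @ K.T / np.sqrt(128)
P = np.exp(S - S.max(axis=1, keepdims=True))
assert np.allclose(tiled_attention(Q, K, V), (P / P.sum(axis=1, keepdims=True)) @ V)
```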
-
### System Info
```shell
using Huggingface AMI from AWS marketplace with Ubuntu 22.04
optimum-neuron 0.0.25
transformers 4.45.2
peft 0.13.0
trl 0.11.4
accelerate 0.29.2
torch 2.1.2
```
…
-
**Describe the bug**
I am trying to do batch inference, so the inputs need padding. When using `replace_with_kernel_inject=True`, the engine output is incorrect. Setting `replace_with_kernel_inject…
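For context, a minimal repro sketch along these lines (model name, prompts, and dtype are placeholders rather than the exact setup from this report):

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"  # left padding for decoder-only generation

model = AutoModelForCausalLM.from_pretrained(model_name)

# Kernel injection on; with replace_with_kernel_inject=False the padded-batch
# outputs come back correct.
engine = deepspeed.init_inference(
    model,
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)
model = engine.module

prompts = [
    "Hello, my name is",
    "DeepSpeed is a deep learning optimization library that",
]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```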
-
`llama.onnx` is primarily meant for understanding LLMs and converting them to run on NPUs.
If you are looking for inference on NVIDIA GPUs, we have released lmdeploy at https://github.com/InternLM/lmdeploy.
…
-
### Describe the feature
**Problem**
The intrahost [microbenchmarking CLI tool](https://colossalai.org/docs/basics/command_line_tool/#tensor-parallel-micro-benchmarking) executes the "None" (DDP) st…
-
### Motivation.
As a continuation of #5367 - since that merge request was rejected and I now have to maintain my own fork to support this scenario, I suggest adding support in vLLM for model architec…
-
DeepSpeed Chat uses tensor parallelism via the hybrid engine to generate sequences in stage 3 training.
I wonder if just using ZeRO-3 inference for generation would be OK, so that we don't need to transform model pa…
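For reference, a minimal sketch (hypothetical config values and model name) of what "just ZeRO-3 inference for generation" would look like: parameters stay partitioned and are gathered layer by layer during the forward passes inside `generate()`, instead of being reassembled into tensor-parallel shards by the hybrid engine.

```python
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 3},
}

model_name = "bigscience/bloom-7b1"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

engine, *_ = deepspeed.initialize(model=model, config=ds_config)

inputs = tokenizer("Tell me a story:", return_tensors="pt").to(engine.device)
# synced_gpus=True keeps all ranks in lock-step, which ZeRO-3 needs because
# every rank must participate in gathering each layer's parameters.
outputs = engine.module.generate(**inputs, max_new_tokens=64, synced_gpus=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```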
-
Hello,
I am encountering an issue while testing FlexFlow's LLM module. Below is the code I am using:
```python
import flexflow.serve as ff
import time

ff.init(
    num_gpus=1,
    memory_per_gpu=2200…
```