-
### Feature request
The current approach to tensor parallelism from #5 is not latency-optimized. We make an allgather call for every adapter, which will be quite slow when there are many adapters. Additionally,…
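For illustration only, here is a hedged sketch of the difference being described: one collective per adapter versus a single fused `all_gather` over concatenated adapter shards. All names (`gather_per_adapter`, `gather_fused`, `adapter_shards`) are hypothetical and not the repo's actual code; it assumes column-sharded adapter weights.

```
import torch
import torch.distributed as dist

def gather_per_adapter(adapter_shards):
    """One collective per adapter: latency scales with the number of adapters."""
    world = dist.get_world_size()
    gathered = []
    for shard in adapter_shards:  # shard: this rank's slice of one adapter's weights
        out = [torch.empty_like(shard) for _ in range(world)]
        dist.all_gather(out, shard)
        gathered.append(torch.cat(out, dim=-1))
    return gathered

def gather_fused(adapter_shards):
    """Single collective: concatenate all shards, all_gather once, then split back."""
    world = dist.get_world_size()
    sizes = [s.numel() for s in adapter_shards]
    flat = torch.cat([s.reshape(-1) for s in adapter_shards])
    out = [torch.empty_like(flat) for _ in range(world)]
    dist.all_gather(out, flat)
    # Split each rank's flat buffer back into per-adapter pieces and reassemble.
    per_rank = [list(torch.split(o, sizes)) for o in out]
    return [
        torch.cat(
            [per_rank[r][i].reshape(adapter_shards[i].shape) for r in range(world)],
            dim=-1,
        )
        for i in range(len(adapter_shards))
    ]
```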
-
## 🐞Describing the bug
Hello. I'm trying to convert a PyTorch model to a stateful Core ML model.
I wrote this code referring to the [WWDC 2024 session Mistral-7B model](https://github.com/huggingface/swift-t…
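For reference, here is a minimal stateful-conversion sketch of the pattern I understand coremltools 8 to support: register the mutable tensor as a buffer, mutate it in place in `forward`, and declare it via `ct.StateType` at conversion time. The toy accumulator below is an assumption for illustration, not the Mistral-7B code.

```
import numpy as np
import torch
import coremltools as ct

class Accumulator(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # The state must be a registered buffer that is updated in place.
        self.register_buffer("accumulator", torch.tensor(np.array([0], dtype=np.float16)))

    def forward(self, x):
        self.accumulator += x
        return self.accumulator * x

model = Accumulator().eval()
traced = torch.jit.trace(model, torch.tensor([1.0], dtype=torch.float16))

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(shape=(1,), dtype=np.float16, name="x")],
    # The StateType name must match the registered buffer name.
    states=[ct.StateType(wrapped_type=ct.TensorType(shape=(1,), dtype=np.float16),
                         name="accumulator")],
    minimum_deployment_target=ct.target.iOS18,
)
```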
-
Hi, I've recently been wanting to fine-tune bloom-7b1 with ds-chat using full model parameters, but I found it does not have any support for pipeline parallelism. Do we have any plans for supporting pipeli…
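For context on what such support would involve, here is a hedged sketch of DeepSpeed's generic pipeline-parallel API (`PipelineModule`), which is separate from DeepSpeed-Chat: the model has to be expressed as a flat sequence of layers so stages can be split across ranks. The layer list and config path below are placeholders.

```
import torch
import deepspeed
from deepspeed.pipe import PipelineModule

deepspeed.init_distributed()  # assumes launching with the deepspeed launcher

# Stand-in layers; a real model would list its transformer blocks here.
layers = [torch.nn.Linear(1024, 1024) for _ in range(24)]
model = PipelineModule(layers=layers, num_stages=4, loss_fn=torch.nn.MSELoss())

engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=[p for p in model.parameters() if p.requires_grad],
    config="ds_config.json",  # placeholder; needs batch size and optimizer settings
)
# Training then goes through engine.train_batch(data_iter) rather than a manual loop.
```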
-
### 🐛 Describe the bug
I am trying to use FSDP, but for some reason there is an error when I call `model.generate()`. MWE below:
```
import torch
import os
from omegaconf import DictConfig
from tra…
-
**Is your feature request related to a problem? Please describe.**
Activation prefetch features to enlarge the batch size on middle-sized (100B~1T) models
- From DeepSpeedExamples repo, GPU throughput…
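As background on the mechanism being requested, here is a hedged sketch using PyTorch's built-in `torch.autograd.graph.save_on_cpu` hook, which offloads saved activations to pinned CPU memory; a prefetch feature would extend this idea by copying activations back ahead of when backward needs them. The toy model and sizes are placeholders, and a CUDA device is assumed.

```
import torch
from torch.autograd.graph import save_on_cpu

model = torch.nn.Sequential(*[torch.nn.Linear(4096, 4096) for _ in range(8)]).cuda()
x = torch.randn(32, 4096, device="cuda", requires_grad=True)

# Activations saved for backward live on (pinned) CPU memory instead of GPU memory,
# which is what allows a larger batch size.
with save_on_cpu(pin_memory=True):
    loss = model(x).sum()

# Tensors are copied back on demand during backward; there is no prefetch/overlap here.
loss.backward()
```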
-
DeepSpeed Chat uses tensor parallelism via the hybrid engine to generate sequences in stage-3 training.
I wonder if just using ZeRO-3 inference for generation is OK? That way we don't need to transform the model pa…
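For discussion, here is a hedged, self-contained sketch of generating directly from a ZeRO-3 engine without the hybrid engine's parameter re-sharding; the model name and config values are placeholders. Parameters stay partitioned and are gathered layer by layer during forward, and `synced_gpus=True` keeps every rank decoding in lockstep so the gather collectives don't stall when sequences finish at different lengths.

```
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {"stage": 3},
    "bf16": {"enabled": True},
}

tok = AutoTokenizer.from_pretrained("facebook/opt-1.3b")   # placeholder model
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")

# Initialize a ZeRO-3 engine without an optimizer, just for generation.
engine = deepspeed.initialize(model=model, config=ds_config)[0]
engine.module.eval()

inputs = tok("Hello, my name is", return_tensors="pt").to(engine.device)
with torch.no_grad():
    seq = engine.module.generate(**inputs, max_new_tokens=64, synced_gpus=True)
print(tok.decode(seq[0], skip_special_tokens=True))
```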
-
### Describe the issue
-
### Bug summary
Encountered an issue when using `"descriptor": "dpa2"` to train a model from scratch for 500k steps and then testing the model on a merged validation dataset. The merged validatio…
-
`llama.onnx` is primarily intended for understanding LLMs and converting them to NPU.
If you are looking for inference on NVIDIA GPUs, we have released lmdeploy at https://github.com/InternLM/lmdeploy.
…
-
When running the notebook for inference using [Llama3](https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/transformers-neuronx/inference/meta-llama-2-13b-sampling.ipynb)
```…