-
Running into issues when serving Mixtral 8x7B on 4 x H100 (TP=4) with deepspeed-mii v0.2.3, with all other arguments left at their defaults, in the NVIDIA base image `nvidia/cuda:12.3.1-devel-ubuntu22.04`.
The …
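For reference, a minimal sketch of the serving setup described above, following the DeepSpeed-MII persistent-deployment pattern; the checkpoint name and the generation call are assumptions, not taken from the report:
```python
# Minimal sketch of serving a model with DeepSpeed-MII across 4 GPUs (TP=4).
# The model name and generation arguments are illustrative assumptions.
import mii

client = mii.serve(
    "mistralai/Mixtral-8x7B-v0.1",  # assumed Hugging Face checkpoint name
    tensor_parallel=4,              # TP=4, matching the 4 x H100 setup
)
response = client.generate(["DeepSpeed-MII is"], max_new_tokens=64)
print(response)
```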
-
**Describe the bug**
I can use my script to fine-tune the model with ZeRO stage 2 and stage 3. However, when I use ZeRO-Infinity to offload parameters, the following error occurs:
python: /opt/conda/lib/python3.10/site-pack…
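For context, a minimal sketch of a ZeRO-Infinity parameter-offload configuration; the key names follow the DeepSpeed config schema, and all values are illustrative rather than the reporter's actual settings:
```python
# Sketch of a ZeRO stage 3 config with parameter and optimizer offloading
# (ZeRO-Infinity). Values are illustrative, not the reporter's config.
ds_config = {
    "train_batch_size": 16,
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "cpu",   # or "nvme" together with an "nvme_path"
            "pin_memory": True,
        },
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": True,
        },
    },
}
```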
-
**What is your question?**
I have successfully run localcolabfold on our workstation with an A4000 GPU. I tried to install it on a cluster with an A100 80GB GPU to get more GPU memory. The installation wa…
-
**Describe the bug**
I am trying to pretrain an [OLMo](https://github.com/allenai/OLMo) 1B model on 8 MI250 GPUs with the Docker image rocm/pytorch:latest (ROCm 6.1). I'm using a small subset of Dolma …
-
I am getting errors while building the DeepSpeed wheel. Beforehand I set a whole bunch of build options to 0 on the command line, since they were also throwing errors; listing them: DS_BUILD_GDS, DS_BUILD_FP_QUANTIZER, …
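For reference, a sketch of disabling optional ops before a source build, written in Python for consistency with the other snippets; the two flag names come from the report, and the rest of the invocation is an assumption:
```python
# Sketch: disable optional DeepSpeed ops before building, equivalent to
# exporting DS_BUILD_GDS=0 etc. in the shell. The pip invocation is an
# illustrative assumption.
import os
import subprocess

for flag in ("DS_BUILD_GDS", "DS_BUILD_FP_QUANTIZER"):
    os.environ[flag] = "0"  # skip compiling these ops

subprocess.run(["pip", "install", "deepspeed", "--no-cache-dir"], check=True)
```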
-
**Describe the bug**
while converting a sharded ZeRO-3 checkpoint of a LLaVA-style multimodal model, I got the following error:
"""
Traceback (most recent call last):
File "/scratch/hongshal/co…
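For context, a minimal sketch of the usual ZeRO-3 consolidation path via DeepSpeed's `zero_to_fp32` utility; the paths are placeholders, and the exact signature varies between DeepSpeed versions:
```python
# Sketch: consolidate a sharded ZeRO-3 checkpoint into an fp32 state dict.
# Paths are placeholders; the second argument is an output file in older
# DeepSpeed releases and an output directory in recent ones.
from deepspeed.utils.zero_to_fp32 import (
    convert_zero_checkpoint_to_fp32_state_dict,
)

convert_zero_checkpoint_to_fp32_state_dict(
    "checkpoints/llava-run",          # directory with the sharded checkpoint
    "checkpoints/pytorch_model.bin",  # consolidated output
)
```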
-
**Describe the bug**
reduce-scatter cannot be overlapped when using ZeRO
**To Reproduce**
DeepSpeed Configs:
```
json = {
"train_batch_size": 64,
"train_micro_batch_size_per_gpu": 1,
…
```
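The config above is truncated; for reference, a sketch of the ZeRO keys that govern this behavior (key names from the DeepSpeed config schema, values illustrative):
```python
# Sketch: ZeRO settings controlling reduce-scatter and communication overlap.
# Key names follow the DeepSpeed config schema; values are illustrative.
zero_config = {
    "zero_optimization": {
        "stage": 2,
        "reduce_scatter": True,  # use reduce-scatter instead of all-reduce
        "overlap_comm": True,    # overlap gradient comm with the backward pass
    }
}
```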
-
Starting from the code
`pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")`
it does not work (on an A100 with Python 3.10 and CUDA 12.1):
`ImportError: torch_extensions/py310_cu121/ragged_device_ops/ragged_…`
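For reference, the full non-persistent pipeline pattern the snippet is aiming at, following the MII README; the prompt and generation arguments are illustrative:
```python
# Sketch of the intended non-persistent MII pipeline usage; the model name
# is taken from the report, the prompt and arguments are illustrative.
import mii

pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")
outputs = pipe(["DeepSpeed is"], max_new_tokens=64)
print(outputs)
```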
-
## Expected Behavior
Batch is able to run through all the queries in a CSV file
## Current Behavior
Stops running at certain sequences that cause an internal issue.
Input which caused the failur…
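As a hypothetical sketch of the workflow being described, reading queries from a CSV file and running them as one batch; the file name, column name, and pipeline choice are all assumptions:
```python
# Hypothetical sketch of the batch workflow: read queries from a CSV file
# and run them through a pipeline in one batch. File layout, column name,
# and model are all assumptions.
import csv
import mii

pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")  # assumed model

with open("queries.csv", newline="") as f:
    queries = [row["query"] for row in csv.DictReader(f)]

responses = pipe(queries, max_new_tokens=128)
for query, response in zip(queries, responses):
    print(query, "->", response)
```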
-
**Describe the bug**
I have 4 GPUs, but when I set mp_size=3, it fails.
**To Reproduce**
Steps to reproduce the behavior:
```
model_name = "/data/share/rwq/Qwen-7B-Chat"
payload = "你好"
tokeni…
```
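For context, a sketch of the same setup with a tensor-parallel degree that divides the GPU count (and typically the model's attention head count) evenly, which is the usual requirement; all values besides the model path are illustrative:
```python
# Sketch: tensor-parallel inference where mp_size evenly divides the number
# of GPUs (and the model's attention heads). mp_size=3 with 4 GPUs, or with
# a head count not divisible by 3, is a common cause of this failure.
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "/data/share/rwq/Qwen-7B-Chat"  # path from the report
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

engine = deepspeed.init_inference(model, mp_size=4, dtype=torch.float16)
```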