-
**Describe the bug**
StarCoder inference with AutoTP doesn't work.
I get the following error:
```
File "[...]/venv38/lib64/python3.8/site-packages/transformers/models/gpt_bigcode/modeling_gpt_b…
```
Epliz updated 8 months ago
-
**Describe the bug**
I installed DeepSpeed with `pip install deepspeed` and tried to use DeepSpeedCPUAdam, but got this error:
```
Exception ignored in:
Traceback (most recent call last):
File …
```
-
auto.json:
```
{
"train_micro_batch_size_per_gpu": "auto",
"fp16": {
"enabled": true
},
"autotuning": {
"enabled": true,
"fast": false,
"overwrite": t…
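```

For completeness, an autotuning section of this shape typically looks like the sketch below. Only the keys visible in the truncated snippet above come from the report; the remaining values (`metric`, `results_dir`, `exps_dir`) are assumptions based on the DeepSpeed autotuning documentation, not the reporter's actual file:

```json
{
  "train_micro_batch_size_per_gpu": "auto",
  "fp16": {
    "enabled": true
  },
  "autotuning": {
    "enabled": true,
    "fast": false,
    "overwrite": true,
    "metric": "throughput",
    "results_dir": "autotuning_results",
    "exps_dir": "autotuning_exps"
  }
}
```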
-
**Describe the bug**
When I run training for RLHF step 3:
```
Actor_Lr=9.65e-6
Critic_Lr=5e-6
#--data_path Dahoas/rm-static \
#--offload_reference_model \
deepspeed --master_port 12346 main_step3.py…
```
-
**Describe the bug**
DeepSpeed segfaults when loading CPU_ADAM, with both ZeRO-2 and ZeRO-3 configs via the Hugging Face transformers integration.
**ZeRO Configurations**
- Zero-2
```
{
"fp16":…
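```

Since the config above is cut off, here is a minimal ZeRO-2 + fp16 config of the kind commonly used with the Hugging Face integration; everything beyond the `fp16` key shown above is an assumption, not the reporter's actual file. The `offload_optimizer` block is what pulls in the CPU_ADAM kernel mentioned in this report:

```json
{
  "fp16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu"
    }
  },
  "train_micro_batch_size_per_gpu": "auto"
}
```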
-
**Describe the bug**
I'm currently using the HF Trainer for training, with the HF learning rate scheduler and DeepSpeed optimizer. I've encountered an issue with loading universal checkpoints. The HF…
-
**Describe the bug**
After upgrading to DeepSpeed 0.14.3, training makes no progress because all gradients and gradient norms are zero. From git bisect, I believe it was introduced by this PR:
https://git…
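The `git bisect` workflow mentioned above can be automated with `git bisect run`, which replays a test command across commits and reports the first failing one. A minimal, self-contained sketch on a throwaway repository (the repo and the "bug" here are invented for illustration, not DeepSpeed itself):

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.email "ci@example.com"
git config user.name "ci"

# Four commits; the "bug" (state flips from good to bad) lands at c3.
echo good > state; git add state; git commit -qm "c1: ok"
git commit -q --allow-empty -m "c2: ok"
echo bad > state; git commit -qam "c3: introduces bug"
git commit -q --allow-empty -m "c4: still bad"

# bad = HEAD (c4), good = HEAD~3 (c1)
git bisect start HEAD HEAD~3
# Exit 0 marks a commit good, nonzero marks it bad; bisect converges on c3.
git bisect run grep -q good state | tee bisect.out
git bisect reset
```

The key convention is the test command's exit code: `git bisect run` treats 0 as "good" and 1–127 (except 125, which means "skip") as "bad", then prints the first bad commit.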
-
**Describe the bug**
For ZeRO-3, I'm noticing increased training times on g5.48xlarge nodes with torch >= 2.3.1 and CUDA 12.1. I can reproduce this with both small and large models, and in some cases…
-
After installing deepspeed 0.15.0 via pip3, I ran ds_report to check the compatibility of various features.
I get the following messages when checking GDS compatibility:
```
[2024-08-29 15:16:37,…
```
-
**Describe the bug**
When training [llama-vid](https://github.com/dvlab-research/LLaMA-VID) (stage 2, full fine-tuning of LLaMA) with deepspeed==0.14.0 and the transformers Trainer, grad_norm becomes nan (or 1…
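A cheap way to catch the symptom reported above is to check the global gradient norm for non-finite values every step. A framework-free sketch in plain Python (the helper names and stand-in numbers are invented for illustration; in a real Trainer run the per-parameter norms would come from the model's gradients):

```python
import math

def global_grad_norm(per_param_norms):
    """L2 norm over a list of per-parameter gradient norms.

    A single NaN or inf in any parameter's gradient propagates into
    the global norm, which is exactly what loggers report as grad_norm.
    """
    return math.sqrt(sum(n * n for n in per_param_norms))

def check_step(per_param_norms, step):
    """Raise as soon as the global grad norm goes non-finite."""
    norm = global_grad_norm(per_param_norms)
    if math.isnan(norm) or math.isinf(norm):
        raise RuntimeError(f"step {step}: non-finite grad_norm {norm}")
    return norm

# Healthy step: prints a small finite norm.
print(check_step([0.1, 0.2], 0))
```

Failing fast like this makes a bisect (as in the report two entries up) much quicker, since the bad run aborts on the first broken step instead of silently training on zero or NaN gradients.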