-
**Describe the bug**
I use ZeRO stage 1 to train a UNet with the following deepspeed_config. I set 10 epochs, and the output during training is as follows:
```json
{
"train_m…
```
-
**Describe the bug**
I'm trying to use DeepSpeed to fine-tune a BERT-based classification model, but when launching multi-node training, all nodes, including localhost, get errno: 110 - Connection t…
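errno 110 is a TCP connection timeout, which usually points at the rendezvous between nodes rather than at the model itself. A minimal sketch of isolating this (the NCCL backend and the `eth0` interface name are assumptions) is to bring up the process group on its own before any training code:
```python
import os
import deepspeed
import torch.distributed as dist

# Assumed NIC name; replace with the interface that is actually reachable
# between nodes. Picking the wrong interface is a common cause of errno 110.
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")

# Initialize the process group the same way a training script would.
deepspeed.init_distributed(dist_backend="nccl")
print(f"rank {dist.get_rank()} of {dist.get_world_size()} is reachable")
```
Launching this with the same hostfile and `deepspeed` command used for training shows whether the timeout comes from cluster networking or from the fine-tuning script itself.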
-
Hi,
I'm using the released training data on AWS and the latest main branch to train the model.
1. The directory structure of the released data is not recognized by the code.
2. After re-struc…
-
**Describe the bug**
I was trying to run inference with DeepSpeed on the Llama model, but when I ran `deepspeed --num_gpus 4 script.py`, the process terminated automatically after loading the ch…
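For context, a multi-GPU Llama inference script launched with `deepspeed --num_gpus 4` typically follows the pattern below; the checkpoint name, dtype, and use of kernel injection are assumptions for illustration, not details taken from the report:
```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# Shard the model across the GPUs handed to the launcher via --num_gpus.
ds_engine = deepspeed.init_inference(
    model,
    tensor_parallel={"tp_size": 4},
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)
model = ds_engine.module

inputs = tokenizer("DeepSpeed is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```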
-
**Describe the bug**
A clear and concise description of what the bug is. Please include which training step you are using and which model you are training.
Training Step: 3-RLHF
Training model: act…
-
**Describe the bug**
When training DeepSpeed-Chat Step 3 with **ZeRO-3** (without hybrid engine), if we set `generation_batches >= 3`, or `generation_batches >= 2` and `ppo_epochs >= 2`, DeepSpeed will rai…
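For reference, `generation_batches` and `ppo_epochs` are command-line arguments of the DeepSpeed-Chat step 3 script, while the ZeRO-3 / hybrid-engine choice lives in the actor's DeepSpeed config. A rough sketch of the config shape being described (stage 3, hybrid engine disabled; all values are placeholders) is:
```python
# Sketch of an actor config with ZeRO-3 and the hybrid engine turned off.
# Batch sizes and thresholds are placeholders, not values from the report.
actor_ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "stage3_param_persistence_threshold": 1.0e4,
    },
    "hybrid_engine": {"enabled": False},
}
```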
-
**Describe the bug**
I load the Llama 2 70B model in 4-bit (bitsandbytes) and then distribute the model by calling `deepspeed.initialize`, and get the following error:
```
------------------------…
```
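As I read it, the setup is roughly the sketch below; the checkpoint name, quantization settings, ZeRO stage, and optimizer section are assumptions added for illustration:
```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load the 70B model in 4-bit via bitsandbytes (placeholder settings).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",  # placeholder checkpoint
    quantization_config=bnb_config,
)

# Minimal DeepSpeed config; the stage and optimizer are assumptions.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
    "zero_optimization": {"stage": 3},
}

# The reported error is raised when wrapping the 4-bit model here.
engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```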
-
Right now openfold requires an old PyTorch + CUDA (11.2), so the latest Linux is not able to build openfold.
Would like to upgrade the supported PyTorch + CUDA and other Python packages accordingly, so …
-
**Describe the bug**
DeepSpeed ZeRO++ features aren't working:
1. On a single node, passing `zero_hpz_partition_size`, `zero_quantized_gradients`, `zero_quantized_weights` leads to forward pass err…
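For reference, the ZeRO++ flags named above live inside the `zero_optimization` section of the DeepSpeed config; a minimal sketch (the stage, batch size, and group size are placeholders) looks like:
```python
# ZeRO++ options sit under zero_optimization; values are placeholders.
ds_config = {
    "train_micro_batch_size_per_gpu": 2,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "zero_hpz_partition_size": 8,      # hpZ: hierarchical partitioning group size
        "zero_quantized_weights": True,    # qwZ: quantized weight communication
        "zero_quantized_gradients": True,  # qgZ: quantized gradient communication
    },
}
```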
-
I am seeing the following error when trying to run it on an aarch64 machine with an H100.
Linux r8-u37 6.5.0-1019-nvidia-64k #19-Ubuntu SMP PREEMPT_DYNAMIC Tue May 7 12:54:40 UTC 2024 aarch64 aarch64 …