-
Using the latest main to train a YoloV9e object detector:
```
[rank0]: train_one_epoch(train_loader, model, args, model_dtype)
[rank0]: File "/mnt/dingus_drive/catid/train_detector/train.py…
-
**What's the issue, what's expected?**:
Some attributes of the regular `deepspeed.runtime` are missing in this repo and are not covered by the monkey-patch, such as:
```python
from dee…
-
**Describe the bug**
I installed DeepSpeed using pip, and training with it was failing. I checked `ds_report` and found an error there, but I'm not able to understand what it means. Can you h…
-
Hi,
(As per [that request](https://github.com/matatonic/openedai-speech/issues/58#issuecomment-2351083007))
Deepspeed seems to be a library that increases speed for AI related code that support i…
-
I have now switched to a brand-new Conda environment, and running `pip install -e .` produces the following error:
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error
× python setup.py egg_info did not run successfully.
│ exit code: 1
…
yibie updated
2 weeks ago
-
![Gradient explosion](https://github.com/user-attachments/assets/b30ba4ef-5689-4933-bd71-cbb8c7a02c72)
I used a self-built FIM SFT dataset for fine-tuning, and encountered abnormal loss when training with DeepSp…
-
**Describe the bug**
Training an HF model (Llama 3.1 with PEFT) on long context with `sequence_parallel_size > 1` works only up to ZeRO stage 2.
If I set "stage" to 3 I get the following error:
`…
-
### Describe the bug
I'm using `train_dreambooth_flux.py` to fine-tune FLUX. I get OOM on 4x A100 80GB with DeepSpeed stage 2, gradient checkpointing, bf16 mixed precision, 1024px × 1024px input, adafac…
-
I got this issue when running the command `deepspeed --master_port 29500 --num_gpus=2 1-pretrain.py`:
```
CUDA_HOME does not exist, unable to compile CUDA op(s)
```
Here is the full log
```
$ deeps…
-
I'm using Windows 11. If I try to install from `requirements.txt`, DeepSpeed will not install because it says torch needs to be installed, so maybe the instructions are out of order. Here is what happens if …