-
**Environment:**
Ubuntu 22.04.4 LTS
Cuda compilation tools, release 12.1, V12.1.66
Build cuda_12.1.r12.1/compiler.32415258_0
`ds_report` output is added at the end of the description
**Issue:** Not able to…
-
# DeepSpeed Investigation: What I Learned | IAmANerd
An investigation into the awesome DeepSpeed library for training large models on a single GPU!
[https://nathancooper.io/i-am-a-nerd/deepspeed/dee…
-
**Describe the bug**
**To Reproduce**
Steps to reproduce the behavior:
[the official doc](https://github.com/microsoft/DeepSpeed/blob/master/b…
-
### 🐛 Describe the bug
Background of the issue:
DeepSpeed relies heavily on `param.data = other.data` for ZeRO3 parameter offload. ZeRO3 also depends on registering a hook on the param's AccumulateGrad ob…
-
As an owner of a Radeon 7900 XTX, I'm wondering if this project could be made to support AMD cards too. The problem is the `xformers` dependency which does not support AMD cards. Does Open-Sora-Plan u…
-
### Is there an existing issue for this?
- [X] I have searched the existing issues
### Current Behavior
When training in full precision on multiple GPUs, compiling the torch extensions fails with an error:
In file included from /home/adamzhangchao/anaconda3/e…
-
Error occurred running bing_bert/ds_train_bert_nvidia_data_bsz64k_seq128.sh
>Detected CUDA files, patching ldflags
Emitting ninja build file /home/bduser/.cache/torch_extensions/py38_cu114/fused_l…
-
**Describe the bug**
The builds on conda-forge have been failing since `deepspeed=0.14.1` for CUDA 11.8 and 12.0 with an error like `fatal error: oneapi/ccl.hpp: No such file or directory`. Origina…
-
**Motivation:**
Currently, when using the Transformers library in combination with DeepSpeed to train large language models (LLMs), checkpoints (e.g. `bf16_zero_pp_rank_0_mp_rank_00_optim_stat…
-
I am trying to reproduce the FLAN-T5-XXL (11B) results from [this blog post](https://www.philschmid.de/fine-tune-flan-t5-deepspeed).
I have an 8xA10G instance. Since the blog shows that you can run…