-
Not a particular issue I was facing, but a very offline friend who was initializing DeepSpeed got an error because `HostName -I` isn't available on the M1 chip, or on macOS in general.
A workaround I…
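
One possible fallback is sketched below. It is not necessarily the workaround described in this report, and it makes no claim about how DeepSpeed resolves addresses internally. It assumes the goal is simply to obtain a local IP address: it first tries `hostname -I` (available on most Linux systems) and falls back to the Python standard library when that call fails. The helper name `get_host_addresses` is hypothetical.

```python
import socket
import subprocess


def get_host_addresses():
    """Return a list of local IP address strings (hypothetical helper)."""
    try:
        # `hostname -I` works on most Linux systems but not on macOS.
        result = subprocess.run(
            ["hostname", "-I"], capture_output=True, text=True, check=True
        )
        addresses = result.stdout.split()
        if addresses:
            return addresses
    except (subprocess.CalledProcessError, OSError):
        # Flag not supported, or the binary is missing: use the fallback.
        pass
    try:
        # Portable fallback: resolve the local hostname.
        return [socket.gethostbyname(socket.gethostname())]
    except OSError:
        # Last resort when the hostname cannot be resolved.
        return ["127.0.0.1"]
```

The fallback may return a loopback address on some machines, so it suits single-node use better than multi-node launches.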
-
### 🐛 Describe the bug
DeepSpeed uses a lot of "param.data =" statements to update the param data by gathering the param from other ranks.
While we found param.data= assignment under torch.compile m…
-
I couldn't keep a stable connection to huggingface.co due to network issues, and I got a ConnectionError when running the usage example you provided, so I changed the configuration of mii_config with the fo…
-
### 🐛 Describe the bug
When I try to train qwen2.5-7B-instruct using DeepSpeed, it shows the following error:
```
Traceback (most recent call last):
File "/home/work/ybs/deeplm/LLM/train.py…
-
Hi all,
I wanted to try to add support for multi-GPU training to allow the fine-tuning of LLMs. I've already [opened an issue](https://github.com/lxuechen/private-transformers/issues/31) a few week…
-
When I run the EVA script, it hangs and does not execute. The log output is as follows:
root@cfea9da46cdd:/mnt/EVA/src/scripts# bash infer_enc_dec_interactive.sh
/opt/conda/bin/deepspeed --num_nodes 1 --num_gpus 1 --master_port 4586 --hostfile /mnt/EVA/src/conf…
-
As discussed a long time ago in a meeting, it would be really great if we had a feature to save the model and stop training after a certain time, as the jobs on the JZ cluster are limited to 20 hours.
…
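
A minimal sketch of such a time limit, assuming a plain Python training loop, could look like the following. The names `TimeBudget`, `train`, `train_step`, and `save_checkpoint` are hypothetical placeholders and do not correspond to any existing DeepSpeed or Megatron option.

```python
import time


class TimeBudget:
    """Hypothetical helper that tracks a wall-clock budget for a job."""

    def __init__(self, limit_seconds, margin_seconds=0.0, clock=time.monotonic):
        # The margin leaves time to write the checkpoint before the job is killed.
        self.clock = clock
        self.deadline = clock() + limit_seconds - margin_seconds

    def expired(self):
        return self.clock() >= self.deadline


def train(num_steps, budget, train_step, save_checkpoint):
    """Run up to num_steps steps, stopping early once the budget is spent."""
    completed = 0
    for step in range(num_steps):
        train_step(step)
        completed = step + 1
        if budget.expired():
            break
    # Save once on exit, whether training finished or ran out of time.
    save_checkpoint(completed)
    return completed
```

For a 20-hour job, something like `TimeBudget(20 * 3600, margin_seconds=15 * 60)` would stop roughly 15 minutes before the limit. In a multi-process run, the ranks would also need to agree on the stop decision (for example by broadcasting it from one rank), which this sketch does not cover.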
-
I encountered an error
`FileNotFoundError: [Errno 2] No such file or directory: '/home/cloud/.cache/huggingface/hub/models--yentinglin--Taiwan-LLM-13B-v2.0-chat/snapshots/419f643a34e4aa53ee5bc87bc1…
-
When I try to run a multi-node job across 2 H100 nodes, I get this error most of the time. Any ideas?
```
pytorchjob-summarization-long-data-8vry-ravi-agrawa-worker-2:429:429 [3] NCCL INFO cu…
-
Trying to fine-tune the bigcode/starcoderbase model on an A100 compute instance with 2 GPUs, 40 GB x 2, so 80 GB in total.
Finetune.py is slightly modified: it loads the model in 4-bit, adopts QLoRA, and also uses DeepSpeed. T…