deepspeed-library Search Results

1000+ results
for deepspeed-library

microsoft/DeepSpeed #5597

[BUG] M1 Mac has an issue with `hostname -I` not being a val…

Not a particular issue I was facing, but a very offline friend who was initializing deepspeed got an error because `hostname -I` isn't available on the M1 chip or OSX in general. A workaround I…

AbhinavMir updated 2 months ago
9
pytorch/pytorch #139152

Dynamo capture of tensor.data assignment doesn't identical t…

### 🐛 Describe the bug DeepSpeed uses a lot of "param.data =" statements to update the param data by gathering the param from other ranks. However, we found that param.data= assignment under torch.compile m…

jerrychenhf updated 5 days ago
4
microsoft/DeepSpeed-MII #226

Error reasoning about llama2-7b-hf model using MII-Public

I couldn't keep a stable connection with huggingface.co due to network reasons, and I got ConnectionError using the usage example you provided, so I changed the configuration of mii_config with the fo…

ly19970621 updated 1 year ago
4
pytorch/pytorch #136979

pytorch 2.1.2+cu118, RTX 8000, backward show RuntimeError: E…

### 🐛 Describe the bug When I want to train qwen2.5-7B-instruct using deepspeed, it shows the following error: ``` Traceback (most recent call last): File "/home/work/ybs/deeplm/LLM/train.py…

KiriEu updated 4 weeks ago
1
lxuechen/private-transformers #32

Support for multi-gpu private fine-tuning

Hi all, I wanted to try and add support for multi-gpu training to allow the fine-tuning of LLM. I've already [opened an issue](https://github.com/lxuechen/private-transformers/issues/31) a few week…

Pier297 updated 1 year ago
2
BAAI-WuDao/EVA #4

EVA script problem: the script fails to run and I need help locating the problem, thanks

When running the EVA script, it hangs and does not execute. The log output is as follows: root@cfea9da46cdd:/mnt/EVA/src/scripts# bash infer_enc_dec_interactive.sh /opt/conda/bin/deepspeed --num_nodes 1 --num_gpus 1 --master_port 4586 --hostfile /mnt/EVA/src/conf…

gaoyuan211 updated 3 years ago
8
bigscience-workshop/metadata #67

feat: save the model and stop training based on `exit-durati…

As discussed a long time ago in a meeting, it would be really great if we had a feature to save the model and stop training after a certain time, as the jobs on the JZ cluster are limited to 20 hours. …

SaulLu updated 2 years ago
2
microsoft/DeepSpeed-MII #272

`FileNotFoundError: No such file or directory: pytorch_model…

I encountered an error `FileNotFoundError: [Errno 2] No such file or directory: '/home/cloud/.cache/huggingface/hub/models--yentinglin--Taiwan-LLM-13B-v2.0-chat/snapshots/419f643a34e4aa53ee5bc87bc1…

ngitnenlim updated 12 months ago
1
NVIDIA/nccl #1016

2 * 8 H100 GPU : NCCL error in: /opt/pytorch/pytorch/torch/c…

When I try to run a multi-node job between 2 H100 nodes, most of the time I get this error. Any ideas? ``` pytorchjob-summarization-long-data-8vry-ravi-agrawa-worker-2:429:429 [3] NCCL INFO cu…

imraviagrawal updated 5 months ago
13
bigcode-project/starcoder #86

RuntimeError: RuntimeError: IndexError: list index out of ra…

Trying to fine-tune the bigcode/starcoderbase model on an A100 machine with 2 GPUs, 40 GB x 2, so 80 GB. Finetune.py is slightly modified to load the model in 4-bit and to adopt QLoRA and also DeepSpeed. T…

Kushalamummigatti updated 1 year ago
3
