-
### Your current environment
```text
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.6 LTS (x86_64)
GCC …
```
-
Is there any way to save intermediate checkpoints during training?
Sometimes my training fails partway through due to external reasons, so it would be helpful to save a checkpoint every N steps so that I can continue from the last one.
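A common pattern is to checkpoint periodically from inside the training loop. Here is a minimal plain-PyTorch sketch; the training loop, `model`, `optimizer`, and the save interval are all placeholders, not anything from a specific framework:

```python
import torch

SAVE_EVERY = 500  # illustrative interval; tune to your failure tolerance


def save_checkpoint(model, optimizer, step, path):
    # Persist everything needed to resume: weights, optimizer state, and step.
    torch.save({
        "step": step,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    }, path)


def load_checkpoint(model, optimizer, path):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model_state_dict"])
    optimizer.load_state_dict(ckpt["optimizer_state_dict"])
    return ckpt["step"]  # resume the loop from this step


# Inside the training loop:
# for step, batch in enumerate(dataloader, start=start_step):
#     ...
#     if step % SAVE_EVERY == 0:
#         save_checkpoint(model, optimizer, step, f"ckpt_step{step}.pt")
```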
-
### Your current environment
Environment:
torch 2.3.0
vllm 0.5.0.post1
transformers 4.41.2
Main error behavior:
A smaller MoE model, '/data/models/qwen/qwen1.5-2.7Bmoe', does not hit the problem; larger ones fail with the error shown at the bottom.
Code:
```python
from vllm.engine.arg_ut…
```
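For reproduction purposes, a minimal vLLM 0.5.x offline-inference sketch using the smaller model path from the report; the prompt and sampling settings are made up:

```python
from vllm import LLM, SamplingParams

# Model path taken from the report; trust_remote_code is commonly needed
# for Qwen checkpoints.
llm = LLM(model="/data/models/qwen/qwen1.5-2.7Bmoe", trust_remote_code=True)

outputs = llm.generate(["Hello"], SamplingParams(temperature=0.8, max_tokens=32))
print(outputs[0].outputs[0].text)
```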
-
I am trying to run a training script using DeepSpeed on eight 32 GB V100 GPUs.
For debugging, I enabled the following flags:
```
NCCL_DEBUG=INFO
NCCL_DEBUG_SUBSYS=INIT,GRAPH
NCCL_TOPO_DUMP_FILE=top…
```
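These NCCL variables are read when the communicator is first created, so they must be in the environment before `torch.distributed` (or DeepSpeed) initializes the process group. One way to guarantee that from Python; the dump path here is hypothetical, not the one from the report:

```python
import os

# Set before any torch.distributed / deepspeed initialization,
# otherwise NCCL will not pick these up.
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,GRAPH"
# NCCL_TOPO_DUMP_FILE writes the detected topology to this path
# (hypothetical path for illustration).
os.environ.setdefault("NCCL_TOPO_DUMP_FILE", "/tmp/nccl_topo.xml")
```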
-
### System Info
- `transformers` version: 4.43.0.dev0
- Platform: Linux-5.15.0-117-generic-x86_64-with-glibc2.35
- Python version: 3.10.12
- Huggingface_hub version: 0.23.4
- Safetensors versio…
-
I am trying to run the finetuning script on eight 32 GB V100 GPUs. I am launching with torchrun and using DeepSpeed with both parameter and optimizer offload, plus a few minor modifications:
```
to…
```
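For context, parameter and optimizer offload are controlled by the `zero_optimization` section of the DeepSpeed config. A minimal sketch of that shape as a Python dict; all values are illustrative, not the reporter's actual settings:

```python
# Illustrative ZeRO-3 config with both parameter and optimizer CPU offload.
# Pass it as the `config` argument to deepspeed.initialize(), or save it as
# the JSON file referenced by the launcher.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # placeholder batch size
    "fp16": {"enabled": True},             # V100s have no bf16 support
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}
```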
-
```
nproc_per_node=4
CUDA_VISIBLE_DEVICES=0,1,2,3 \
NPROC_PER_NODE=$nproc_per_node \
swift sft \
  --model_id_or_path "AI-ModelScope/llava-v1.6-mistral-7b" \
  --template_type "llava-mistral-inst…
```
-
### 🚀 The feature, motivation and pitch
Hi PyTorch maintainers,
I am currently training multiple large language models (LLMs) sequentially on a single GPU machine, utilizing FullShard…
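For reference, the minimal shape of an FSDP-wrapped setup under the current API; this assumes a torchrun-style single-node launch, and the model is a stand-in:

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumes torchrun set RANK/WORLD_SIZE etc.; single-node, so global
# rank doubles as the local device index.
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())

model = torch.nn.Linear(1024, 1024).cuda()  # stand-in for a real LLM
model = FSDP(model)  # shards parameters, gradients, and optimizer state
```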
-
Here's the call I'm using to run the script:
```
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file examples/hf-alignment-handbook/configs/accelerate_configs/deepspeed_zero3.yaml --num_proces…
```
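For context, a script launched this way normally uses only the generic `accelerate` API, with the ZeRO-3 details supplied entirely by the YAML file. A minimal sketch with a stand-in model and optimizer:

```python
import torch
from accelerate import Accelerator

# Picks up the DeepSpeed plugin from the launch config automatically.
accelerator = Accelerator()

model = torch.nn.Linear(8, 8)  # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model, optimizer = accelerator.prepare(model, optimizer)

# In the training loop, use accelerator.backward(loss) instead of
# loss.backward() so DeepSpeed's engine handles the backward pass.
```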
-
1×8 H100 DGX box
Torch version: 2.1.1
CUDA version: 12.1
vLLM: 0.2.3
Inference works just fine with tensor parallel size 1, but when using **tp > 1** I get the error below:
WARNING 12-0…
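For reference, tensor parallelism in the vLLM offline API is requested through `tensor_parallel_size`; in the 0.2.x series, tp > 1 launches one worker per GPU through Ray, which is where multi-GPU runs typically diverge from the tp = 1 path. A minimal sketch with a placeholder model name:

```python
from vllm import LLM, SamplingParams

# tensor_parallel_size shards the model across that many GPUs; the model
# name here is a placeholder, not the reporter's model.
llm = LLM(model="meta-llama/Llama-2-7b-hf", tensor_parallel_size=2)

outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```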