-
I am using two machines, each with 4 GPUs. I run the command: accelerate launch --dynamo_backend no --machine_rank 0 --main_process_ip 192.168.68.249 --main_process_port 27828 --mixed_precision no --multi_gpu --num_machines 2 --num_processe…
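For reference, a minimal connectivity-check sketch (a hypothetical script, not train_decoder.py) that can be launched with the same command on both machines, changing only `--machine_rank` to 1 on the second node; each process reports its rank so you can confirm all 8 processes (2 machines × 4 GPUs) joined one group.
```python
# Hypothetical sanity-check script; assumes the same `accelerate launch ...` command is
# run on both machines, with `--machine_rank 1` on the second one.
from accelerate import Accelerator

accelerator = Accelerator()

# With 2 machines x 4 GPUs, num_processes should report 8 once both nodes connect.
print(
    f"global rank {accelerator.process_index}/{accelerator.num_processes}, "
    f"local GPU {accelerator.local_process_index}"
)
accelerator.wait_for_everyone()
accelerator.print("all processes reached the barrier")  # printed on the main process only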
-
Encountered the following error while trying to run train_decoder.py
```sh
[2024-11-05 06:09:22,154] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
…
-
### Bug description
The DDP training gets stuck at the 1st iteration, and it is always waiting on a pid:
![image](https://github.com/user-attachments/assets/e85d5e39-a24e-41e0-8bea-bcaa004a3473)
os.waitpid()…
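A diagnostic sketch (not the reporter's configuration): turning on NCCL and torch.distributed debug output and setting an explicit timeout so a hang like the one above surfaces as an error instead of blocking indefinitely in os.waitpid().
```python
# Hypothetical diagnostic settings; the env vars must be set before the process group is created.
import datetime
import os

import torch.distributed as dist

os.environ.setdefault("NCCL_DEBUG", "INFO")                 # log NCCL transport/topology setup
os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")  # report mismatched/missing collectives

dist.init_process_group(
    backend="nccl",
    # With an explicit timeout, a stuck collective raises instead of hanging forever
    # (for NCCL this relies on async error handling, enabled by default in recent PyTorch).
    timeout=datetime.timedelta(minutes=10),
)
```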
-
### Reminder
- [X] I have read the README and searched the existing issues.
### System Info
- `llamafactory` version: 0.9.1.dev0
- Platform: Linux-5.15.0-125-generic-x86_64-with-glibc2.31
-…
-
Hi, I'm using the tutorial https://github.com/pytorch/tutorials/blob/master/intermediate_source/ddp_tutorial.rst for DDP training, with 4 GPUs in my own code, following the Basic Use Case. But when I …
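A condensed sketch in the spirit of the tutorial's Basic Use Case for the 4-GPU setup described above; the toy model and random data are placeholders for the reporter's own code.
```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def demo_basic(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    model = nn.Linear(10, 10).to(rank)
    ddp_model = DDP(model, device_ids=[rank])  # gradients are all-reduced across the 4 ranks

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.001)
    loss_fn = nn.MSELoss()

    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10).to(rank))
    loss_fn(outputs, torch.randn(20, 10).to(rank)).backward()
    optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 4  # one process per GPU
    mp.spawn(demo_basic, args=(world_size,), nprocs=world_size)
```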
-
**Describe the bug**
After training a TFT with the `ddp_spawn` strategy on multiple GPUs in Amazon SageMaker, the prediction returned by the trainer is None, leading to a `TypeError: 'NoneType' object is …
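One hedged workaround sketch (a tiny placeholder LightningModule stands in for the TFT; only the Trainer wiring matters): fit under `ddp_spawn`, then run prediction in a fresh single-process trainer so the result is produced in the parent process rather than in a spawned worker that cannot return it.
```python
# Hypothetical minimal example; TinyModel and the random dataset are placeholders.
import pytorch_lightning as pl
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


class TinyModel(pl.LightningModule):
    """Placeholder for the TFT."""

    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.layer(x), y)

    def predict_step(self, batch, batch_idx):
        x, _ = batch
        return self.layer(x)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)


if __name__ == "__main__":
    loader = DataLoader(TensorDataset(torch.randn(64, 8), torch.randn(64, 1)), batch_size=16)
    model = TinyModel()

    # Fit under ddp_spawn across the SageMaker GPUs ...
    pl.Trainer(accelerator="gpu", devices=4, strategy="ddp_spawn", max_epochs=1).fit(model, loader)

    # ... then predict in a single-process trainer, so predict() returns the batches
    # in the parent process instead of None.
    predictions = pl.Trainer(accelerator="gpu", devices=1).predict(model, dataloaders=loader)
    print(predictions[0].shape)
```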
-
Torch splits TensorDatasets across processes when using a distributed data-parallel strategy (default with multiple CUDA-enabled GPUs)
https://pytorch.org/docs/stable/data.html#loading-batched-and-no…
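A minimal sketch of that splitting behaviour via `DistributedSampler` (the `num_replicas`/`rank` values are hard-coded here for illustration; in a real DDP run they are inferred from the initialized process group).
```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(100, dtype=torch.float32).unsqueeze(1))

# Each rank sees a distinct shard of the 100 samples; with 4 processes that is 25 per rank.
sampler = DistributedSampler(dataset, num_replicas=4, rank=0)  # normally derived from dist.get_rank()
loader = DataLoader(dataset, batch_size=10, sampler=sampler)

print(len(sampler), "of", len(dataset), "samples assigned to this rank")  # 25 of 100
```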
-
Hey everyone,
We wanted to let you know that we are considering the deprecation of the DataParallel (`torch.nn.DataParallel`, a.k.a. DP) module with the upcoming v1.11 release of PyTorch. Our plan is t…
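For context, a hedged migration sketch (a hypothetical single-node script, launched with `torchrun --nproc_per_node=<num_gpus>`) showing the wrapper change that moving from DP to DDP implies.
```python
# Hypothetical migration example: replace the single-process nn.DataParallel wrapper
# with one DistributedDataParallel process per GPU.
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Before (DP): model = nn.DataParallel(nn.Linear(10, 10).cuda())

# After (DDP), launched via `torchrun --nproc_per_node=<num_gpus> script.py`:
dist.init_process_group("nccl")             # reads RANK/WORLD_SIZE/MASTER_* set by torchrun
local_rank = int(os.environ["LOCAL_RANK"])  # also set by torchrun
torch.cuda.set_device(local_rank)

model = nn.Linear(10, 10).cuda(local_rank)
model = DDP(model, device_ids=[local_rank])  # gradient all-reduce across processes
```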
-
## Bug Description
I am running a distributed Linear model (20 parameters) across 2 GPU nodes, each node having 2 NVIDIA H100 NVL GPUs. The model uses the DDP parallelization strategy. I am generating…
-
When I run this example of [training on multiple GPUs using Distributed Data Parallel (DDP)](https://docs.lightly.ai/self-supervised-learning/examples/simclr.html) on AWS SageMaker with 4 GPUs and …