-
When I use multi-GPU training, I encounter the following problem:
subprocess.CalledProcessError: Command '['/home/a/anaconda3/envs/mambayolo/bin/python', '-m', 'torch.distributed.run', '--nproc_per…
-
Why does the code have `parser.add_argument('--local_rank', type=int, default=-1, help='DDP parameter, do not modify')`? If I want to use DDP, should I change the default to 0?
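As a point of reference, a minimal sketch (not taken from the repository in question) of how `--local_rank` is typically resolved: `torchrun` / `python -m torch.distributed.run` set the `LOCAL_RANK` environment variable for every worker, and the older `torch.distributed.launch` passed `--local_rank` as an argument, so the `-1` default is only a sentinel for "not launched in DDP mode" and normally does not need to be edited.

```python
# Sketch: how the -1 default usually interacts with the DDP launcher.
import argparse
import os

import torch

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=-1,
                    help='DDP parameter, do not modify')
args = parser.parse_args()

# Prefer the env var set by torchrun/torch.distributed.run; fall back to the
# CLI argument used by the older torch.distributed.launch.
local_rank = int(os.environ.get('LOCAL_RANK', args.local_rank))

if local_rank != -1:
    # Launched by a DDP launcher: bind this process to its own GPU.
    torch.cuda.set_device(local_rank)
    torch.distributed.init_process_group(backend='nccl')
# Otherwise the script runs as a plain single-process job.
```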
-
Hi @chrockey, great work!
Can you guide me on how to set up multi-GPU training? I only have 20 GB GPUs available, and when using a batch size of 2 I get poor performance (~6% lower mIoU and mAcc; pr…
-
I am trying to train the CausalVAE on my own dataset on 4 GPUs, but all of the memory is used by device 0 alone. Is distributed processing not incorporated into the training code?
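For context, memory piling up on device 0 usually means the job is running as a single process (or replicating through GPU 0 via `nn.DataParallel`) rather than one process per GPU. Below is a minimal, hedged sketch of the one-process-per-GPU `DistributedDataParallel` pattern, assuming a launch like `torchrun --nproc_per_node 4 train.py`; the linear model and random data are placeholders standing in for the real CausalVAE and dataset.

```python
# Per-rank DDP sketch: each process owns exactly one GPU.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def main():
    local_rank = int(os.environ['LOCAL_RANK'])  # set by torchrun
    torch.cuda.set_device(local_rank)           # keep this rank off GPU 0
    dist.init_process_group(backend='nccl')

    model = torch.nn.Linear(128, 128).cuda(local_rank)  # placeholder model
    model = DDP(model, device_ids=[local_rank])

    dataset = TensorDataset(torch.randn(1024, 128))      # placeholder data
    sampler = DistributedSampler(dataset)                 # shard data per rank
    loader = DataLoader(dataset, batch_size=8, sampler=sampler)

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    for epoch in range(2):
        sampler.set_epoch(epoch)
        for (x,) in loader:
            x = x.cuda(local_rank, non_blocking=True)
            loss = model(x).pow(2).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()


if __name__ == '__main__':
    main()
```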
-
Same here. I had finished pretraining LlaMA-3.1-7B-Instruct and then continued fine-tuning with QLoRA normally. After 2 epochs, I switched to Unsloth to continue the fine-tuning with a longer context (80…
-
### 🚀 The feature, motivation and pitch
## Motivation: Limitation of Existing Profiling Approach
To conduct PyTorch distributed training performance analysis, the currently recommended way is to profil…
-
I am using two machines, each with 4 GPUs. I run the command: accelerate launch --dynamo_backend no --machine_rank 0 --main_process_ip 192.168.68.249 --main_process_port 27828 --mixed_precision no --multi_gpu --num_machines 2 --num_processe…
-
I was trying to fine-tune Meta-Llama-3-8B-Instruct using 4 GPUs with the following command:
`torchrun --nproc_per_node 4 -m training.run --output_dir llama3test --model_name_or_path meta-llama/Met…
-
### Bug description
The DDP training gets stuck at the 1st iteration, and it is always waiting on a pid:
![image](https://github.com/user-attachments/assets/e85d5e39-a24e-41e0-8bea-bcaa004a3473)
os.waitpid()…
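As a first diagnostic step (an assumption about typical DDP-hang debugging, not a fix for this specific report), a hedged sketch of turning on verbose NCCL / torch.distributed logging and a shorter collective timeout so a stuck rank fails loudly instead of waiting indefinitely:

```python
# Diagnostic sketch only: enable debug logging before initializing DDP.
# These env vars are often set in the launch environment instead; shown in
# Python here for brevity, before any process group is created.
import datetime
import os

import torch.distributed as dist

os.environ.setdefault('NCCL_DEBUG', 'INFO')                  # NCCL-level logs
os.environ.setdefault('TORCH_DISTRIBUTED_DEBUG', 'DETAIL')   # extra DDP checks

dist.init_process_group(
    backend='nccl',
    timeout=datetime.timedelta(minutes=5),  # fail fast instead of hanging
)
```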
-
Hi, I'm using the tutorial https://github.com/pytorch/tutorials/blob/master/intermediate_source/ddp_tutorial.rst for DDP training with 4 GPUs in my own code, following the Basic Use Case. But when I …
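For readers following along, here is a condensed, hedged sketch of the "Basic Use Case" pattern from the linked DDP tutorial; the toy model, port, and world size are illustrative details, not the asker's actual code.

```python
# Condensed DDP basic-use-case sketch: one spawned process per GPU.
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def demo_basic(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group('nccl', rank=rank, world_size=world_size)

    model = nn.Linear(10, 5).to(rank)           # toy model on this rank's GPU
    ddp_model = DDP(model, device_ids=[rank])

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.001)
    outputs = ddp_model(torch.randn(20, 10).to(rank))
    loss = nn.functional.mse_loss(outputs, torch.randn(20, 5).to(rank))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    dist.destroy_process_group()


if __name__ == '__main__':
    world_size = 4                               # one process per GPU
    mp.spawn(demo_basic, args=(world_size,), nprocs=world_size, join=True)
```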