-
### Bug description
When training a model using DDP and pl.callbacks.BackboneFinetuning, the model weights seem to start getting out of sync across processes after the backbone is unfrozen. Prio…
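For context, a minimal sketch of the kind of setup being described; the module, data, and unfreeze epoch below are illustrative assumptions, not the reporter's actual code:

```python
# Hedged repro sketch: BackboneFinetuning + DDP (all values are placeholders).
import torch
import torch.nn as nn
import pytorch_lightning as pl
from pytorch_lightning.callbacks import BackboneFinetuning
from torch.utils.data import DataLoader, TensorDataset


class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # BackboneFinetuning expects the LightningModule to expose `self.backbone`.
        self.backbone = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
        self.head = nn.Linear(64, 10)

    def forward(self, x):
        return self.head(self.backbone(x))

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.cross_entropy(self(x), y)

    def configure_optimizers(self):
        # Only the head is optimized at first; the callback adds the backbone
        # as a new param group when it unfreezes it.
        return torch.optim.SGD(self.head.parameters(), lr=1e-2)


if __name__ == "__main__":
    data = TensorDataset(torch.randn(256, 32), torch.randint(0, 10, (256,)))
    trainer = pl.Trainer(
        accelerator="gpu",
        devices=2,
        strategy="ddp",
        max_epochs=5,
        callbacks=[BackboneFinetuning(unfreeze_backbone_at_epoch=2)],
    )
    trainer.fit(LitModel(), train_dataloaders=DataLoader(data, batch_size=32))
```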
-
### Reminder
- [X] I have read the README and searched the existing issues.
### System Info
[2024-07-12 02:22:28,334] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda…
-
## Bug Description
I am running a distributed Linear model (20 parameters) across 2 GPU nodes, each with 2 NVIDIA H100 NVL GPUs. The model uses the DDP parallelization strategy. I am generating…
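A minimal sketch of that setup: a 20-parameter `nn.Linear(4, 4)` (16 weights + 4 biases) wrapped in DistributedDataParallel. The data and hyperparameters are placeholders, and each node would launch it with something like `torchrun --nnodes=2 --nproc_per_node=2 train.py`:

```python
# Hedged sketch of a tiny linear model trained with DDP across 2 nodes x 2 GPUs.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun provides RANK / WORLD_SIZE / LOCAL_RANK in the environment.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Linear(4, 4).cuda(local_rank)   # 4*4 weights + 4 biases = 20 params
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    for _ in range(10):
        x = torch.randn(8, 4, device=local_rank)   # synthetic batch per rank
        loss = model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()                            # DDP all-reduces gradients here
        opt.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```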
-
I am encountering issues when using non-element-wise optimizers such as Adam-mini with DeepSpeed.
The documentation reads:
> The FP16 Optimizer is designed to maximize the achievable…
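The general pattern for handing a client-built optimizer to DeepSpeed (rather than letting it construct one from the JSON config) looks roughly like the sketch below. `torch.optim.Adam` stands in for Adam-mini, whose exact constructor I have not verified, and the config values are placeholders; `deepspeed.initialize()` with a client optimizer is the standard API.

```python
# Hedged sketch: passing a client-defined (e.g. non-element-wise) optimizer to DeepSpeed.
import deepspeed
import torch
import torch.nn as nn

model = nn.Linear(1024, 1024)

# Stand-in optimizer; Adam-mini would be constructed here instead (its import
# path and arguments are not shown because I have not verified them).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "fp16": {"enabled": True},   # routes training through the FP16 optimizer wrapper quoted above
}

# DeepSpeed wraps the client optimizer instead of building one from the config.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config=ds_config,
)
```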
-
### Bug description
I'm trying to run a job with several GPUs. My script immediately gets stuck after outputting:
```
python /home/negroni/deeponet-fno/src/burgers/pytorch_deeponet.py --ngpus 3…
```
-
## ❓ Questions
Hi, when trying to reproduce the training based on your released code, I ran into an issue when using multiple GPUs to train: I find that [https://github.com/facebo…
-
Hi, thank you for your work. I tried to use 4 GPUs to reproduce the results and set the number of epochs to 1 for debugging. However, the process hangs when computing prototypes after the base step's training, and …
-
### Search before asking
- [X] I have searched the Ultralytics YOLO [issues](https://github.com/ultralytics/ultralytics/issues) and [discussions](https://github.com/ultralytics/ultralytics/discussion…
-
Hi, great job! I really appreciate your amazing work.
However, we have several 4080s cards that we are trying to accelerate training with. We just tested your fast cross-entropy kernel, but we are encoun…
-
### 🐛 Describe the bug
How can I use DDP training in diffusion? I saw train_ddp.yaml, but there is nothing different from train_colossalai.yaml. How do I set the number of GPUs and nodes, or t…
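In plain PyTorch DDP (independent of this repo's launcher, which I have not checked), the GPU and node counts are usually set by the launch command rather than in the YAML config. A generic sketch, with an illustrative `torchrun` line in the comments:

```python
# Hedged sketch: with torchrun, the world size comes from the launcher, e.g.
#   torchrun --nnodes=2 --nproc_per_node=4 --node_rank=0 \
#            --master_addr=10.0.0.1 --master_port=29500 \
#            train.py --config train_ddp.yaml
# (script name, config path, and addresses are placeholders).
import os
import torch.distributed as dist

dist.init_process_group(backend="nccl")
world_size = dist.get_world_size()           # nnodes * nproc_per_node
local_rank = int(os.environ["LOCAL_RANK"])   # GPU index on this node
print(f"rank {dist.get_rank()} / {world_size}, local GPU {local_rank}")
dist.destroy_process_group()
```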