-
The code does not seem to support multi-GPU training. I found parts of the code that appear intended to support it, but it does not seem to work. Can you fix it?
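For illustration, here is a minimal sketch of what multi-GPU support via PyTorch `DistributedDataParallel` usually looks like; the model, data, and training loop below are placeholders, not the repository's actual code.

```python
# Minimal DDP sketch (placeholder model/data).
# Launch with: torchrun --nproc_per_node=<num_gpus> train_ddp.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for every process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(128, 10).cuda(local_rank)   # placeholder model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    for step in range(10):                               # placeholder training loop
        x = torch.randn(32, 128, device="cuda")
        y = torch.randint(0, 10, (32,), device="cuda")
        loss = torch.nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()                                  # DDP all-reduces gradients here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```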
-
### Please describe your problem in detail
I'm trying to start a PyTorch training job using Volcano and its PyTorch plugin. I have 2 nodes, each with 8 GPUs.
I found that Volcano sets WORLD_SIZE = 2, RANK …
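For context, a minimal sketch of one workaround when a scheduler exports node-level values: recompute the global world size and rank before calling `init_process_group`. Treating WORLD_SIZE as the node count and RANK as the node index is an assumption based on the report above, and the `GPUS_PER_NODE`/`LOCAL_RANK` handling is illustrative rather than documented Volcano behavior.

```python
# Sketch: derive a global world size/rank when the scheduler only provides
# per-node values. MASTER_ADDR/MASTER_PORT are assumed to be set already.
import os
import torch
import torch.distributed as dist

def init_distributed(gpus_per_node: int = 8):
    num_nodes = int(os.environ.get("WORLD_SIZE", 1))   # assumed: node count from the scheduler
    node_rank = int(os.environ.get("RANK", 0))         # assumed: node index from the scheduler
    local_rank = int(os.environ.get("LOCAL_RANK", 0))  # assumed: set per process by the launcher

    global_world_size = num_nodes * gpus_per_node
    global_rank = node_rank * gpus_per_node + local_rank

    torch.cuda.set_device(local_rank)
    dist.init_process_group(
        backend="nccl",
        world_size=global_world_size,
        rank=global_rank,
    )
    return global_rank, global_world_size
```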
-
**How to customise train.sh for distributed Mamba training?**
Hello,
As I've seen in the Megatron modules, there isn't a pre-defined bash script to pre-train a Mamba model on multiple GPUs, ho…
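A hedged sketch of the launch pattern such a script would wrap, written in Python for consistency with the other sketches here: one `torchrun` invocation per node. The entry point `pretrain_mamba.py` and the Megatron-style model arguments are assumptions, not a verified recipe.

```python
# Sketch: build a per-node torchrun command for a Megatron-style Mamba
# pretraining run. Only the torchrun flags are standard; the script name
# and model arguments are assumed placeholders.
import os
import subprocess

NNODES = int(os.environ.get("NNODES", "2"))
NODE_RANK = int(os.environ.get("NODE_RANK", "0"))
GPUS_PER_NODE = int(os.environ.get("GPUS_PER_NODE", "8"))
MASTER_ADDR = os.environ.get("MASTER_ADDR", "localhost")
MASTER_PORT = os.environ.get("MASTER_PORT", "6000")

cmd = [
    "torchrun",
    f"--nproc_per_node={GPUS_PER_NODE}",
    f"--nnodes={NNODES}",
    f"--node_rank={NODE_RANK}",
    f"--master_addr={MASTER_ADDR}",
    f"--master_port={MASTER_PORT}",
    "pretrain_mamba.py",                    # assumed entry point; adjust to the actual script
    "--tensor-model-parallel-size", "1",    # assumed Megatron-style arguments
    "--micro-batch-size", "1",
]
subprocess.run(cmd, check=True)
```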
-
**Describe the bug**
While saving a Mamba-based model, the distributed optimizer reports an error during validation about `dt_bias`.
**To Reproduce**
Start training Mamba and run it for a few steps.
**Exp…
-
Does Q-Galore work with FSDP or DeepSpeed (DS)?
-
Hi, I have a script that runs with the DataParallel trainer on a machine with 8 H100 GPUs (an AWS p5 VM) with DeepSpeed. When we run the script, it randomly gets stuck forever at some iteration r…
-
This is to add the following topics to the HPC Handbook, including code examples with easy-to-set-up experiments (a minimal sketch follows the list):
- GPU Computing
- Distributed GPU Computing
- Distributed Training
- Distributed Inferenc…
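As one possible easy-to-set-up starter for the distributed topics, a minimal all-reduce hello-world sketch; the script name and launch command are placeholders.

```python
# Minimal distributed "hello world": each process puts its rank on a GPU
# tensor and all-reduces it, so every process ends up with the sum of ranks.
# Launch with: torchrun --nproc_per_node=<num_gpus> allreduce_demo.py
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")   # torchrun provides rank/world-size env vars
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    t = torch.tensor([float(rank)], device="cuda")
    dist.all_reduce(t, op=dist.ReduceOp.SUM)   # sum across all processes
    print(f"rank {rank}/{world_size}: all-reduced value = {t.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```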
-
Hi,
I would like to test a program for distributed LLM training on an mi2508x system, and I want to use model parallelism to distribute parameters across GPUs. Is there a framework I should use to ac…
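One commonly used option, offered here only as an assumption about what might fit, is PyTorch FSDP, which shards parameters across GPUs; DeepSpeed and Megatron-LM are other frameworks that distribute parameters. A minimal sketch with a placeholder model:

```python
# Sketch: shard a toy model's parameters across GPUs with PyTorch FSDP.
# The model and sizes are placeholders, not an LLM training setup.
# Launch with: torchrun --nproc_per_node=<num_gpus> fsdp_demo.py
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Sequential(               # placeholder model
        torch.nn.Linear(1024, 4096),
        torch.nn.GELU(),
        torch.nn.Linear(4096, 1024),
    ).cuda(local_rank)
    model = FSDP(model)                        # parameters are sharded across ranks

    x = torch.randn(8, 1024, device="cuda")
    loss = model(x).sum()
    loss.backward()                            # gradients reduced/sharded by FSDP

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```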
-
Hello! Thank you so much for your work.
I would like to ask whether removing distributed training has any effect on model training.
Thank you!
-