-
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 13077 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 13076) …
-
### Contact Details
_No response_
### Is there an existing issue for this?
- [X] I have searched all the existing issues
### Is your feature request related to a problem? Please describe.
…
-
States cannot be saved during distributed training: https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/kvstore.py#L538-L550
-
For the model I am training, I rely on a custom [Sampler](https://pytorch.org/docs/stable/data.html#torch.utils.data.Sampler) that returns variable batch sizes. My task at hand is translation, …
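A minimal sketch of what such a sampler could look like, assuming batches are formed under a per-batch token budget and that per-example lengths are known up front (the class name, `lengths` argument, and `max_tokens` default are illustrative, not the original code):

```python
from torch.utils.data import Sampler

class TokenBudgetBatchSampler(Sampler):
    """Groups example indices so each batch stays under a token budget,
    which makes the number of sentences per batch variable."""

    def __init__(self, lengths, max_tokens=4096):
        # Sort by length so padding waste inside a batch stays small.
        order = sorted(range(len(lengths)), key=lambda i: lengths[i])
        self.batches, batch, tokens = [], [], 0
        for idx in order:
            if batch and tokens + lengths[idx] > max_tokens:
                self.batches.append(batch)
                batch, tokens = [], 0
            batch.append(idx)
            tokens += lengths[idx]
        if batch:
            self.batches.append(batch)

    def __iter__(self):
        return iter(self.batches)

    def __len__(self):
        return len(self.batches)
```

Such a sampler would typically be passed to `DataLoader` via the `batch_sampler` argument together with a padding `collate_fn`.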
-
I am using a custom dataset where the data is loaded from disk in the `__init__` function of the dataset. But I found that the data is loaded n times if I use n GPUs (which also means the `num_processe…
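For context, under `DistributedDataParallel` each GPU runs its own process, so the dataset's `__init__` (and any eager loading inside it) executes once per process. One common workaround, sketched here under the assumption that the data can be stored as a single `.npy` file (the path and format are hypothetical), is to memory-map the file and read samples lazily in `__getitem__` instead of loading everything eagerly:

```python
import numpy as np
from torch.utils.data import Dataset

class LazyDiskDataset(Dataset):
    """Keeps only a memory-mapped handle in __init__, so spawning one
    process per GPU does not copy the whole dataset into RAM n times."""

    def __init__(self, path):
        # mmap_mode="r" opens the file without reading it into memory.
        self.data = np.load(path, mmap_mode="r")

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Only the requested row is actually read from disk here.
        return np.array(self.data[idx])
```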
-
### System Info
- `transformers` version: 4.45.0.dev0
- Platform: Linux-4.18.0-477.10.1.el8_8.x86_64-x86_64-with-glibc2.28
- Python version: 3.11.5
- Huggingface_hub version: 0.24.0
- Safetenso…
-
If I write my own multi-GPU model or use `torch.distributed.pipeline.sync.Pipe`, would multi-node training still work with byteps?
-
We need to add NCCL support as a backend/implementation of the Communicator abstraction, which will provide all the functionality required for synchronous distributed SameDiff training.
-
Does DeepSpeed support fine-tuning an extra model with LoRA?
-
I see that FATE has some wrappers around torch's nn modules, including classes like Sequential, and I also saw an LSTM model. But how do I use it? The LSTM's output is a tuple, so it can't be added directly into Sequential, can it?
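One general PyTorch pattern (not FATE-specific; whether FATE's wrapped Sequential accepts a custom module like this is an assumption) is to wrap `nn.LSTM` in a small module that discards the state tuple and returns only the output tensor, so it can be placed inside `Sequential`:

```python
import torch
from torch import nn

class LSTMOutputOnly(nn.Module):
    """Wraps nn.LSTM and returns only the output tensor, dropping the
    (h_n, c_n) state tuple, so the module fits inside nn.Sequential."""

    def __init__(self, input_size, hidden_size, **kwargs):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True, **kwargs)

    def forward(self, x):
        output, _ = self.lstm(x)   # discard the hidden/cell states
        return output

model = nn.Sequential(
    LSTMOutputOnly(input_size=16, hidden_size=32),
    nn.Linear(32, 1),              # applied to every time step's output
)

x = torch.randn(4, 10, 16)         # (batch, seq_len, features)
print(model(x).shape)              # torch.Size([4, 10, 1])
```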