-
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 13077 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 13076) …
-
### Contact Details
_No response_
### Is there an existing issue for this?
- [X] I have searched all the existing issues
### Is your feature request related to a problem? Please describe.
…
-
States cannot be saved during distributed training: https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/kvstore.py#L538-L550
-
For the model I am training, I rely on a custom [Sampler](https://pytorch.org/docs/stable/data.html#torch.utils.data.Sampler) that returns variable batch sizes. My task at hand is translation, …
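A minimal sketch of what such a sampler could look like, assuming batches are formed under a per-batch token budget and that per-example lengths are known up front (the class name, `lengths` argument, and `max_tokens` default are illustrative, not the original code):

```python
from torch.utils.data import Sampler

class TokenBudgetBatchSampler(Sampler):
    """Groups example indices so each batch stays under a token budget,
    which makes the number of sentences per batch variable."""

    def __init__(self, lengths, max_tokens=4096):
        # Sort by length so padding waste inside a batch stays small.
        order = sorted(range(len(lengths)), key=lambda i: lengths[i])
        self.batches, batch, tokens = [], [], 0
        for idx in order:
            if batch and tokens + lengths[idx] > max_tokens:
                self.batches.append(batch)
                batch, tokens = [], 0
            batch.append(idx)
            tokens += lengths[idx]
        if batch:
            self.batches.append(batch)

    def __iter__(self):
        return iter(self.batches)

    def __len__(self):
        return len(self.batches)
```

Such a sampler would typically be passed to `DataLoader` via the `batch_sampler` argument together with a padding `collate_fn`.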
-
I am using a custom dataset where the data is loaded from disk in the `__init__` function of the dataset. But I found that the data is loaded n times if I use n GPUs (which also means the `num_processe…
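For context, under `DistributedDataParallel` each GPU runs its own process, so the dataset's `__init__` (and any eager loading inside it) executes once per process. One common workaround, sketched here under the assumption that the data can be stored as a single `.npy` file (the path and format are hypothetical), is to memory-map the file and read samples lazily in `__getitem__` instead of loading everything eagerly:

```python
import numpy as np
from torch.utils.data import Dataset

class LazyDiskDataset(Dataset):
    """Keeps only a memory-mapped handle in __init__, so spawning one
    process per GPU does not copy the whole dataset into RAM n times."""

    def __init__(self, path):
        # mmap_mode="r" opens the file without reading it into memory.
        self.data = np.load(path, mmap_mode="r")

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Only the requested row is actually read from disk here.
        return np.array(self.data[idx])
```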
-
### System Info
- `transformers` version: 4.45.0.dev0
- Platform: Linux-4.18.0-477.10.1.el8_8.x86_64-x86_64-with-glibc2.28
- Python version: 3.11.5
- Huggingface_hub version: 0.24.0
- Safetenso…
-
If I write my own multi-GPU model or use `torch.distributed.pipeline.sync.Pipe`, would multi-node training still work with byteps?
-
We need to add NCCL support as a backend/implementation of the Communicator abstraction, which will provide all the functionality required for synchronous distributed SameDiff training.
-
Does DeepSpeed support fine-tuning an extra model with LoRA?
-
I see that FATE has some wrappers around torch's nn modules, including classes like Sequential, and I also saw an LSTM model. But how do I use it? The LSTM's output is a tuple, so it can't be added directly into Sequential, can it?
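One general PyTorch pattern (not FATE-specific; whether FATE's wrapped Sequential accepts a custom module like this is an assumption) is to wrap `nn.LSTM` in a small module that discards the state tuple and returns only the output tensor, so it can be placed inside `Sequential`:

```python
import torch
from torch import nn

class LSTMOutputOnly(nn.Module):
    """Wraps nn.LSTM and returns only the output tensor, dropping the
    (h_n, c_n) state tuple, so the module fits inside nn.Sequential."""

    def __init__(self, input_size, hidden_size, **kwargs):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True, **kwargs)

    def forward(self, x):
        output, _ = self.lstm(x)   # discard the hidden/cell states
        return output

model = nn.Sequential(
    LSTMOutputOnly(input_size=16, hidden_size=32),
    nn.Linear(32, 1),              # applied to every time step's output
)

x = torch.randn(4, 10, 16)         # (batch, seq_len, features)
print(model(x).shape)              # torch.Size([4, 10, 1])
```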