-
How can we synchronize files that are written during multi-node training?
* At the end of training, each node reads the file in question and turns it into a byte tensor
* Synchronize the tensor length, com…
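A minimal sketch of the byte-tensor gather described in the bullets above, assuming `torch.distributed` is already initialized with a gloo backend (with NCCL the tensors would have to be moved to the GPU first); the `gather_file_bytes` helper, file names, and the rank-0 write-out are illustrative, not part of the original question.

```python
import torch
import torch.distributed as dist

def gather_file_bytes(path):
    """Read a local file and gather every rank's copy as raw bytes."""
    with open(path, "rb") as f:
        data = torch.tensor(bytearray(f.read()), dtype=torch.uint8)

    world_size = dist.get_world_size()

    # 1) synchronize the tensor lengths so every rank knows how much to expect
    local_len = torch.tensor([data.numel()], dtype=torch.long)
    lens = [torch.zeros(1, dtype=torch.long) for _ in range(world_size)]
    dist.all_gather(lens, local_len)

    # 2) pad to the maximum length and gather the payloads themselves
    max_len = int(max(l.item() for l in lens))
    padded = torch.zeros(max_len, dtype=torch.uint8)
    padded[: data.numel()] = data
    gathered = [torch.zeros(max_len, dtype=torch.uint8) for _ in range(world_size)]
    dist.all_gather(gathered, padded)

    # 3) trim the padding off again using the synchronized lengths
    return [bytes(t[: int(l.item())].tolist()) for t, l in zip(gathered, lens)]

# collective call: every rank must enter gather_file_bytes, only rank 0 writes
parts = gather_file_bytes("node_local.log")
if dist.get_rank() == 0:
    with open("merged.log", "wb") as out:
        for chunk in parts:
            out.write(chunk)
```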
-
Thank you for your excellent work. You used a single V100 GPU for training. Will the program support distributed training? We are trying to use multiple 4090 GPUs on the same machine to repeat the e…
-
When I use DeepSpeed for distributed training, I find that a lot of time is spent in forward_microstep and backward_microstep. Is there any way to improve training efficiency?
-
Hello,
I came across your work, and was wondering whether loading and training models on multiple GPUs was possible.
I saw in the YOLOv7 repo that it was possible with the following command line…
-
### Describe the bug
I tried to use accelerate + DeepSpeed to train Flux, but every time, after a dozen steps, an error occurs and the program crashes. Can anyone provide some help?
### Reproduction
…
-
### 🐛 Describe the bug
TRAINING_SCRIPT.py
```
import torch.distributed as dist

def main():
    # set up the default process group; env:// reads MASTER_ADDR/MASTER_PORT, RANK and WORLD_SIZE from the environment
    dist.init_process_group("nccl", init_method='env://')
    .......

if __name__ == "__main__":
    main()
```
When I run this …
-
I recently began contributing to KataGo distributed training. I noticed that the network is trained on strange initial board/komi conditions and is run with low visit counts. Is the strange initia…
-
### 🐛 Describe the bug
When I use `decorate_context` to convert a context manager into a decorator, I only ever see the generic `decorate_context` in stack traces. This sucks, because different context…
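A hypothetical minimal reproduction of the pattern described above, not the actual torch implementation: a context manager turned into a decorator via a generic inner wrapper named `decorate_context`. The `my_mode` and `fail` names are made up for illustration; the point is that the traceback frame keeps the wrapper's name no matter which context manager was applied.

```python
import functools
import traceback

class my_mode:
    """Toy context manager that can also be used as a decorator."""
    def __enter__(self):
        return self
    def __exit__(self, *exc):
        return False
    def __call__(self, func):
        @functools.wraps(func)
        def decorate_context(*args, **kwargs):
            # a fresh context instance wraps every call made through the decorator
            with self.__class__():
                return func(*args, **kwargs)
        return decorate_context

@my_mode()
def fail():
    raise RuntimeError("boom")

try:
    fail()
except RuntimeError:
    # the wrapper frame is always reported as `decorate_context`,
    # regardless of which context manager produced it
    traceback.print_exc()
```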
-
### System Info
```Shell
- `Accelerate` version: 0.34.0
- Platform: Linux-5.15.0-117-generic-x86_64-with-glibc2.17
- `accelerate` bash location: /home/miao/anaconda3/envs/train/bin/accelerate
- Py…
-
Given the sheer amount of data we have, people might want to train in a distributed manner. We need to test and make sure our dataset is compatible with a distributed training framework like `PyTo…
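A minimal sketch of such a compatibility check, assuming the framework is PyTorch DDP with a map-style dataset; `OurDataset`, the file name, and the `torchrun` launch line are placeholders rather than the project's actual code.

```python
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, Dataset, DistributedSampler

class OurDataset(Dataset):
    """Placeholder standing in for the project's real dataset class."""
    def __init__(self):
        self.items = list(range(1000))
    def __len__(self):
        return len(self.items)
    def __getitem__(self, idx):
        return torch.tensor(self.items[idx], dtype=torch.float32)

def main():
    # launched with e.g. `torchrun --nproc_per_node=4 check_dataset.py`
    dist.init_process_group("nccl")
    dataset = OurDataset()
    # DistributedSampler shards the indices so each rank sees a disjoint slice per epoch
    sampler = DistributedSampler(dataset, shuffle=True)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler, num_workers=2)
    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle consistently across ranks each epoch
        for batch in loader:
            pass  # replace with the actual forward/backward pass
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```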