-
Hi,
I'm just wondering if there is a potential issue with all-reduce ordering when both data parallelism and tensor model parallelism are enabled during training. With **torch DDP**, both tensor model …
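For context, here is a minimal sketch of how the two kinds of process groups are typically set up with plain `torch.distributed`; the group layout (4 ranks, tensor-parallel size 2) and the helper name are assumptions for illustration, not this repo's code:

```python
# Hypothetical layout: 4 ranks, tensor parallel size 2, data parallel size 2.
# Every rank must call new_group() for every group, in the same order.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def init_parallel_groups(tp_size: int):
    dist.init_process_group(backend="nccl")
    rank, world = dist.get_rank(), dist.get_world_size()
    dp_size = world // tp_size

    tp_group = dp_group = None
    # Tensor-parallel groups: consecutive ranks that hold shards of one model replica.
    for i in range(dp_size):
        ranks = list(range(i * tp_size, (i + 1) * tp_size))
        g = dist.new_group(ranks)
        if rank in ranks:
            tp_group = g
    # Data-parallel groups: ranks holding the same shard, strided by tp_size.
    for i in range(tp_size):
        ranks = list(range(i, world, tp_size))
        g = dist.new_group(ranks)
        if rank in ranks:
            dp_group = g
    return tp_group, dp_group

# DDP then only all-reduces gradients within the data-parallel group, so its
# bucketed all-reduces are separate from the tensor-parallel all-reduces issued
# inside the forward/backward pass.
# tp_group, dp_group = init_parallel_groups(tp_size=2)
# model = DDP(model.cuda(), process_group=dp_group)
```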
-
Another feature idea is for the case where we allow the user to upload their model, which we then train for them and evaluate on a chosen benchmark with a chosen metric.
We can think of ways to use op…
-
Thanks for this work.
I was trying to train the model using the conda environment:
```
pytorch 2.1.2 py3.11_cuda11.8_cudnn8.7.0_0 pytorch
pytorch-cuda …
-
As the title says, I'm having problems running the example code, which is given here: [Multi-GPU distributed training with PyTorch](https://keras.io/guides/distributed_training_with_torch/)
![image…
-
A user requested an example of running distributed training with [accelerate](https://huggingface.co/docs/accelerate/basic_tutorials/launch).
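As a starting point, a minimal sketch of what such an example might look like; the script name `train.py`, the toy model, and the synthetic data are placeholders:

```python
# train.py -- hedged sketch of a minimal accelerate training loop.
# Launch (after running `accelerate config`) with:
#   accelerate launch train.py
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()

model = torch.nn.Linear(16, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
dataset = TensorDataset(torch.randn(1024, 16), torch.randn(1024, 1))
loader = DataLoader(dataset, batch_size=32, shuffle=True)

# prepare() wraps the model/optimizer/dataloader for whatever configuration was
# launched (single GPU, DDP, mixed precision, ...) without changing the loop.
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for epoch in range(3):
    for x, y in loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), y)
        accelerator.backward(loss)  # replaces loss.backward()
        optimizer.step()
    accelerator.print(f"epoch {epoch} loss {loss.item():.4f}")
```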
-
### 🐛 Describe the bug
code:
```python
from torchtext.vocab import build_vocab_from_iterator
import torchtext
from typing import Iterable, List
import random
import os
import torch
from tqdm …
-
Instead of using our own task pool, we should leverage Dask distributed, as this will let us make better use of resources on existing clusters.
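A rough sketch of what that could look like; the scheduler address and the task function are placeholders:

```python
from dask.distributed import Client, as_completed

def run_task(task_id: int) -> int:
    # Placeholder for whatever a pool worker currently does.
    return task_id * task_id

# Connect to an existing cluster's scheduler instead of managing our own pool;
# Client() with no address would spin up a local cluster for development.
client = Client("tcp://scheduler.example.com:8786")

futures = [client.submit(run_task, i) for i in range(100)]
for future in as_completed(futures):
    print(future.result())
```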
-
I found that KataGo conducts self-play and then generates a large number of rows, which are then uploaded. What is this data used for? It doesn't seem to be doing backward propagation like t…
-
Hello, I want to ask how to run MAE pretraining with multi-node, multi-GPU distributed training over the network.
Can you provide a script?
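Not an official script, but a sketch of the usual multi-node launch pattern with `torchrun`; the node count, GPUs per node, rendezvous endpoint, and the `main_pretrain.py` entry point are placeholders to adapt to this repo:

```python
# Hedged sketch: run the same command on every node, e.g.
#   node 0: torchrun --nnodes=2 --nproc_per_node=8 --node_rank=0 \
#             --rdzv_backend=c10d --rdzv_endpoint=node0.example.com:29500 main_pretrain.py
#   node 1: torchrun --nnodes=2 --nproc_per_node=8 --node_rank=1 \
#             --rdzv_backend=c10d --rdzv_endpoint=node0.example.com:29500 main_pretrain.py
# torchrun sets RANK / LOCAL_RANK / WORLD_SIZE for each process, so inside the
# script the usual initialization is:
import os
import torch
import torch.distributed as dist

def setup_distributed() -> int:
    dist.init_process_group(backend="nccl")  # reads the env vars set by torchrun
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    return local_rank

# local_rank = setup_distributed()
# model = torch.nn.parallel.DistributedDataParallel(model.cuda(), device_ids=[local_rank])
```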
-
Hi,
thanks for your work, guys!
I am trying to explore using your implementation for our use case, but I am a bit stuck on how you would deal with cases where the training set is too big to fit in…
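If the question is about a dataset that does not fit in memory, one generic PyTorch-side approach (separate from this implementation, and assuming the data can be pre-split into shard files) is to stream samples with an `IterableDataset`; the shard paths and file format below are placeholders:

```python
import torch
from torch.utils.data import IterableDataset, DataLoader, get_worker_info

class ShardedDataset(IterableDataset):
    """Streams samples shard by shard instead of loading everything into memory."""

    def __init__(self, shard_paths):
        self.shard_paths = shard_paths  # e.g. paths to .pt files on disk (placeholder)

    def __iter__(self):
        info = get_worker_info()
        # Give each DataLoader worker a disjoint subset of shards.
        paths = self.shard_paths if info is None else self.shard_paths[info.id::info.num_workers]
        for path in paths:
            for sample in torch.load(path):  # assumes each shard is a list of samples
                yield sample

# loader = DataLoader(ShardedDataset(["shard_0.pt", "shard_1.pt"]),
#                     batch_size=32, num_workers=2)
```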