-
Hi!
As recently discussed in #145 and #144 with @Xiaoming-Zhao (and as I had already mentioned in https://github.com/atong01/conditional-flow-matching/pull/116#discussion_r1695722539), I/we believe…
-
Hello,
I have been training a model with distributed PyTorch using the Hugging Face Trainer API. I am now training on Slurm with multiple nodes and multiple GPUs, and every GPU logs its own run in the MLflow UI. Is th…
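A common way to avoid one MLflow run per GPU in multi-node Slurm jobs is to let only the global rank-0 process log. The sketch below is a minimal illustration, not the Trainer's internals: it reads the `RANK` variable set by launchers such as `torchrun` (falling back to Slurm's `SLURM_PROCID`), and the MLflow calls are shown only as commented placeholders.

```python
import os

def is_global_rank_zero() -> bool:
    """True only on the single process that should log to MLflow."""
    rank = os.environ.get("RANK", os.environ.get("SLURM_PROCID", "0"))
    return int(rank) == 0

if is_global_rank_zero():
    # Only this process talks to the tracking server, e.g.:
    # import mlflow
    # mlflow.log_metric("loss", loss_value, step=step)
    pass
```

With the Hugging Face Trainer, the analogous switch is `TrainingArguments(report_to=[])`, which disables the logging integrations entirely; per-GPU runs in the UI usually indicate that each process is creating its own MLflow run instead of deferring to the main process.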
-
## ❓ Questions and Help
Hi!
We are trying to train Gemma-2-9B on v4-64 and v5-128 Pods as mentioned in [this comment](https://github.com/pytorch/xla/issues/7987#issuecomment-2352326629). We use FS…
-
## 🐛 Bug
As can be seen below, Thunder is slower than torch.compile on Phi-3-mini-4k-instruct, both with DDP and with FSDP (ZeRO-2).
![image](https://github.com/user-attachments/assets/f119fcb0-f5b3-4338…
-
### Description
Hello everyone,
I'm a newcomer to T2T and TensorFlow. I tried to use T2T to run the transformer_moe model on 2 machines, but it failed. Each machine has only one GPU. I hope you could help…
-
- **Contributor**: @sandipanpanda - [LinkedIn](https://linkedin.com/in/sandipanpanda)
- **Mentors**: @tenzen-y, @andreyvelich, @terrytangyuan, @shravan-achar
- **Organization**: [Kubeflow](https://w…
-
I have a problem training the model on my own dataset when using distributed mode. I want to train the model on 2 GPUs, and the message I get is:
RuntimeError: Expected to have finished reduction …
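This RuntimeError most often means some model parameter never received a gradient in the previous iteration (e.g. a submodule that `forward()` skips), which stalls DDP's gradient-bucket reduction. A minimal single-process sketch of the usual fix, `find_unused_parameters=True`, is below; the toy model and shapes are illustrative assumptions, and the "distributed" group has world size 1 so it runs on one CPU.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

class Net(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.used = torch.nn.Linear(4, 1)
        self.unused = torch.nn.Linear(4, 1)  # never called in forward()

    def forward(self, x):
        return self.used(x)  # self.unused gets no gradient

# Single-process process group so the sketch is runnable on one CPU.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

# Without find_unused_parameters=True, the skipped `self.unused`
# parameters would trigger the reduction RuntimeError on a later step.
model = DDP(Net(), find_unused_parameters=True)
loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()

dist.destroy_process_group()
```

Note that `find_unused_parameters=True` adds per-iteration overhead; if possible, removing the unused submodule from the model is the cleaner fix.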
-
We should update [the Training Operator ROADMAP](https://github.com/kubeflow/training-operator/blob/master/ROADMAP.md) with 2024 work items.
Let's discuss it during [the upcoming Training WG calls]…
-
-
I want to train VGG16_ImageNet_Distributed.py on multiple nodes using mpiexec (two GPUs per node),
so I followed the instructions at https://docs.microsoft.com/en-us/cognitive-toolkit/Multiple-GPUs-and-…