-
**Describe the bug**
The Tempo metrics-generator always fails to push to the Prometheus push gateway, with a snappy error:
`ts=2024-07-30T08:40:27.386622385Z caller=dedupe.go:112 tenant=single-tenant co…
-
**Describe the bug**
During a PPO actor training run with TensorRT enabled, an error occurred during the validation checkpointing process. The training was conducted using the Tensor…
-
In #78 it was argued that the elevated-voltage related update won't be distributed via this repository to avoid confusion (as it can't be applied from the OS). However, independent manufacturers and f…
-
### 🐛 Describe the bug
**Environment**
- PyTorch 2.4.0
- Kubernetes, launching a distributed PyTorchJob (1 master and 1 worker replica) with the Kubeflow Training Operator
- Launching with `torchrun…
-
On `PP + FSDP` and `PP + TP + FSDP`:
- Is there any documentation on how these different parallelisms compose?
- What are the largest training runs these strategies have been tested on?
- Are there…
-
### Describe the Bug
see log below
### Steps to reproduce
executed the exe file
### Relevant log output
```shell
{"level":"fatal","time":"2024-08-22T09:05:22.567+0100","message":"Error cloning r…
```
-
Hi,
I recently came across an issue when using context parallelism to split long sequences with NeMo and Transformer Engine. Context parallelism splits the sequence length across GPUs and uses p…
-
I wonder how to run it on a multi-GPU machine to accelerate its training?
-
The modpack that we've been making is distributed to other players through a MultiMC instance .zip that we want to be as reproducible as possible. Of course, offering the ability to expor…
-
### Is there an existing issue for this?
- [X] I have searched the existing issues
### Version
equal or higher than v1.16.0 and lower than v1.17.0
### What happened?
I want to have a si…