-
I'm starting a bit of research and looking for advice/insight into expanding this issue into a full-featured spec for creating a distributed Polkadot validator cluster. This would be similar to what […
-
I would like to be able to specify the cookie used when connecting to a certain remote node.
My use case: I have a LAN setup with two separate, distributed applications running. I am currently usin…
-
**Why is this needed**:
Community questions are becoming more frequent about how to make use of Tempo in their distributed tracing system, and
specifically the integrations that exist within Grafa…
-
My scheduler is not reachable from the public web; I use a SOCKS5 proxy to reach it. The reason is that I'm limited in the number of public IPs I can have at one time. To perform my task, I'm using…
-
### 🐛 Describe the bug
I'm training a VQGAN model, and there is a forward operation that does an allreduce across the batch to get an estimate of the data distribution. It successfully ran for hours and han…
-
I work on a server running JupyterHub and have access to a PBS cluster; both machines have the same Python environments.
Right now I do the following (manual work):
1. I start workers on the cluste…
-
## 🐛 Bug
`torch.distributed.nn.all_reduce` computes different gradient values from `torch.distributed.all_reduce`. In particular, it seems to scale the gradients by `world_size` incorrectly.
## …
-
Currently the Backplane feature is only available for Redis cache.
Would it be much work to get the same setup available when using SQL Server as the distributed cache?
-
When I try to run data parallel on a single machine with 2 GPUs, the following error occurs.
```
NCCL version 2.7.8+cuda11.0
xxxxx:2573:2612 [1] graph/xml.cc:332 NCCL WARN Could not find real pat…
-
### 🚀 The feature, motivation and pitch
When using `TORCH_DISTRIBUTED_DEBUG=DETAIL`, we collect fingerprints of collectives, which are quite helpful when troubleshooting issues like stragglers.
One…