-
### Bug description
When running multi-node/multi-GPU training with a different number of GPUs on each node, the `Fabric` `ddp` and `fsdp` strategies will have an incorrect `num_replicas` in `distributed_sampler_kwar…
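
A minimal sketch of the mismatch described here, assuming (this excerpt does not confirm it) that the replica count is derived as the number of nodes times the local process count. The helper names and the 4-GPU/2-GPU layout are hypothetical and are not part of the `Fabric` API.

```python
from typing import List


def assumed_num_replicas(num_nodes: int, local_devices: int) -> int:
    """Replica count if every node is assumed to have `local_devices` GPUs (assumption)."""
    return num_nodes * local_devices


def actual_world_size(devices_per_node: List[int]) -> int:
    """True number of processes when nodes have different GPU counts."""
    return sum(devices_per_node)


# Hypothetical cluster: node 0 has 4 GPUs, node 1 has 2 GPUs.
devices_per_node = [4, 2]

# What each node would compute from its own local device count.
per_node_view = [
    assumed_num_replicas(len(devices_per_node), d) for d in devices_per_node
]
world_size = actual_world_size(devices_per_node)

print(per_node_view)  # the nodes disagree with each other
print(world_size)     # and neither matches the real total
```

Under this assumption, neither node's value equals the true process count, so a sampler built from it would shard the dataset incorrectly.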
-
For the past few months I've been working on a program that needs all-to-all exchanges and Realm doesn't seem to perform distributed all-to-all communication efficiently. To understand what an efficie…
-
https://github.com/coreos/coreos-kubernetes/blob/master/multi-node/aws/pkg/config/templates/cloud-config-worker#L9
Here it seems that the docker service doesn't have a restart policy. I am sure I am miss…
-
**POD not ready with errors**
After a successful deployment of the OVA, the install does not complete and is stuck loading one of the pods.
![image](https://github.com/user-attachments/assets/5…
-
Hi, I have two datacenters, each with two nodes.
On each datacenter, total replicas: 1 (I need one replica per datacenter).
First DC:
leofs-adm status
[System config]
System…
-
### Proposal
When making a client join a cluster, I’d like the client to be ineligible to accept job allocations until I intervene manually.
### Use-cases
I'm building a user interface that will …
-
https://www.notion.so/cybnity/447-d01de61153714443ae8fc294300b773a
REQ_MAIN4: https://www.notion.so/cybnity/REQ_MAIN_4-8513483dd519412087185e24134453bc?pvs=4
As Clusterizable independent unit per tec…
-
### Proposal
I run a Nomad cluster with `amd64`, `arm64`, and `riscv64` nodes. If I try to use a Docker image that only supports `amd64`, Nomad will sometimes schedule it on one of the non-`a…
-
**What happened**:
With "--topology-manager-policy=single-numa-node" enabled on kubelet, creating a ReplicaSet (or other entity which automatically creates pods) resulted in hundreds of pods with a s…
-
## Problem
At present, the Stack Monitoring application does a good job of showing performance statistics for any individual node in an Elasticsearch cluster, but it is challenging to compare perfo…