-
### What Happened?
Good example of pods running on different nodes. However, I think the deployment .yaml file needs more explanation about affinity and the fact that the Target port = http sha…
-
Hi,
Let's say I have a Slurm cluster with 100 nodes, each with 100 cores, and I have 10000 tasks.
This is my current code:
```
dist_executor = SlurmPipelineExecutor(
…
```
-
rdzv fails to work with a SkyPilot multi-node cluster (probably on k8s).
https://github.com/stas00/ml-engineering/blob/master/network/benchmarks/all_reduce_bench.py
_Version & Commit info:_…
-
When deploying on a multi-node cluster (EKS in my case but I guess it could be any other), there's a PVC clash between the model store and the model pod.
The model pod gets this error and so it canno…
-
Hi,
Based on the Zarf documentation, my understanding is that the init package can deploy single-node k3s clusters on the local box.
This request is for:
- support for multi-node k3s cluster …
-
**Is your feature request related to a problem? Please describe.**
I have an Elasticsearch cluster with no load balancer, so I must specify more than one host when creating a client.
**Describe th…
-
When starting up a multi-node cluster, you can have all of the agents running but 'get-nodes' will only report the local agent/controller combo on the manager. You have to deploy a config (even a bla…
-
### Multi-node TPU Training with JAX
The [multi-GPU JAX training guide](https://keras.io/guides/distributed_training_with_jax/) is helpful, but it's unclear how to extend this to multi-node TPU set…
-
### SynapseML version
com.microsoft.azure:synapseml_2.12:0.11.4-spark3.3
### System information
- **Language version** (e.g. python 3.8, scala 2.12): python 3.9
- **Spark Version** (e.g. 3.2.3): 3…
-
**Describe the bug**
Once **Setting:support-bundle-image** has been configured, it fails to reset to its default value.
**To Reproduce**
Steps to reproduce the behavior:
1. Go to 'Settings' -> support-bundle-…