-
### Description
In a multi-node cluster, when deleting a DevWorkspace that uses the per-user/common PVC strategy, the PVC cleanup pod may be scheduled on a node that is different than…
-
### 🚀 The feature, motivation and pitch
We need a tutorial/doc on setting up a multi-node, multi-GPU environment for a Slurm cluster. https://github.com/aivanou/disttraining/tree/main/slurm is…
-
**Describe the Bug**
A multicluster project cannot create its namespace in the member cluster.
I created a multicluster project named test-multi-cluster.
On the member cluster, I found that the namespace cannot…
-
Hi, we are currently using an AWS NLB that routes packets to the worker nodes of an EKS cluster, and we get socket connection errors whenever we scale pods up or down. The AWS team has recommended using NLB-I…
-
Hello,
I am attempting to perform fine-tuning on a model using multiple nodes, each equipped with 8 A100 GPUs, and I'm encountering some difficulties. The implementation of Octo is based on JAX, an…
-
I am a principal scientist at a Korean astronomy institute, especially interested in applying big data technologies to astronomical problems.
I found two issues when trying to run eazypy on my Spark …
-
**Is this a BUG REPORT or FEATURE REQUEST?**:
/kind bug
**What happened**:
The current way to specify the list of zones where a volume would be available is to specify a JSON blob as a value of…
-
I'm running 22.6.1 from a pip install in a venv and cannot get the MQTT transport working; see the crash log below.
=====
```
2022-09-08T10:06:50+0200 [Router 17775] Unhandled Error
Traceback (mos…
```
-
Hello! I keep running into OOM errors when extrapolating a 72B model on 64 GPUs. Is something misconfigured in multi_node.yaml?
multi_node.yaml
`debug: false
deepspeed_config:
deepspeed_config_file: utils/accelerate_configs/zero3_offload.json
deepspeed_multinode_lau…
-
Setup:
1 master node
2 worker nodes
Image used: vhiveease/vhive_dev_env
Error:
Knative function containers fail to boot. Status loops between [Error, CrashLoopBackOff, Terminating].
Error log:
…