NVIDIA/deepops
Tools for building GPU clusters
BSD 3-Clause "New" or "Revised" License
1.25k stars, 326 forks
issues
DeepOps deploys 510.85.02 driver graphics card by default for both k8s and slurm
#1215
SupermicroML
closed
2 years ago
2
AutoDetect=nvml on gres.conf not working. Error "fatal: We were configured to autodetect nvml functionality, but we weren't able to find that lib when Slurm was configured"
#1214
anateshan
closed
2 years ago
5
How to update ood apps?
#1213
arnoldas500
closed
1 year ago
1
OpenOnDemand 2.0 releasing .deb files today
#1212
johrstrom
closed
1 year ago
7
verify all GPU nodes plug-ins in the Kubernetes cluster Fails
#1211
arnoldas500
closed
1 year ago
7
OOD install on Ubuntu 20.04 Issues
#1210
arnoldas500
closed
1 year ago
1
Move nccl test container from network operator directory to src/conta…
#1209
yangatgithub
closed
2 years ago
0
Update dependency versions for release 22.08
#1208
ajdecon
closed
2 years ago
0
cccccbcrkniuelkgcbfuunibvekffrlhujkjregndnlh
#1207
laurevelli
closed
2 years ago
0
Rootless docker not working with slurm
#1206
arnoldas500
closed
1 year ago
9
Pytorch multi-gpu example hangs with Kubeflow but works with straight Docker
#1205
cupdike
closed
1 year ago
2
Copy Kubectl to /usr/local/bin hanging
#1204
iamadrigal
closed
1 year ago
4
[Errno 13] Permission denied
#1203
anateshan
closed
2 years ago
2
Deepops vs NAT issues
#1202
andrevianadf
closed
2 years ago
5
Support for Ubuntu 22.04
#1201
ajdecon
closed
1 year ago
3
update kubernetes-sigs/kubespray link
#1200
elgalu
closed
2 years ago
0
Slurm GPU cluster failing to run task on more than 1 node
#1199
karanveersingh5623
closed
1 year ago
8
Add alertmanager for slurm cluster
#1198
0leaf
closed
2 years ago
1
Add alertmanager for slurm cluster
#1197
0leaf
closed
2 years ago
0
Convert gres.conf syntax from CPUs to Cores
#1196
ajdecon
closed
2 years ago
0
[Slurm] syntax in generated gres.conf is incorrect
#1195
ajdecon
closed
2 years ago
1
Update default Slurm version to 22.05.2
#1194
ajdecon
closed
2 years ago
1
Nvme storage mount point usage
#1193
junrae6454
closed
2 years ago
1
Ansible playbook Slurm Installation failed, slurm master fails to get nvidia-smi over ssh, OOB session shows the nvidia-output
#1192
karanveersingh5623
closed
2 years ago
4
Specify runtime_path partition size
#1191
seyong-um
closed
2 years ago
0
OOD internal server error
#1190
arnoldas500
closed
2 years ago
2
NFS mount protocol error
#1189
arnoldas500
closed
2 years ago
4
Documentation-updates-062022
#1188
tuttlebr
closed
2 years ago
1
Fix bugs preventing slurm reinstall or rebuild
#1187
biocyberman
closed
2 years ago
1
reboot after a nvidia-smi error
#1186
georgettica
closed
2 years ago
3
Switch to using official MetalLB helm repo
#1185
ajdecon
closed
2 years ago
0
Fix GPG key import for a large set of DeepOps roles
#1184
ajdecon
closed
2 years ago
0
Fix ssh disconnection on compute nodes
#1183
seyong-um
closed
2 years ago
2
Have same slurm.conf among nodes and controller
#1182
seyong-um
closed
2 years ago
1
Exposing Kubeflow Pipelines for Remote Access
#1181
cupdike
closed
2 years ago
1
Updated NCCL results with DGX A100s; MPI commands; NCCL container image name and location
#1180
yangatgithub
closed
2 years ago
0
Slurm multi nodes cluster installation failure
#1179
karanveersingh5623
closed
2 years ago
26
Adding new node in slurm DeepOPS cluster failing
#1178
karanveersingh5623
closed
2 years ago
3
MetalLB installation is failing
#1177
ajdecon
closed
2 years ago
1
Trident fix alt
#1176
jasonguy
closed
2 years ago
2
Misc version bumps
#1175
ajdecon
closed
2 years ago
0
Clarify the gpu_operator configuration flags and HOW to set them for different scenarios
#1174
jasonguy
closed
2 years ago
3
Management of shared folder permissions
#1173
leadtekleadtek
closed
2 years ago
1
Enroot failing during Deepops Slurm cluster installation
#1172
karanveersingh5623
closed
2 years ago
1
Update default Slurm version to 21.08.8-2 on release-22.04 branch
#1171
ajdecon
closed
2 years ago
0
Wrong scope in roles/slurm/tasks/login-compute-setup.yml
#1170
chschulze
closed
2 years ago
3
Update default Slurm version to 21.08.8
#1169
ajdecon
closed
2 years ago
0
Nccl test Take 1
#1168
FattyRichness
closed
2 years ago
1
[release-22.04] Update NVIDIA signing key for package repos
#1167
ajdecon
closed
2 years ago
2
Update NVIDIA signing key
#1166
ajdecon
closed
2 years ago
0