NVIDIA/deepops
Tools for building GPU clusters
BSD 3-Clause "New" or "Revised" License
1.25k stars · 326 forks
Issues
#1318 · Increase KillWait to 120 in slurm.conf · opened 2 weeks ago by ilya-da · 0 comments
#1317 · KillWait default in Slurm · opened 2 weeks ago by ilya-da · 1 comment
#1316 · Update GPU process cleanup logic in SLURM epilog script · opened 2 weeks ago by ilya-da · 0 comments
#1315 · Fetching PIDs for timed-out jobs during cleanup sometimes fails to kill processes · opened 2 weeks ago by ilya-da · 1 comment
#1314 · msg: 'Error connecting: Error while fetching server API version: Not supported URL scheme http+docker' · opened 1 month ago by seltsa · 0 comments
#1313 · deepops 24.08? · opened by mathrock74, closed 2 weeks ago · 3 comments
#1312 · Unable to install some galaxy collections using ./scripts/setup.sh · opened 1 month ago by alldino · 0 comments
#1311 · Ansible playbook failing to add RHEL 8 DGX Node in K8s cluster · opened 1 month ago by subasathees · 0 comments
#1310 · It seems the galaxy folder is missing · opened 2 months ago by v-ducnt69 · 1 comment
#1309 · Adding a Lua submission script · opened by clemsgrs, closed 2 months ago · 2 comments
#1308 · Upgrading NVIDIA Driver without resetting cluster · opened 4 months ago by Heegreis · 2 comments
#1307 · Errors in deepops/slurm-exporter · opened by fa-ina-tic, closed 4 months ago · 4 comments
#1306 · NIS configuration · opened by nttg8100, closed 3 months ago · 1 comment
#1305 · Compatibility with DGX H100 · opened by anubhavpatrick, closed 5 months ago · 1 comment
#1304 · Enabling persistent MIG in GPU instances of DGX-A100 · opened by murukessanap, closed 5 months ago · 2 comments
#1303 · Deepops Slurm NCCL Fail · opened by andrevianadf, closed 7 months ago · 2 comments
#1302 · Error Running ansible-playbook on slurm-cluster: Docker-ce Repository Activation Issue · opened by sikso1892, closed 8 months ago · 1 comment
#1301 · Update ansible.cfg · opened by Musab0, closed 7 months ago · 0 comments
#1300 · playbook slurm-cluster fails on DGX OS 6 on nvidia-peer-memory task · opened by itzsimpl, closed 9 months ago · 1 comment
#1299 · TLS certificate replacement steps are unclear · opened by programmer94, closed 9 months ago · 1 comment
#1298 · Extend single node K8s DeepOps with additional nodes · opened by cocakohler, closed 9 months ago · 1 comment
#1297 · NVML version + H100 GPU · opened by mathrock74, closed 1 year ago · 3 comments
#1296 · Release 23.08 · opened by dholt, closed 1 year ago · 0 comments
#1295 · slurm-master without GPU failed at nvml autodetect · opened by leoncamel, closed 1 year ago · 3 comments
#1294 · Release updates · opened by dholt, closed 1 year ago · 0 comments
#1293 · Fix for docker install playbook due to kubespray changes · opened by dholt, closed 1 year ago · 0 comments
#1292 · update nvidia_driver_ubuntu_cuda_keyring_package to latest version · opened by JH-LEE-KR, closed 1 year ago · 0 comments
#1291 · Update the Network Operator · opened by supertetelman, closed 1 year ago · 1 comment
#1290 · Docker installation playbook no longer working · opened by supertetelman, closed 1 year ago · 0 comments
#1289 · K8s dashboard is not viewable by default due to https configuration · opened by supertetelman, closed 1 year ago · 1 comment
#1288 · update roles to latest versions · opened by dholt, closed 1 year ago · 0 comments
#1287 · fix for out-of-date 3rd party ansible role causing error · opened by dholt, closed 1 year ago · 1 comment
#1286 · BUG:1284 - K8s Dashboard update · opened by supertetelman, closed 1 year ago · 0 comments
#1285 · nodelocaldns forever crashing/restarting [Info/Solution] · opened by Steven9Smith, closed 1 year ago · 2 comments
#1284 · No token generated with ./scripts/k8s/deploy_dashboard_user.sh · opened by Steven9Smith, closed 1 year ago · 3 comments
#1283 · Bump Kubeflow (1.7.0) and kustomize (5.1.0) · opened by supertetelman, closed 1 year ago · 2 comments
#1282 · Bump Kubespray to v2.22.1 · opened by supertetelman, closed 1 year ago · 0 comments
#1281 · Version bumps for GPU Operator, GFD, and Device Plugin (23.3.2) · opened by supertetelman, closed 1 year ago · 0 comments
#1280 · Is this project alive? · opened by morsoinferno, closed 1 year ago · 3 comments
#1279 · Minor: Fix hardcoded slurm username · opened by jeremyfix, closed 1 year ago · 1 comment
#1277 · Building Slurm with Lua · opened by rkevk, closed 1 year ago · 2 comments
#1276 · Error: alpine-glibc-shim was not installed · opened by paoloaq, closed 1 year ago · 2 comments
#1275 · [HELP] How can we add all available gpus? · opened by asher-lab, closed 1 year ago · 1 comment
#1274 · Does Deepops support NVIDIA driver version 515 or 525? · opened by Meeshel7, closed 1 year ago · 1 comment
#1273 · Error mounting /home: umount: /home: target is busy · opened by starlitsky2010, closed 1 year ago · 2 comments
#1272 · ERROR! 'include' is not a valid attribute for a Play · opened by jerry-birdseye, closed 1 year ago · 2 comments
#1270 · nvme Operation not permitted · opened by georgecreis, closed 1 year ago · 1 comment
#1269 · Ensure docker-ce repository is enabled failed · opened by hakimamarullah, closed 1 year ago · 0 comments
#1267 · node exporters don't work after initial run of slurm playbook · opened by jsharpe, closed 11 months ago · 5 comments
#1266 · Slurm build deps on Ubuntu missing libdbus-1-dev · opened by jsharpe, closed 1 year ago · 2 comments