-
The first and simpler approach for HPC integration is to use SSH access and SSH keys, so that our app user can log in to the cluster as each user and start the Slurm job on their behalf.
Note that CAS integration (includ…
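
The SSH approach described above can be sketched as follows. This is a minimal illustration, not the project's actual implementation: the function name `build_sbatch_command`, the host, the user, and the key path are all hypothetical, and it assumes the app's key is authorized for the target user and that `sbatch` is on the remote `PATH`.

```python
import shlex


def build_sbatch_command(user, host, key_path, script_path):
    """Return the argv list that submits script_path with sbatch over ssh.

    All arguments are illustrative; nothing here is taken from the source.
    """
    # Quote the script path because ssh hands the remote command to a shell.
    remote_command = "sbatch " + shlex.quote(script_path)
    return [
        "ssh",
        "-i", key_path,          # private key the app uses for this user
        "-o", "BatchMode=yes",   # fail instead of prompting for a password
        f"{user}@{host}",
        remote_command,
    ]


# Hypothetical values, for illustration only.
cmd = build_sbatch_command(
    "alice", "cluster.example.org", "/path/to/app_key", "/home/alice/job.sh"
)
print(cmd)
```

The resulting list can be passed to `subprocess.run(cmd, capture_output=True, text=True)`; building it as a list avoids local shell interpolation of the user and host values.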
-
GPUs are increasingly used in scientific servers. It would be nice to have GPU stats features in PSUtil.
For examples of existing GPU monitoring software for Intel, NVIDIA, or AMD GPUs, see the post h…
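
As a rough idea of what such GPU stats could look like for NVIDIA devices, the sketch below parses the CSV output of `nvidia-smi`. It is not a PSUtil API: `parse_gpu_stats` and the dictionary keys are hypothetical names, and the sample text is made-up input rather than output captured from a real machine.

```python
import csv
import io

# Command whose output the parser expects (NVIDIA only).
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(text):
    """Parse the CSV produced by QUERY into a list of per-GPU dicts.

    Assumes every field is numeric; a field such as "[N/A]" would raise
    ValueError and needs extra handling in real use.
    """
    stats = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not row:
            continue  # skip blank lines
        index, util, used, total = (field.strip() for field in row)
        stats.append({
            "index": int(index),
            "utilization_percent": int(util),
            "memory_used_mib": int(used),
            "memory_total_mib": int(total),
        })
    return stats


# Made-up sample input, for illustration only.
sample = "0, 35, 1024, 16384\n1, 0, 12, 16384\n"
gpus = parse_gpu_stats(sample)
print(gpus)
```

On a machine with the NVIDIA driver installed, the input text could come from `subprocess.run(QUERY, capture_output=True, text=True).stdout`.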
-
### What version of KubeKey has the issue?
v3.0.12
### What is your os environment?
centos 7.9
### KubeKey config file
```yaml
apiVersion: kubekey.kubesphere.io/v1alpha2
kind: Cluster
metada…
-
# Open Grant Proposal: `NGPU -- AI DePin`
**Project Name:** `NGPU`
**Proposal Category:** `Integrations`
**Individual or Entity Name:** `Metadata Labs Inc.`
**Proposer:** `Alain Garner`
…
-
We trained custom rtdetrv2 models in a multi-GPU setting. Single-GPU training works fine, but with multiple GPUs the training hangs in the first epoch for a long time. We hav…
-
Kubernetes version: v1.23.16
# nvidia-docker info
Client: Docker Engine - Community
Version: 24.0.2
Context: default
Debug Mode: false
Plugins:
buildx: Docker Buildx (Docker Inc.)
…
-
`20210817 09:44:36 WARN: Can't load NVML library, dlopen(2): failed to load libnvidia-ml.so, libnvidia-ml.so: cannot open shared object file: No such file or directory`
`20210817 09:44:36 WARN: NVML …
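
The first warning above reports that `dlopen` could not find `libnvidia-ml.so`. A minimal way to reproduce that check outside the application is sketched below; `nvml_available` is a hypothetical helper, and the second library name (`libnvidia-ml.so.1`) is an assumption about how the driver package names the file, not something stated in the log.

```python
import ctypes


def nvml_available(names=("libnvidia-ml.so", "libnvidia-ml.so.1")):
    """Return True if any of the given shared libraries can be loaded.

    ctypes.CDLL uses the platform loader (dlopen on Linux), so a failure
    here corresponds to the "cannot open shared object file" warning.
    """
    for name in names:
        try:
            ctypes.CDLL(name)
            return True
        except OSError:
            continue  # try the next candidate name
    return False


print("NVML loadable:", nvml_available())
```

If this prints `False`, the library is either not installed or not on the loader's search path.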
-
### What version of KubeKey has the issue?
kk version: &version.Info{Major:"3", Minor:"0", GitVersion:"v3.0.13", GitCommit:"ac75d3ef3c22e6a9d999dcea201234d6651b3e72", GitTreeState:"clean", BuildDa…
-
### Is there an existing issue for this?
- [X] I searched the existing issues and did not find anything similar.
### Current Behavior
When I open the Resources app, the NVIDIA GPU wakes up from s…
-
### 🐛 Describe the bug
The process works correctly with DDP world size 1, but with world size > 1 it hangs with GPU 0 at 0% and GPU 1 pinned at maximum occupancy. I've replicated this bot…