-
After a fresh pull and `.hpcts start`, `slurmctld` never starts.
```
...
frontend | -- Waiting for slurmctld to become active ...
ondemand | nc: connect to frontend (172.19.0.9) port 22 (tcp)…
```
-
because right now it takes ~15 minutes to tear down and replace, which is a bit of a drag on development.
Here's what I think needs to happen:
- user changes `slurm.conf` on the server
- ansible …
-
## Classes
- [x] `Job` class providing read-only access to all possible members of `slurmdb_job_rec_t`
- [x] `Jobs` class, acting as a custom collection (dict) to retrieve actual job-information, and…
-
After rebuilding a cluster, both Prometheus and Grafana failed to start in monitoring.yml. Investigation showed that not all state files were owned by the relevant users, e.g.:
```shell
[root@dev-control ro…
```
-
The gateway service keeps restarting. I tried v0.4.0 as well as the latest image from master, and neither works.
![image](https://github.com/PKUHPC/SCOW/assets/25796741/57a34176-093c-4ac4-bd20-2a936e868b2a)
Error from the log:
@scow/gateway: { tag: 'v0.4.0', commit: 'd368fa621…
-
When you need to restart several services, follow this order:
- slurmdbd
- slurmctld
- slurmd
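
The ordering above can be sketched as a small helper that sorts any subset of the three daemons into that sequence before restarting them. This is a minimal sketch and not taken from the original issue: the function name `restart_commands` is hypothetical, and it assumes each daemon runs as a systemd unit named after the daemon. It only builds the commands and does not execute them.

```python
# Order taken from the list above: slurmdbd, then slurmctld, then slurmd.
RESTART_ORDER = ("slurmdbd", "slurmctld", "slurmd")


def restart_commands(services):
    """Return one restart command per requested service, sorted into RESTART_ORDER."""
    requested = set(services)
    unknown = requested - set(RESTART_ORDER)
    if unknown:
        raise ValueError(f"unknown service(s): {sorted(unknown)}")
    # Assumption: each daemon is a systemd unit named after the daemon itself.
    return [["systemctl", "restart", name] for name in RESTART_ORDER if name in requested]


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch is safe to try.
    for command in restart_commands(["slurmd", "slurmdbd"]):
        print(" ".join(command))
```

Each returned list could be passed to `subprocess.run(command, check=True)` on a host where those units exist, so that a failed restart stops the sequence before the next daemon is touched.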
-
The initial approach does not seem to communicate.
Using the architecture below for communication within clusters:
![38642211-67a7e1a4-3da7-11e8-85a9-3394ad3c8cb6](https://github.com/ecohealthalliance/s…
-
**Describe the bug**
I'm running query-exporter within a Docker container. When I try to start it with the example config.yaml, I get the following error:
```
unhandled exception during asyncio.…
```
-
Possibly related: #141
I created an operator that uses jobset to put together a slurm cluster, and I should have 4 replicated jobs, all indexed, and all of size 1 except for the workers. I'm havin…
-
Hi Developer,
I tried to deploy SCOW following the docs [https://pkuhpc.github.io/SCOW/docs/deploy], but errors occur when I run `./cli compose up`. I would appreciate your help. Here are the detai…