-
We are building supercomputing infrastructure for an internal GPU cluster of up to thousands of expensive GPUs.
We are looking to adopt either mpi-operator or Slurm.
Slurm is widely adopted in large-scale HPC compu…
-
**Describe the bug**
The slinkee operator gets stuck deploying a slinkee cluster when there are nodes with taints that will not have a slurmabler deployed on them. The reason is that the operator wai…
-
### What you would like to be added?
As we discussed during the last Training WG call, we want to design and implement a Training Runtime for Slurm, so users can leverage the Slurm workload manager for m…
-
The full-slurmabler pods are up, but nothing happens.
Logs from the slik operator:
2024-07-04T08:23:01.645Z INFO slurm/create_slurmabler.go:102 github.com/vultr/slik/pkg/slurm.buildSlurmablerDaemonSe…
-
I'm interested in dynamically adding and removing nodes to a Flux deployment running in Slurm. I'm aware that similar functionality exists for K8s (https://flux-framework.org/flux-operator/). Since Slu…
-
Currently, the `SlurmHook` collaborates with the `ResourceExecutor` and the resource operators `ResourceBashOperator` and `ResourceGmxOperator` to manage the cores and GPUs assigned to different tasks…
-
An attempt to reproduce Nvidia's results using slurm + enroot + pyxis:
1. Downgrade the transformers and huggingface_hub libs (huggingface_hub==0.23.2 transformers==4.40.2) because the versions…
-
I am having an issue similar to #213, which wasn't resolved. I tried to use a nodups.pairs file generated with _pairtools_ to run calculate_map_resolution.sh and got the following output:
```
../opt…
-
**Context:**
I wanted to write an integration test to go with PR #522, but couldn't find a way to print the operator summary to a file. This is because `write()`/`print()` expects a Rev object, and…
-
Hello,
When I use LBM to run batch simulations of the absolute permeability of digital rock cores, I frequently run into this error:
`Traceback (most recent call last):
File "/lustre/home/2001110637/LBM/LBM.py", line 1067, in <module>
init()
File "/lustre/home/2001110637/.conda/envs/tes…