-
Hello 👋 Thank you for considering this feature request :) I have been looking over dask-jobqueue (together with prefect) to allocate resources on a Slurm cluster I have access to. dask-jobqueue seems…
-
To support a compute cluster on cloudmesh, we propose a new orchestration design for clusters handling big data. We will investigate OpenStack Heat, Chef, Puppet, and Docker to see if there are…
-
Slurm clusters mount a single storage cluster used for home directories.
This ticket would allow Slurm and Kubernetes to mount additional data volumes.
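On the Kubernetes side, an additional data volume could look like the following pod-spec fragment (a rough sketch; the volume name, claim name, image, and mount path are placeholders, not part of this ticket):

```yaml
# Hypothetical pod-spec fragment: an extra data volume mounted alongside
# the usual home-directory storage. All names and paths are illustrative.
spec:
  volumes:
    - name: shared-data
      persistentVolumeClaim:
        claimName: shared-data-pvc
  containers:
    - name: worker
      image: worker:latest
      volumeMounts:
        - name: shared-data
          mountPath: /data
```

On the Slurm side, the equivalent would typically be a filesystem mount configured on the compute nodes rather than per-job configuration.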
-
We have existing SLURM documentation at https://rapids.ai/hpc. We should migrate it to the deployment documentation here.
Create a new documentation page called `source/hpc/slurm.md` with instruc…
-
You need to change line 108 of src/miniwdl_slurm/__init__.py, as it uses the wrong Slurm command-line flag.
Current value:
srun_args.extend(["--cpus-per-task", str(cpu)])
Change to:
srun_a…
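For context, a minimal sketch of how such an `srun` argument list is typically assembled (`build_srun_args` is a hypothetical helper for illustration, not miniwdl-slurm's actual API; since the corrected flag is truncated in the report above, the sketch only reproduces the reported current behavior):

```python
def build_srun_args(cpu: int, memory_mb: int) -> list:
    """Assemble an srun invocation as a list of argument strings (illustrative)."""
    srun_args = ["srun"]
    # The line flagged in the issue: --cpus-per-task requests CPUs per task;
    # the report says this is the wrong flag for the intended behavior.
    srun_args.extend(["--cpus-per-task", str(cpu)])
    srun_args.extend(["--mem", f"{memory_mb}M"])
    return srun_args

print(build_srun_args(4, 8192))
```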
-
We plan to use [UCSB's HPC clusters](https://csc.cnsi.ucsb.edu/clusters) to run the analysis on the new species (and maybe all species if the workflow changes a lot). There are several options, some o…
-
## 🐛 Bug
Gemma-7b with FSDP zero3 trained on 2 nodes with 8 H100s each gives an OOM error at BS = 2 for both `thunder_cudnn` and `thunder_inductor_cat_cudnn`. The same configuration works for `inducto…
-
Thank you for sharing this fantastic work.
As I do not have access to a SLURM cluster, is there DDP training code available?
Or can anyone help?
-
Slurm is a utility for managing and scheduling workloads on a cluster of computers.
Many academic institutions use it to distribute computation.
I was wondering if it would be a good idea to impleme…
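A typical implementation submits work by rendering a batch script and handing it to `sbatch`. A minimal sketch, assuming nothing about any particular codebase (`make_sbatch_script` is a hypothetical helper):

```python
def make_sbatch_script(job_name: str, command: str, cpus: int = 1,
                       time: str = "01:00:00") -> str:
    """Render a minimal Slurm batch script (illustrative only)."""
    return "\n".join([
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --cpus-per-task={cpus}",
        f"#SBATCH --time={time}",
        command,
    ])

script = make_sbatch_script("demo", "python train.py")
print(script)
# Submitting would then be roughly:
#   subprocess.run(["sbatch", path_to_script], check=True)
```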
-
I am trying to pretrain BLIP-2 on a Slurm cluster, but it seems that the current program does not support distributed training on Slurm by default. Any advice on it?
| distributed init (rank 0, world 1)…
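Under Slurm, the rank and world size usually have to be derived from Slurm's environment variables (`SLURM_PROCID`, `SLURM_NTASKS`, `SLURM_LOCALID`) rather than the `RANK`/`WORLD_SIZE` variables a launcher like torchrun would set. A hedged sketch of that derivation (the BLIP-2 codebase's actual init logic may differ):

```python
import os

def slurm_dist_env(environ=None):
    """Derive (rank, world_size, local_rank) from Slurm env vars (illustrative)."""
    env = os.environ if environ is None else environ
    rank = int(env.get("SLURM_PROCID", 0))
    world_size = int(env.get("SLURM_NTASKS", 1))
    local_rank = int(env.get("SLURM_LOCALID", 0))
    return rank, world_size, local_rank

rank, world_size, local_rank = slurm_dist_env()
# On a real cluster one would then typically initialize the process group, e.g.:
# torch.distributed.init_process_group("nccl", rank=rank, world_size=world_size)
```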