centerforaisafety/cerberus-cluster
HPC cluster code and configurations for running on OCI
Universal Permissive License v1.0
4 stars, 0 forks
issues
Update playbooks to not update node state when state is `mix`
#270
andriy-safe-ai
closed
7 months ago
0
Set /tmp as default download location for SLURM RPMs.
#269
andriy-safe-ai
closed
1 week ago
1
Update playbooks to download SLURM RPM files to `/tmp` always
#268
andriy-safe-ai
opened
7 months ago
0
Added onboarding script
#267
andriy-safe-ai
closed
7 months ago
0
Add onboarding script
#266
andriy-safe-ai
closed
7 months ago
0
258 add dcgm low level gpu resource utilization
#265
andriy-safe-ai
closed
8 months ago
1
Add ability to update login shell
#264
steven-safeai
closed
8 months ago
0
262 nix fix
#263
andriy-safe-ai
closed
8 months ago
3
Nix Fix
#262
andriy-safe-ai
closed
8 months ago
0
260 rebase repo with oci hpc
#261
andriy-safe-ai
closed
8 months ago
0
Rebase repo with OCI HPC
#260
andriy-safe-ai
closed
8 months ago
0
Hardcode the OL version to mitigate the change on Hashicorp repo
#259
andriy-safe-ai
closed
8 months ago
1
Add DCGM / low-level GPU resource utilization.
#258
WilliamHodgkins
closed
8 months ago
0
Documentation to help new people figure out fine-tuning with DeepSpeed and Hugging Face Accelerate
#257
WilliamHodgkins
opened
9 months ago
0
Updated documentation on the cluster website page to provide more information on what the cluster can do, on security/privacy, etc.
#256
WilliamHodgkins
opened
9 months ago
0
Reduce storage used in shared "Private models" folder
#255
WilliamHodgkins
opened
9 months ago
0
Ensure that most commonly used models are stored in shared Public Models folder
#254
WilliamHodgkins
opened
9 months ago
0
Add error logging, monitoring, and alerting to the billing system
#253
andriy-safe-ai
opened
10 months ago
0
250 billing system
#252
andriy-safe-ai
closed
5 months ago
3
Install mosh on the login node
#251
steven-safeai
opened
11 months ago
0
Billing System
#250
andriy-safe-ai
closed
5 months ago
0
Build slurm resource/charge back model to be able to calculate per user cost.
#249
ghost
closed
2 weeks ago
8
Cost report based on user compute based on Slurm TRES.
#248
ghost
closed
2 weeks ago
0
Alert monitoring in OCI for networking egress.
#247
ghost
opened
11 months ago
0
Upgrade ProdOKE add one more node
#246
ghost
closed
8 months ago
2
added play to install texlive on all nodes
#245
andriy-safe-ai
closed
11 months ago
6
Install the texlive package
#244
andriy-safe-ai
opened
11 months ago
0
Deploy new OCI image(s) in Cerberus Cluster
#243
ghost
closed
6 months ago
0
Prepare for new Image OCI in Cerberus
#242
ghost
closed
6 months ago
3
Do diff between BM Oracle base image and our running compute nodes.
#241
ghost
closed
11 months ago
2
Spin up new image in Dev env.
#240
ghost
closed
4 months ago
2
Ask Oracle if anything needs to be addressed
#239
ghost
closed
11 months ago
1
Recreate clean base image; allow others to log on to the machine and look around, clean, etc. before creating the final image.
#238
ghost
closed
11 months ago
1
Create image of existing cerberus compute node.
#237
ghost
closed
11 months ago
1
Create cerberus compute node image
#236
ghost
closed
11 months ago
2
Retest epilog script
#235
steven-safeai
opened
12 months ago
1
Prevent installing tmux on nix
#234
steven-safeai
closed
4 months ago
1
Prometheus can't just be one node anymore; need to create a cluster
#233
ghost
closed
8 months ago
1
Remove Extra Mount Targets
#232
andriy-safe-ai
closed
2 weeks ago
1
205 migrate nix to weka
#231
andriy-safe-ai
closed
1 year ago
0
Isolate /tmp on each node
#230
steven-safeai
opened
1 year ago
0
Change default admin password in Weka
#229
andriy-safe-ai
closed
1 year ago
0
added configure_for_weka.sh script
#228
andriy-safe-ai
closed
1 year ago
4
fix /var/crash filling causing slurm node drain
#227
ghost
opened
1 year ago
5
Update NVIDIA drivers and CUDA version for PyTorch 2.1
#226
steven-safeai
closed
6 months ago
0
Physical GPU Utilization Tracking
#225
ghost
opened
1 year ago
13
Waiting times per User in Slurm
#224
ghost
opened
1 year ago
1
Fair Share Verification
#223
ghost
closed
1 year ago
4
Data Backfilling & Prioritization
#222
ghost
opened
1 year ago
1
Chart Display Accuracy
#221
ghost
closed
11 months ago
3