GoogleCloudPlatform/ai-infra-cluster-provisioning
Apache License 2.0
37
stars
25
forks
issues
disable deletion prevention and update readmes
#381
stevenBorisko
opened
2 days ago
0
fix missing region variable in a3 gke example
#380
chengcongdu
opened
1 month ago
1
Update v1 schema pause image
#379
chrishenzie
closed
1 month ago
2
GKE Megatron demo workload
#378
MrGeislinger
closed
1 month ago
1
Merge develop -> main
#377
Chris113113
closed
3 months ago
0
Add NCCL workload for A3 mega and update README guides.
#376
samcmho
closed
3 months ago
2
Update README.md to include user config in docker script
#375
samcmho
closed
4 months ago
0
Update A3 mega terraform example
#374
samcmho
closed
4 months ago
0
Bump version to 1.5.0
#373
Chris113113
closed
4 months ago
0
Add A3-Mega support (#371)
#372
Chris113113
closed
4 months ago
0
Add A3-Mega support
#371
Chris113113
closed
4 months ago
0
Update NCCL link and rename a3-mega GKE in terraform module
#370
samcmho
closed
4 months ago
0
Configure Renovate
#369
renovate-bot
opened
4 months ago
0
Update NCCL dependency version and NCCL config
#368
samcmho
opened
4 months ago
0
Add A3-Megagpu-8g SKU to tool.
#367
Chris113113
closed
4 months ago
0
Merging Develop -> Main for sample_workloads changes
#366
Chris113113
closed
6 months ago
0
Add base container image Jax TCPX
#365
samos123
opened
6 months ago
0
Update lit_gpt commit to PyTorch 2.2
#364
Chris113113
closed
6 months ago
1
Update setup_and_launch_training.sh
#363
samcmho
closed
6 months ago
0
Update README.md for all customers to cover all-to-all
#362
samcmho
closed
6 months ago
0
Update setup_and_launch_training.sh
#361
samcmho
closed
7 months ago
0
Adding a new config parameter to combine layers during FSDP
#360
tejasnagendra
opened
7 months ago
1
Replace hardcoded parameters with environment variables in litgpt_container.sh
#359
samcmho
closed
7 months ago
0
Add NoTCPX flow to nccltest
#358
Chris113113
opened
7 months ago
0
Fix NCCL_SOCKET_IFNAME typo in values.yaml under nccltest/gke
#357
hmhv1222
closed
7 months ago
0
host_maintenance variable is a gSC only config
#356
vponnam
opened
8 months ago
0
host_maintenance_interval config is a gSC specific config
#355
vponnam
opened
8 months ago
0
Pirillo/litgpt nvtx
#354
Chris113113
closed
6 months ago
1
Specify a working GKE version. Update mount for user credential
#353
ultrons
opened
8 months ago
0
NCCL benchmark error
#352
ultrons
opened
8 months ago
0
remove default node pool deletion
#351
stevenBorisko
closed
6 months ago
0
main to develop
#350
stevenBorisko
closed
8 months ago
0
Release v1.4.2
#349
Chris113113
closed
8 months ago
0
Update litgpt LKG, more params for injection
#348
Chris113113
closed
8 months ago
0
Adding a simple Multi-Node Pingpong PyTorch Workload
#347
parambole
closed
8 months ago
3
Adding a simple Multi-Node Pingpong PyTorch Workoad
#346
parambole
closed
8 months ago
0
Add a nccl-test sample workload
#345
Chris113113
closed
8 months ago
0
Fix unsupported envvar are set for SLURM cluster #343
#344
parambole
closed
9 months ago
2
[P2] Unsupported envvar are set for SLURM cluster
#343
parambole
closed
9 months ago
0
Adding SLURM scripts to setup and launch lit-gpt training
#342
parambole
closed
9 months ago
0
Update rxdm image version
#341
Chris113113
closed
9 months ago
0
add reservations to mig-cos
#340
stevenBorisko
opened
9 months ago
0
Adding details to explain MFU calculation
#339
parambole
closed
9 months ago
0
Correct env var n_layers in llmfoundry_container_entrypoint.sh
#338
hmhv1222
closed
9 months ago
0
Add FSDP params, maxSeqLen, dtms, actCkpt, nLayers
#337
hmhv1222
closed
9 months ago
0
Update and correct typo in values.yaml.example
#336
hmhv1222
opened
9 months ago
0
remove profiling setup (currently not used)
#335
gkroiz
closed
9 months ago
0
Small fixes to Lit-GPT demo
#334
gkroiz
closed
9 months ago
0
Adding hello world SLURM example
#333
tejasnagendra
opened
9 months ago
1
Add litgpt readme
#332
Chris113113
closed
9 months ago
0