AI-Hypercomputer/xpk
xpk (Accelerated Processing Kit, pronounced x-p-k) is a software tool that helps Cloud developers orchestrate training jobs on accelerators such as TPUs and GPUs on GKE.
Apache License 2.0
70 stars
18 forks
issues
Update the CloudDNS check. · #194 · lukebaumann · closed · 1 day ago · 0
Add a troubleshooting tip · #193 · guptaaka · closed · 2 days ago · 0
Introduce Storage API · #192 · PBundyra · opened · 2 days ago · 0
Consider configuring kueue waitForPodsReady · #191 · avrittrohwer · opened · 4 days ago · 2
Support setting node auto-provisioning cpu and memory parameters · #190 · avrittrohwer · opened · 4 days ago · 0
Fix GKE node version selection · #189 · 44past4 · opened · 1 week ago · 1
Fix GKE node version selection · #188 · 44past4 · closed · 1 week ago · 0
Fix autoprovisioning with spot nodes · #187 · avrittrohwer · opened · 1 week ago · 1
Fix autoprovisioning with spot nodes · #186 · avrittrohwer · closed · 1 week ago · 1
Fix GKE node version selection logic · #185 · 44past4 · closed · 1 week ago · 1
better core dump for debugging · #184 · ZhiyuLi-goog · opened · 2 weeks ago · 0
Fix debug logging (--enable-debug-logs) · #183 · Obliviour · closed · 3 weeks ago · 0
Fixes a typo in the base command description · #182 · lukebaumann · closed · 3 weeks ago · 0
Fix debug logging · #181 · Obliviour · closed · 3 weeks ago · 1
Add Zarr Flag for Pathways · #180 · SujeethJinesh · closed · 1 month ago · 1
Trillium device support · #179 · Obliviour · closed · 2 weeks ago · 1
Added advanced usage example for a notebook interacting with a Cloud … · #178 · nhira · closed · 1 month ago · 0
Enabling Workload Identity and GCSFuse driver flags added. · #177 · sharabiani · closed · 3 weeks ago · 1
Pbundyra refactor commands · #176 · PBundyra · closed · 1 month ago · 0
Create `commands` package and `core/` modules for NAP, Kueue and Pathways · #175 · PBundyra · closed · 1 month ago · 0
Add quotes even to example output to help devs who copy commands from… · #174 · nhira · closed · 1 month ago · 0
Add quotes even in example output to help devs who copy commands from the example output comments · #173 · nhira · closed · 1 month ago · 0
Update RxDM image version from v1.0.8 to v1.0.9. · #172 · yangyuwei · closed · 1 month ago · 0
Update RxDM image version from v1.0.8 to v1.0.9. · #171 · yangyuwei · closed · 1 month ago · 1
Move SystemCharacteristics to a separate module · #170 · PBundyra · closed · 1 month ago · 0
Allow debug_dump_gcs to be specified with other XLA_FLAGS · #169 · jonb377 · opened · 1 month ago · 0
Create `parser` package. Move logic from `xpk.py` to `parser` package. · #168 · PBundyra · closed · 1 month ago · 1
Fix issue with device check failure · #167 · jonb377 · closed · 1 month ago · 1
Create `xpk` package with `utils` module · #166 · PBundyra · closed · 1 month ago · 3
Create xpk package, utils module and refactor · #165 · PBundyra · closed · 2 weeks ago · 0
Enabling Workload Identity and GCSFuse driver flags · #164 · sharabiani · closed · 1 month ago · 0
Python3.10 fix - use CSV format for gcloud commands to simplify parsing · #163 · nhira · closed · 1 month ago · 1
Allow SIGTERM error code to be returned from XPK · #162 · Obliviour · closed · 1 month ago · 0
Create cluster from several reservations · #161 · DwarKapex · opened · 2 months ago · 1
Fix non-accelerator pools from being part of accelerator node pool cr… · #160 · Obliviour · closed · 2 weeks ago · 0
Use csv formatting instead in the gcloud command to split the names o… · #159 · Obliviour · closed · 1 month ago · 1
xpk Cluster Queue resource group "cpu" resource quota incorrect for a CPU-only cluster · #158 · bernardhan33 · opened · 2 months ago · 11
Correct Suspend/Resume backoffLimit for Pathways · #157 · SujeethJinesh · closed · 2 months ago · 3
Remove flag `pathways_compilation_mode` from xpk.py · #156 · norx1991 · closed · 3 months ago · 0
Remove incorrect plural from filter-by-job · #155 · Obliviour · closed · 3 months ago · 0
Update XPK to support topology-aware scheduler for GPU workloads. · #154 · yangyuwei · closed · 3 months ago · 1
Update the CloudDNS check. · #153 · lukebaumann · closed · 1 day ago · 3
Add logic to fail Pathways jobs on user code errors. · #152 · RoshaniN · closed · 3 months ago · 1
Add a check and update existing Pathways clusters to use CloudDNS. · #151 · RoshaniN · closed · 3 months ago · 0
Move all clusters to be RAPID clusters, and verify them using valid_v… · #150 · Obliviour · closed · 4 months ago · 0
Fixing device_type in nightly tests. · #149 · RoshaniN · closed · 4 months ago · 0
Restrict Pathways unified debugging logs to just first worker · #148 · SujeethJinesh · closed · 4 months ago · 0
Fix stacktrace sidecar container yaml · #147 · SurbhiJainUSC · closed · 4 months ago · 0
Don't Kill RM or Proxy on user job failure · #146 · SujeethJinesh · opened · 4 months ago · 0
Enable cluster and workload creation on A3+. · #145 · yangyuwei · closed · 4 months ago · 1