AI-Hypercomputer/xpk
xpk (Accelerated Processing Kit, pronounced x-p-k) is a software tool that helps Cloud developers orchestrate training jobs on accelerators such as TPUs and GPUs on GKE.
Apache License 2.0
81 stars
23 forks
issues
Add logic to fail Pathways jobs on user code errors.
#152
RoshaniN
closed
5 months ago
1
Add a check and update existing Pathways clusters to use CloudDNS.
#151
RoshaniN
closed
5 months ago
0
Move all clusters to be RAPID clusters, and verify them using valid_v…
#150
Obliviour
closed
5 months ago
0
Fixing device_type in nightly tests.
#149
RoshaniN
closed
5 months ago
0
Restrict Pathways unified debugging logs to just first worker
#148
SujeethJinesh
closed
5 months ago
0
Fix stacktrace sidecar container yaml
#147
SurbhiJainUSC
closed
5 months ago
0
Don't Kill RM or Proxy on user job failure
#146
SujeethJinesh
opened
6 months ago
0
Enable cluster and workload creation on A3+.
#145
yangyuwei
closed
5 months ago
1
Add Unified Logging View for Pathways on Cloud
#144
SujeethJinesh
closed
6 months ago
1
Update pathways server and proxy server image locations
#143
sadikneipp
closed
6 months ago
0
Fixing misleading message on Validate Docker Image.
#142
RoshaniN
closed
6 months ago
0
Pathways in headless mode.
#141
RoshaniN
closed
5 months ago
0
Allow JAX coordinator to find the JobSet name.
#140
RoshaniN
closed
6 months ago
0
Making exit flow similar to other XPK commands.
#139
RoshaniN
closed
6 months ago
1
Update pip version to 0.5.0
#138
SurbhiJainUSC
closed
7 months ago
1
Update xpk.py
#137
kyle-google
closed
6 months ago
2
Remove user-managed service account and attach default compute engine service account to node pools
#136
SurbhiJainUSC
closed
7 months ago
0
Dynamically determine GKE Version for Cluster and Node Pool Creation
#135
Obliviour
closed
7 months ago
0
Prevent Pathways SIGTERMs from counting against backoffLimit
#134
SujeethJinesh
opened
7 months ago
0
enable create workload for h150
#133
NinaCai
opened
7 months ago
0
Fix incorrect indent in workload list output
#132
Obliviour
closed
7 months ago
0
Disable service account feature
#131
SurbhiJainUSC
closed
7 months ago
1
Set gcloud zone property for build and nightly tests
#130
SurbhiJainUSC
closed
7 months ago
0
Add project flag to service account commands
#129
SurbhiJainUSC
closed
7 months ago
0
Add project flag to service account commands and add random gcloud properties to integ tests
#128
SurbhiJainUSC
closed
7 months ago
0
Add Support for Pathways Expected Instances & Larger Default Worker Backoff Limit
#127
SujeethJinesh
closed
7 months ago
0
Import cloud-accelerator-diagnostics only when Vertex AI Tensorboard flag is set
#126
SurbhiJainUSC
closed
7 months ago
0
Add configuration setting for default pool num nodes
#125
Obliviour
closed
7 months ago
0
Add custom env variables to CPU workloads.
#124
RoshaniN
closed
7 months ago
0
Enable formating with pyink to adhere with google3 style.
#123
Obliviour
closed
7 months ago
0
Update README with Vertex AI Tensorboard information and update pip version to 0.4.0
#122
SurbhiJainUSC
closed
7 months ago
1
XPK cleanup: integ tests and code cleanup
#121
Obliviour
opened
7 months ago
0
Check cluster arguments and update nodepools in existing cluster when requesting different device_type
#120
SurbhiJainUSC
closed
6 months ago
0
Fix None docker_name and wait-for-workload-completition poll mode
#119
Obliviour
closed
7 months ago
0
Revert "Nina/unify gpu container yaml"
#118
NinaCai
closed
7 months ago
2
Add timeout=0 to readme
#117
raymondzouu
closed
7 months ago
0
Add wait-for-job-completion to integration test
#116
raymondzouu
closed
7 months ago
0
Nina/unify gpu container yaml
#115
NinaCai
closed
7 months ago
0
CPU shared clusters for Llama and Mistral model runs.
#114
RoshaniN
closed
7 months ago
0
Add warning when user schedules workload on a cluster created using previous XPK version
#113
SurbhiJainUSC
closed
7 months ago
0
return pid exit code if it is non-zero
#112
NinaCai
closed
7 months ago
0
Delete subnets when deleting the cluster
#111
NinaCai
closed
7 months ago
0
Change tensorboard_location to tensorboard_region for compatibility
#110
SurbhiJainUSC
closed
7 months ago
0
Retry again with longer wait times for kueue credentials step
#109
Obliviour
closed
7 months ago
0
Add --second-docker-image option
#108
tonyjohnchen
closed
7 months ago
2
Add workload list wait-for-job-completion feature
#107
raymondzouu
closed
7 months ago
0
Enable Autoprovisioning Support in XPK
#106
Obliviour
closed
7 months ago
0
Add dynamic versioning for pip package
#105
SurbhiJainUSC
closed
7 months ago
0
Support --env flag and Artifact Registry image validation
#104
jonb377
closed
7 months ago
0
Add Pathways end-to-end tests to build tests and nightly tests.
#103
RoshaniN
closed
7 months ago
0