AI-Hypercomputer/xpk
xpk (Accelerated Processing Kit, pronounced x-p-k) is a software tool that helps Cloud developers orchestrate training jobs on accelerators such as TPUs and GPUs on GKE.
Apache License 2.0
81 stars
23 forks
issues
XPK update for hybridsim
#102
tonyjohnchen
closed
7 months ago
2
Create Vertex Experiment in workload create
#101
SurbhiJainUSC
closed
7 months ago
0
Ensure proxy and server images are only provided with --use-pathways.
#100
RoshaniN
opened
7 months ago
1
Add flag to restart-on-user-failures, otherwise do not
#99
Obliviour
closed
7 months ago
0
Add gpu_multi_process_run.sh
#98
NinaCai
opened
7 months ago
0
debug dump gcs using gsutil -m
#97
GallagherCommaJack
closed
7 months ago
0
Fix number of nodes in CPUs.
#96
RoshaniN
closed
7 months ago
0
More integ tests: workload create/list/delete and inspector
#95
Obliviour
closed
7 months ago
0
Create Tensorboard instance in Vertex AI in cluster create
#94
SurbhiJainUSC
closed
7 months ago
0
Move to reserved TPU capacity
#93
Obliviour
closed
7 months ago
0
Update xpk.py
#92
sadikneipp
closed
7 months ago
3
Create a queue of nightly / build tests to keep concurrent tests from stepping on each other
#91
Obliviour
closed
8 months ago
0
Move always() to be part of the delete step
#90
Obliviour
closed
8 months ago
0
Fixed bugs and added customization to Github workflows for tests
#89
sushmarchandran
closed
8 months ago
0
Create ConfigMap for cluster metadata and add ConfigMap details to xpk inspector
#88
SurbhiJainUSC
closed
8 months ago
0
Nina xpk gpu h100
#87
NinaCai
closed
7 months ago
1
Create service account in cluster create
#86
SurbhiJainUSC
closed
8 months ago
0
Update the CPU machine type to be n1-standard-16 in integ tests
#85
Obliviour
closed
8 months ago
0
Update gke version to 1.29.1-gke.1589017
#84
Obliviour
closed
8 months ago
0
Add retry logic to kueue, jobset, cluster credential steps and update kueue to version 0.6.1
#83
Obliviour
closed
8 months ago
0
Update default gke version for cluster to "1.29.1-gke.1589000"
#82
SurbhiJainUSC
closed
8 months ago
0
Nightly and build test for XPK
#81
sushmarchandran
closed
8 months ago
1
Add xpk inspector
#80
Obliviour
closed
8 months ago
1
check gsutil installation when debug_dump_gcs is passed
#79
ssusie
closed
8 months ago
0
Update pip xpk to version 0.3
#78
Obliviour
closed
8 months ago
0
Upgrade kueue to 0.6.0 and support it in xpk workload list
#77
Obliviour
closed
8 months ago
0
update v5p-3072 topology
#76
ZhiyuLi-goog
closed
8 months ago
0
Fix xpk_internal_commands in the main container.
#75
RoshaniN
closed
8 months ago
0
Pathways integration with XPK
#74
RoshaniN
closed
8 months ago
5
Export EXIT_CODE from the user provided command, and propagate the error to Cloud Console UI
#73
Obliviour
closed
8 months ago
0
Parallelize multiple workload deletions
#72
SurbhiJainUSC
closed
8 months ago
0
Bump JobSet to v0.3.2
#71
danielvegamyhre
closed
8 months ago
0
Add debug flag to workload creation to get verbose logging
#70
SurbhiJainUSC
closed
9 months ago
0
Add XPK label to be part of each Pod and Jobset
#69
Obliviour
closed
9 months ago
0
Update xpk.py to support GKE cluster creation for H100.
#68
yangyuwei
closed
7 months ago
1
Update xpk.py to support GKE cluster creation for H100.
#67
yangyuwei
closed
9 months ago
0
Update xpk.py to support GKE cluster creation for H100.
#66
yangyuwei
closed
9 months ago
0
Support workload deletion based on workload status
#65
SurbhiJainUSC
closed
9 months ago
0
Update xpk.py to support GKE cluster creation for H100.
#64
yangyuwei
closed
9 months ago
0
Make a small change for permission test.
#63
yangyuwei
closed
9 months ago
0
Support multiple workload deletion from single workload delete command
#62
SurbhiJainUSC
closed
9 months ago
0
Output dashboard links only for TPU workloads
#61
SurbhiJainUSC
closed
9 months ago
0
Refactor xpk command argument for Disruption handling.
#60
abhinavclemson
closed
9 months ago
0
Add multihost GPU support
#59
michelle-yooh
closed
7 months ago
1
Add argument termination_grace_period_seconds
#58
abhinavclemson
closed
10 months ago
0
Add --tpu-topology flag for specifying custom topology types
#57
Obliviour
opened
10 months ago
1
Adding CPU support in XPK
#56
RoshaniN
closed
9 months ago
0
Add device type and tpu type argument to cacheimage
#55
Obliviour
closed
10 months ago
0
Fix device-type=none bug when using tpu-type
#54
michelle-yooh
closed
10 months ago
0
Upgrade JobSet to v0.3.1
#53
danielvegamyhre
closed
10 months ago
0