nvliyuan closed 1 year ago
@jayadeep-jayaraman @cjac could you help build? FYI @viadea
/gcbrun
Hi @cjac could you help check why the build failed?
2023-07-15T13:42:37.842963232Z [ FAILED ] SparkRapidsTestCase.test_install_gpu_with_mig('STANDARD', ['m', 'w-0', 'w-1'], None, 'type=nvidia-tesla-a100', 'NVIDIA', 'us-central1-b')
2023-07-15T13:42:37.842993642Z FAIL: test_install_gpu_with_mig('STANDARD', ['m', 'w-0', 'w-1'], None, 'type=nvidia-tesla-a100', 'NVIDIA', 'us-central1-b') (main.SparkRapidsTestCase)
2023-07-15T13:42:37.843415997Z ERROR: (gcloud.dataproc.clusters.create) Operation [projects/cloud-dataproc-ci/regions/us-central1/operations/a8ea9c21-2c9d-362a-a3da-2b0ef9156bb9] failed: The zone 'projects/cloud-dataproc-ci/zones/us-central1-b' does not have enough resources available to fulfill the request. '(resource type:compute)'..
/gcbrun
You might find better results in the long term if you change that A100 to a T4.
I think if we add more, they will be consumed. The demand is super high. I know we have some queue software for TPUs, but nothing for NVIDIA hardware yet. I'll just keep retrying.
/gcbrun
/gcbrun
again:
2023-07-18T04:22:07.805359170Z ERROR: (gcloud.dataproc.clusters.create) Operation [projects/cloud-dataproc-ci/regions/us-central1/operations/707aeac4-0f7a-35fa-ba68-c15505261d74] failed: The zone 'projects/cloud-dataproc-ci/zones/us-central1-b' does not have enough resources available to fulfill the request. '(resource type:compute)'..
Dagang, is it possible to retry on this error?
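One way the retry asked about above could look, as a minimal sketch only. It assumes the cluster-creation step is wrapped in a callable and that a failure surfaces as an exception carrying the gcloud error text. `ClusterCreateError`, `create_with_retry`, and the delay values are hypothetical names and numbers chosen for illustration, not part of the existing test harness. Only the "does not have enough resources available" message from the logs above is treated as retryable.

```python
import time

# Substring taken from the gcloud error in the CI logs above.
STOCKOUT_MARKER = "does not have enough resources available"


class ClusterCreateError(Exception):
    """Hypothetical error type carrying the gcloud stderr text."""


def create_with_retry(create_cluster, max_attempts=5, base_delay=60.0,
                      sleep=time.sleep):
    """Call create_cluster(), retrying only on zone stockout errors.

    The wait doubles after each failed attempt (base_delay, 2x, 4x, ...).
    Any other error, or a stockout on the final attempt, is re-raised.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return create_cluster()
        except ClusterCreateError as err:
            retryable = STOCKOUT_MARKER in str(err)
            if not retryable or attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

Backoff alone does not help if the zone stays out of A100 capacity for the whole retry window, so a cap on attempts keeps the build from hanging indefinitely.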
/gcbrun
@cjac @nvliyuan what is the status of this PR? Thanks.
@SurajAralihalli FYI
@jayadeep-jayaraman - can you look at this with me next week?
/gcbrun
@jayadeep-jayaraman @cjac I have temporarily disabled the MIG-related tests because A100 GPUs are not available, so we can merge the PR first. CC @viadea @SurajAralihalli
/gcbrun
I am fine.
@SurajAralihalli FYI
It seems the checks passed after we disabled the MIG tests. Could you help merge? @cjac
Hi @viadea, could you please help review the PR?
LGTM
@cjac @jayadeep-jayaraman we need one more approval.
Signed-off-by: liyuan <yuali@nvidia.com>
update xgboost version from 171 to 176 due to https://github.com/dmlc/xgboost/issues/9374