GoogleCloudDataproc / initialization-actions

Run in all nodes of your cluster before the cluster starts - lets you customize your cluster
https://cloud.google.com/dataproc/init-actions
Apache License 2.0

update xgboost 171 to 176 #1070

Closed nvliyuan closed 1 year ago

nvliyuan commented 1 year ago

Signed-off-by: liyuan <yuali@nvidia.com>

Update xgboost version from 171 to 176 due to https://github.com/dmlc/xgboost/issues/9374

nvliyuan commented 1 year ago

@jayadeep-jayaraman @cjac could you help build? FYI @viadea

cjac commented 1 year ago

/gcbrun

nvliyuan commented 1 year ago

Hi @cjac could you help check why the build failed?

cjac commented 1 year ago

2023-07-15T13:42:37.842963232Z [ FAILED ] SparkRapidsTestCase.test_install_gpu_with_mig('STANDARD', ['m', 'w-0', 'w-1'], None, 'type=nvidia-tesla-a100', 'NVIDIA', 'us-central1-b')

2023-07-15T13:42:37.842993642Z FAIL: test_install_gpu_with_mig('STANDARD', ['m', 'w-0', 'w-1'], None, 'type=nvidia-tesla-a100', 'NVIDIA', 'us-central1-b') (__main__.SparkRapidsTestCase)

2023-07-15T13:42:37.843415997Z ERROR: (gcloud.dataproc.clusters.create) Operation [projects/cloud-dataproc-ci/regions/us-central1/operations/a8ea9c21-2c9d-362a-a3da-2b0ef9156bb9] failed: The zone 'projects/cloud-dataproc-ci/zones/us-central1-b' does not have enough resources available to fulfill the request. '(resource type:compute)'..

cjac commented 1 year ago

/gcbrun

cjac commented 1 year ago

You might find better results in the long term if you change that A100 to a T4.

nvliyuan commented 1 year ago

The MIG test specifically needs A100 or A30 GPUs, and according to this doc the A30 is not supported, so it seems we can only attach A100s. Is it possible to add more A100s?

cjac commented 1 year ago

I think if we add more, they will be consumed. The demand is super high. I know we have some queue software for TPUs, but nothing for NVIDIA hardware yet. I'll just keep retrying.

cjac commented 1 year ago

/gcbrun

cjac commented 1 year ago

/gcbrun

cjac commented 1 year ago

again:

2023-07-18T04:22:07.805359170Z ERROR: (gcloud.dataproc.clusters.create) Operation [projects/cloud-dataproc-ci/regions/us-central1/operations/707aeac4-0f7a-35fa-ba68-c15505261d74] failed: The zone 'projects/cloud-dataproc-ci/zones/us-central1-b' does not have enough resources available to fulfill the request. '(resource type:compute)'..

Dagang, is it possible to retry on this error?
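For illustration, a retry on this specific error could look roughly like the sketch below. It is a minimal sketch, not the CI harness's actual code: `run_with_stockout_retry`, its parameters, and the backoff values are hypothetical, and the only thing taken from this thread is the error text quoted above, which it matches as a substring of stderr.

```python
import subprocess
import time

# Substring copied from the gcloud error quoted in this thread.
STOCKOUT_MARKER = "does not have enough resources available"


def run_with_stockout_retry(cmd, attempts=3, delay_s=60.0,
                            run=subprocess.run, sleep=time.sleep):
    """Run cmd and retry only when stderr reports a zone resource stockout.

    Hypothetical helper: `run` and `sleep` are injectable so the logic can
    be exercised without calling gcloud or actually waiting.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        result = run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result
        # Any other failure is returned immediately; retrying would not help.
        if STOCKOUT_MARKER not in result.stderr:
            return result
        if attempt < attempts:
            # Linear backoff: wait longer after each stockout.
            sleep(delay_s * attempt)
    return result
```

A caller would pass the cluster-creation command as a list of arguments and inspect `returncode` on the returned result. Retrying only on the stockout message keeps genuine test failures from being masked.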

cjac commented 1 year ago

/gcbrun

viadea commented 1 year ago

@cjac @nvliyuan What is the status of this PR? Thanks.

@SurajAralihalli FYI

cjac commented 1 year ago

@jayadeep-jayaraman - can you look at this with me next week?

cjac commented 1 year ago

/gcbrun

nvliyuan commented 1 year ago

@jayadeep-jayaraman @cjac I have temporarily disabled the MIG-related tests due to the lack of A100 GPUs, so we can merge the PR first. CC @viadea @SurajAralihalli
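The actual change is in the PR diff and is not shown in this thread. As a rough sketch of one way to disable such a test temporarily, the standard `unittest.skipUnless` decorator can gate it behind a switch. The `RUN_MIG_TESTS` environment variable and the class name below are hypothetical, chosen only for illustration.

```python
import os
import unittest

# Hypothetical opt-in switch; not a variable this repository is known to define.
MIG_TESTS_ENABLED = os.environ.get("RUN_MIG_TESTS") == "1"


class SparkRapidsSkipSketch(unittest.TestCase):
    """Illustrative stand-in for the real test case."""

    @unittest.skipUnless(MIG_TESTS_ENABLED, "A100 capacity unavailable in CI")
    def test_install_gpu_with_mig(self):
        # The real test would create an A100-backed cluster here.
        self.assertTrue(MIG_TESTS_ENABLED)

    def test_unrelated(self):
        # Tests without the decorator keep running as before.
        self.assertEqual(1 + 1, 2)
```

A skipped test is still reported in the run summary, so the gap stays visible until A100 capacity returns.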

cjac commented 1 year ago

/gcbrun

viadea commented 1 year ago

I am fine.

@SurajAralihalli FYI

nvliyuan commented 1 year ago

The checks have passed now that the MIG tests are disabled. @cjac, could you help merge?

nvliyuan commented 1 year ago

Hi @viadea, could you please help review the PR?

viadea commented 1 year ago

LGTM

viadea commented 1 year ago

@cjac @jayadeep-jayaraman We need one more approval.