GoogleCloudPlatform / kubeflow-distribution

Blueprints for Deploying Kubeflow on Google Cloud Platform and Anthos
Apache License 2.0

pipelines is ready test is failing #59

Closed jlewi closed 4 years ago

jlewi commented 4 years ago

https://k8s-testgrid.appspot.com/sig-big-data#kubeflow-gcp-blueprints-master-periodic

@Bobgy could you please take a look? It could be the case that the test needs to be updated.

issue-label-bot[bot] commented 4 years ago

Issue-Label Bot is automatically applying the labels:

Label Probability
kind/bug 0.91

Please mark this comment with :thumbsup: or :thumbsdown: to give our bot feedback! Links: app homepage, dashboard and code for this bot.

issue-label-bot[bot] commented 4 years ago

Issue-Label Bot is automatically applying the labels:

Label Probability
platform/gcp 0.57

Please mark this comment with :thumbsup: or :thumbsdown: to give our bot feedback! Links: app homepage, dashboard and code for this bot.

Bobgy commented 4 years ago

@jlewi Is this the entrypoint of the test? https://github.com/kubeflow/kfctl/blob/7bfe692bdfb42002073c0ea196c5942e606ed48c/py/kubeflow/kfctl/testing/pytests/kf_is_ready_test.py#L96

jlewi commented 4 years ago

@Bobgy yes; if you set your current kubectl context to point at a cluster, you should be able to run that test locally, which may help debug it.

Bobgy commented 4 years ago

@jlewi I looked at the auto-deployed clusters and found the root cause: the persistent disks were created in the us-central1-f zone, while the nodes were in the us-central1-{a,b,c} zones, so the PVs cannot be mounted to pods.

Do you have any ideas why they are deployed to different zones?
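
The zone mismatch described above can be checked mechanically. Below is a minimal sketch, assuming the disk and node zones have already been collected into plain Python structures; the function name `find_unmountable_disks` and the disk names are hypothetical and are not part of the test suite or the auto-deployer.

```python
from typing import Dict, Iterable, List


def find_unmountable_disks(
    disk_zones: Dict[str, str], node_zones: Iterable[str]
) -> List[str]:
    """Return the sorted names of disks whose zone contains no cluster node.

    A zonal persistent disk can only be attached to a node in the same zone,
    so a disk in a zone without nodes cannot back a PV for any pod.
    """
    zones_with_nodes = set(node_zones)
    return sorted(
        name for name, zone in disk_zones.items() if zone not in zones_with_nodes
    )


# Hypothetical inventory mirroring the situation reported in this thread.
node_zones = ["us-central1-a", "us-central1-b", "us-central1-c"]
disk_zones = {
    "example-disk-1": "us-central1-f",  # no nodes in this zone
    "example-disk-2": "us-central1-a",
}

stranded = find_unmountable_disks(disk_zones, node_zones)
print(stranded)
```

Any disk reported here sits in a zone the cluster has no nodes in, which matches the symptom of PVs that cannot be mounted.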

issue-label-bot[bot] commented 4 years ago

Issue-Label Bot is automatically applying the labels:

Label Probability
area/kfctl 0.72

Please mark this comment with :thumbsup: or :thumbsdown: to give our bot feedback! Links: app homepage, dashboard and code for this bot.

jlewi commented 4 years ago

@Bobgy It's because we are using a regional cluster, so it has nodes in multiple zones. Judging by kubeflow/gcp-blueprints#6, we still have some work to do to make this work with regional clusters.

The short term fix would be to change the autodeployments to use a zonal cluster. I think we could change the defaults here https://github.com/kubeflow/testing/blob/637d1cd5fe33d03ee5646380d960fabe8a230d0a/py/kubeflow/testing/create_kf_from_gcp_blueprint.py#L69

I don't think the blueprint reconciler is overwriting the defaults.

jlewi commented 4 years ago

@Bobgy Would you mind submitting a PR to try to change the auto-deployer to use a zonal cluster?

Bobgy commented 4 years ago

It's been a Chinese holiday for the past three days; I will return to work on Sunday (a working day).

jlewi commented 4 years ago

@Bobgy thanks for the heads up; I've been OOO as well; enjoy the holiday; we can fix this next week.

jlewi commented 4 years ago

Let's not close this until we have a passing (green) run.

Bobgy commented 4 years ago

@jlewi The test is still failing. I looked at the log and found that the test script no longer matches the current deployment name. https://github.com/kubeflow/kfctl/blob/baf59c2692f45847bbd042c78a751a761b2b7eaa/py/kubeflow/kfctl/testing/pytests/kf_is_ready_test.py#L96-L106

ml-pipeline-viewer-controller-deployment is now called ml-pipeline-viewer-crd. Can I go ahead and update that test? Why is it in the kfctl repo? Will there be backward compatibility concerns?
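
To illustrate why a renamed deployment breaks a readiness check driven by a hard-coded name list, here is a minimal sketch. It is not the code in kf_is_ready_test.py; the function `unready_deployments` and the `observed` mapping (deployment name to ready and desired replica counts) are assumptions made for illustration only.

```python
from typing import Dict, Iterable, List, Tuple


def unready_deployments(
    expected: Iterable[str], observed: Dict[str, Tuple[int, int]]
) -> List[str]:
    """Describe every expected deployment that is missing or not fully ready.

    `observed` maps a deployment name to (ready_replicas, desired_replicas).
    """
    problems = []
    for name in expected:
        if name not in observed:
            # A renamed deployment shows up here as "not found".
            problems.append(f"{name}: not found")
            continue
        ready, desired = observed[name]
        if ready < desired:
            problems.append(f"{name}: {ready}/{desired} replicas ready")
    return problems


# Hypothetical cluster state: only the renamed deployment exists.
observed = {"ml-pipeline-viewer-crd": (1, 1)}

stale_expected = ["ml-pipeline-viewer-controller-deployment"]
current_expected = ["ml-pipeline-viewer-crd"]

print(unready_deployments(stale_expected, observed))
print(unready_deployments(current_expected, observed))
```

With the stale list the check reports the old name as missing even though the cluster is healthy; updating the list to the current name clears the failure.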

Bobgy commented 4 years ago

Also we should probably update that list with new deployments in KFP service.

Bobgy commented 4 years ago

@jlewi friendly ping

jlewi commented 4 years ago

@Bobgy Yes, please go ahead and update the tests as necessary to work with the current version of KFP. You can consider the code location to be a historical accident, so feel free to move the pipelines code somewhere else if it makes more sense.

Bobgy commented 4 years ago

I'm not very familiar with how to move tests so I'm going to send a PR to fix tests first.

jlewi commented 4 years ago

@Bobgy moving the tests just means

  1. Moving the Python code to wherever you want it to live
  2. Updating the appropriate Tekton task, e.g. https://github.com/kubeflow/testing/blob/c37f4bb06e268022f41214d5995df7e738e9800b/tekton/templates/tasks/kf-ready-task.yaml#L67

Updating the Tekton task might mean

  1. Adding/updating the resources to include the repository where the code now lives
  2. Updating the working directory for the task
  3. Updating the pythonpath for the task

Bobgy commented 4 years ago

Let me leave the issue open so I can try moving that Python code. This seems like a good issue for me to get a better idea of the test infra.

Bobgy commented 4 years ago

actually, I've tracked the task in https://github.com/kubeflow/pipelines/projects/5.

I think we can close this. /close

k8s-ci-robot commented 4 years ago

@Bobgy: Closing this issue.

In response to [this](https://github.com/kubeflow/gcp-blueprints/issues/59#issuecomment-655582467):

>actually, I've tracked the task in https://github.com/kubeflow/pipelines/projects/5.
>
>I think we can close this.
>/close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.