argoflow / argoflow-aws

Argoflow-AWS has been superseded by deployKF
GNU Affero General Public License v3.0

[0.1.6] Deploying argoflow-aws #227

Open jai opened 3 years ago

jai commented 3 years ago

We're setting up Kubeflow (argoflow-aws) from scratch, including the infrastructure and hit some stumbling blocks along the way. Wanted to document them all here (for now) and address as needed with PRs etc.

I realize that #84 exists, happy to merge into there but I'm not sure that issue deals with the specific 0.1.6 tag. That might be part of my issue as well since some things are more up-to-date on the master branch.

Current issues (can be triaged and split into separate issues or merged into existing issues)

❌ OPEN ISSUES

These are mainly based off of broken functionality or application statuses in ArgoCD

knative

mpi-operator (https://github.com/kubeflow/mpi-operator)

The MPI Operator makes it easy to run allreduce-style distributed training on Kubernetes.

aws-eks-resources

✅ SOLVED ISSUES

[✅ SOLVED] oauth2-proxy

[✅ SOLVED] pipelines

[✅ SOLVED] aws-load-balancer-controller

[✅ SOLVED] Central Dashboard

kubeflow-centraldashboard@0.0.2 serve /app
node dist/server.js

Initializing Kubernetes configuration
Unable to fetch Application information: 404 page not found
"aws" is not a supported platform for Metrics
Using Profiles service at http://profiles-kfam.kubeflow:8081/kfam
Server listening on port http://localhost:8082 (in production mode)
Unable to fetch Application information: 404 page not found
2021-09-14T02:39:12.655692792Z


* Update - it seems we shouldn't port-forward into the dashboard. However, `aws-load-balancer-controller` has an issue (see below)
* **Solution**: the dashboard cannot be accessed with `kubectl port-forward`; it needs to be accessed through its proper URL, `<<__subdomain_dashboard__>>.<<__domain__>>`

### [✅ SOLVED] `kube-prometheus-stack`

- **Impact: Low**
- `kube-prometheus-stack-grafana` ConfigMap and Secret are going out of sync (in ArgoCD), which causes checksums in the Deployment to go out of sync as well
- Was an issue on v0.1.6, resolved by deploying `master` (b90cb8af46439f25a15306975fc99fd42a06f378)

EKami commented 3 years ago

Wow, thanks a lot for this! Very helpful!

EKami commented 3 years ago

I would also add: external-dns record not created in Route 53.

Delete the istio app in the ArgoCD dashboard; ArgoCD will recreate the resource and update the DNS entries. For the DNS entries to be updated properly, the deployment must happen in this order: istio-operator -> external-dns -> istio-resources -> istio.

Might be possible to fix with:

  annotations:
    argocd.argoproj.io/sync-wave: "2"
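
As a rough sketch of how that ordering could be expressed, the fragments below put one sync wave on each of the four Applications named above. This assumes the Applications are themselves managed by a parent ArgoCD Application (app-of-apps). The wave numbers and the `argocd` namespace are illustrative assumptions, not values taken from argoflow-aws, and each `spec` is left out.

```yaml
# Sketch only: metadata fragments, one per ArgoCD Application.
# Lower waves are synced first. Wave numbers and namespace are assumptions.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: istio-operator
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "0"
# spec omitted
---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: external-dns
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "1"
# spec omitted
---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: istio-resources
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "2"
# spec omitted
---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: istio
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "3"
# spec omitted
```

Sync waves only order resources within a single sync of the parent. Whether ArgoCD waits for each child Application to become healthy before moving to the next wave depends on how Application health checks are configured, so this may not be sufficient on its own.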

EKami commented 3 years ago

Any idea why knative is not synchronizing properly?

jai commented 3 years ago

> I would also add: external-dns record not created in Route 53.
>
> Delete the istio app in the ArgoCD dashboard; ArgoCD will recreate the resource and update the DNS entries. For the DNS entries to be updated properly, the deployment must happen in this order: istio-operator -> external-dns -> istio-resources -> istio.
>
> Might be possible to fix with:
>
>     annotations:
>       argocd.argoproj.io/sync-wave: "2"

Yes, it seems a few applications need to be deployed in the correct order. external-dns is definitely one. I am trying to be diligent and write down the others as I see them, but it's very hard to pin down a single order of events given the sheer number of ArgoCD Applications!

jai commented 3 years ago

> Any idea why knative is not synchronizing properly?

There are a few apps that are fighting with K8s, with fields going out of sync. I had this with the Knative install in our regular compute cluster too. Below is the `ignoreDifferences` we use for our Knative install:

spec:
  ignoreDifferences:
  - group: rbac.authorization.k8s.io
    kind: ClusterRole
    jsonPointers:
    - /rules
  - group: admissionregistration.k8s.io
    kind: ValidatingWebhookConfiguration
    jsonPointers:
    - /webhooks/0/rules
  - group: admissionregistration.k8s.io
    kind: MutatingWebhookConfiguration
    jsonPointers:
    - /webhooks/0/rules
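
For context, here is a minimal sketch of where such a block sits in a complete ArgoCD Application. The Application name, `repoURL`, `path`, `targetRevision` and namespaces are placeholders, not the values argoflow-aws uses, and only one of the three entries above is repeated.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: knative                  # hypothetical name
  namespace: argocd              # assumes ArgoCD runs in the "argocd" namespace
spec:
  project: default
  source:
    repoURL: https://example.com/your-org/your-repo.git   # placeholder
    targetRevision: HEAD                                   # placeholder
    path: knative                                          # placeholder
  destination:
    server: https://kubernetes.default.svc
    namespace: knative-serving
  # Fields listed here are excluded when ArgoCD compares desired and live state.
  ignoreDifferences:
    - group: admissionregistration.k8s.io
      kind: ValidatingWebhookConfiguration
      jsonPointers:
        - /webhooks/0/rules
```

Note that `/webhooks/0/rules` is a JSON pointer to the first webhook only; a configuration that defines several webhooks would need one pointer per index.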

The argoflow-aws Knative ArgoCD Application is going out of sync on the following objects:


Am I the only one having these go out of sync? This isn't the only app - have a few of them, will post the list.

davidspek commented 3 years ago

@jai Thanks for the very detailed issue thread you've started here. Sadly I haven't had much time to dedicate to the ArgoFlow repositories since starting my new job. However, there are a lot of very big Kubeflow improvements I'm working on. Basically it's a completely redesigned architecture that simplifies Kubeflow and adds better security and more advanced features around User/Group/Project management.

Regarding the Knative manifests, they are quite a pain, especially with Kustomize. I've got a Helm chart that should be usable instead and should get rid of this continuous syncing issue. Would you like to help move the Knative deployment over to Helm? If so, I can clean up the chart a little bit and add it to a registry for you to depend on.

jai commented 3 years ago

ArgoCD Applications that are flip-flopping (not sure what the technical term is): ArgoCD applies a manifest, then the cluster seems to override some values, causing an update tug-of-war. I will post details of which resources are causing this:

jai commented 3 years ago

> @jai Thanks for the very detailed issue thread you've started here. Sadly I haven't had much time to dedicate to the ArgoFlow repositories since starting my new job. However, there are a lot of very big Kubeflow improvements I'm working on. Basically it's a completely redesigned architecture that simplifies Kubeflow and adds better security and more advanced features around User/Group/Project management.
>
> Regarding the Knative manifests, they are quite a pain, especially with Kustomize. I've got a Helm chart that should be usable instead and should get rid of this continuous syncing issue. Would you like to help move the Knative deployment over to Helm? If so, I can clean up the chart a little bit and add it to a registry for you to depend on.

Does argoflow/argoflow-aws use vanilla Knative? If I understand what you're saying, we would have to maintain a Helm repo with the Knative manifests, which sounds like one more thing to maintain. Is there a way we can point it at the Knative Operator and then just install a CRD? I might be way off base since I've only been working with Argoflow/Kubeflow for a couple of weeks 😂
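
For reference, with the Knative Operator installed, Knative Serving is requested through a `KnativeServing` custom resource rather than a set of raw manifests. A minimal sketch follows. The `apiVersion` depends on the operator release (older releases used `operator.knative.dev/v1alpha1`), and the Istio ingress setting is an assumption based on argoflow-aws using Istio, not something tested against this repository.

```yaml
# Sketch only: asks the Knative Operator to install Knative Serving.
apiVersion: operator.knative.dev/v1beta1   # may be v1alpha1 on older operator releases
kind: KnativeServing
metadata:
  name: knative-serving
  namespace: knative-serving
spec:
  ingress:
    istio:
      enabled: true                        # assumption: Istio is the ingress layer
```

The operator then owns the generated Knative resources, so the ArgoCD Application would only need to track this one object plus the operator itself.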

davidspek commented 3 years ago

What you're saying is completely correct. The Knative Operator is probably a good fit to reduce the maintenance overhead. However, I haven't yet had time to look into it. The Istio <-> Knative <-> KFServing interplay is very fragile and took a couple weeks to get working properly (which also hasn't been upstreamed yet), so implementing the Knative Operator would need some special attention and testing.

jai commented 3 years ago

> What you're saying is completely correct. The Knative Operator is probably a good fit to reduce the maintenance overhead. However, I haven't yet had time to look into it. The Istio <-> Knative <-> KFServing interplay is very fragile and took a couple weeks to get working properly (which also hasn't been upstreamed yet), so implementing the Knative Operator would need some special attention and testing.

I'm at an early-stage startup so my availability is super patchy - I wouldn't want to start something and leave it hanging halfway. I will poke around at the KFServing/Knative parts and see what's going on - no promises I can take this on but I will always do what I can!

jai commented 3 years ago

Update - also running into this issue: https://github.com/kserve/kserve/issues/848

jai commented 3 years ago

Update - I think I've whittled it down to differences that can be addressed with `ignoreDifferences` in the ArgoCD Application spec. I'll open a draft PR to see if that's the best way to address these issues or if there's a better way to fix them upstream or in other areas.

jai commented 2 years ago

Update - ignoreDifferences is done, I'm currently validating and will submit PRs. Sorry for the long lead time!