Open jai opened 3 years ago
Wow, thanks a lot for this! Very helpful!
I would also add: ext-dns record not created in Route 53. Delete the istio app in the ArgoCD dashboard; it will recreate the resource and update the DNS entries.

To properly update the DNS entries, the deployment must happen in this order:

istio-operator -> external-dns -> istio-resources -> istio

Might be possible to fix with:

annotations:
  argocd.argoproj.io/sync-wave: "2"
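For reference, the ordering above could be expressed with sync waves in an app-of-apps setup, where a parent Application syncs these child Applications in ascending wave order. This is only an illustrative sketch; the Application names match the list above, but the wave numbers and the assumption that argoflow-aws uses an app-of-apps parent are mine:

```yaml
# Sketch: ArgoCD syncs lower wave numbers first, so external-dns comes up
# after istio-operator but before the Istio resources it needs to watch.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: istio-operator
  annotations:
    argocd.argoproj.io/sync-wave: "0"
---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: external-dns
  annotations:
    argocd.argoproj.io/sync-wave: "1"
---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: istio-resources
  annotations:
    argocd.argoproj.io/sync-wave: "2"
---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: istio
  annotations:
    argocd.argoproj.io/sync-wave: "3"
```

Sync waves only order resources within a single parent sync, so this helps on the initial bootstrap rather than on later independent syncs of each child app.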
Any idea why knative is not synchronizing properly?
Yes, there are a few applications/scenarios that need to happen in the correct order, it seems. One is definitely external-dns. I am trying to be diligent and write down the others as I see them, but it's super hard to pin down a single order of events given the sheer number of ArgoCD Applications!
There are a few apps that are fighting with K8s (fields going out of sync); I had this with the Knative install in our regular compute cluster too. Below is the ignoreDifferences we use for our Knative install:
spec:
  ignoreDifferences:
    - group: rbac.authorization.k8s.io
      kind: ClusterRole
      jsonPointers:
        - /rules
    - group: admissionregistration.k8s.io
      kind: ValidatingWebhookConfiguration
      jsonPointers:
        - /webhooks/0/rules
    - group: admissionregistration.k8s.io
      kind: MutatingWebhookConfiguration
      jsonPointers:
        - /webhooks/0/rules
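As a side note on how those entries address into the live object: each jsonPointers value is an RFC 6901 JSON Pointer, and the subtree it selects is excluded from ArgoCD's diff. A minimal Python sketch of the resolution semantics (the resolve_pointer helper and the sample webhook_config are hypothetical illustrations, not ArgoCD code):

```python
def resolve_pointer(doc, pointer):
    """Resolve an RFC 6901 JSON Pointer against nested dicts/lists."""
    if pointer == "":
        return doc
    node = doc
    for token in pointer.lstrip("/").split("/"):
        # Unescape per RFC 6901: ~1 -> "/", then ~0 -> "~" (order matters)
        token = token.replace("~1", "/").replace("~0", "~")
        if isinstance(node, list):
            node = node[int(token)]   # numeric tokens index into arrays
        else:
            node = node[token]        # other tokens are object keys
    return node

# Toy stand-in for a MutatingWebhookConfiguration that keeps drifting
webhook_config = {
    "webhooks": [
        {"name": "webhook.serving.knative.dev",
         "rules": [{"resources": ["domainmappings", "domainmappings/status"]}]}
    ]
}

# "/webhooks/0/rules" selects the whole rules list of the first webhook, so
# drift inside it (e.g. a controller appending "domainmappings/status") is ignored.
print(resolve_pointer(webhook_config, "/webhooks/0/rules"))
```

Because the pointer selects the entire rules list, any mutation the Knative controller makes under it stops counting as OutOfSync.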
The argoflow-aws Knative ArgoCD Application is going out of sync on the following objects:

MutatingWebhookConfiguration:
- webhook.domainmapping.serving.knative.dev (webhooks.0.rules.0.resources.1, i.e. domainmappings/status)
- webhook.serving.knative.dev

ValidatingWebhookConfiguration:
- validation.webhook.domainmapping.serving.knative.dev
- validation.webhook.serving.knative.dev

ClusterRole:
- knative-serving-admin
- knative-serving-aggregated-addressable-resolver

Am I the only one having these go out of sync? This isn't the only app - have a few of them, will post the list.
@jai Thanks for the very detailed issue thread you've started here. Sadly I haven't had much time to dedicate to the ArgoFlow repositories since starting my new job. However, there are a lot of very big Kubeflow improvements I'm working on. Basically it's a completely redesigned architecture that simplifies Kubeflow and adds better security and more advanced features around User/Group/Project management.
Regarding the KNative manifests, they are quite a pain, especially with Kustomize. I've got a Helm chart that should be usable instead, which should get rid of this continuous syncing issue. Would you like to help move the KNative deployment over to Helm? If so, I can clean up the chart a little bit and add it to a registry for you to depend on.
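If the Helm route is taken, the ArgoCD side could look roughly like this. Everything below is a placeholder sketch: the repoURL, chart name, version, and values are invented, since no such chart has been published yet:

```yaml
# Sketch: ArgoCD Application consuming a Knative Helm chart from a registry.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: knative
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/helm-charts   # placeholder registry
    chart: knative-serving                     # placeholder chart name
    targetRevision: 0.1.0                      # placeholder version
    helm:
      values: |
        domain: example.com                    # placeholder serving domain
  destination:
    server: https://kubernetes.default.svc
    namespace: knative-serving
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```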
ArgoCD Applications that are flip-flopping - not sure what the technical term is. Basically ArgoCD installs a manifest, then the cluster seems to override some values, causing an update tug-of-war kind of thing. I will post details of which resources are causing this:
aws-eks-resources
istio-resources
kfserving
knative
notebook-controller
pipelines
pod-defaults
roles
Does argoflow/argoflow-aws use vanilla Knative? If I understand what you're saying, we would have to maintain a Helm repo with the Knative manifests, which sounds like one more thing to maintain. Is there a way we can point it at the Knative Operator and then just install a CRD? I might be way off base since I've only been working with Argoflow/Kubeflow for a couple of weeks 😂
What you're saying is completely correct. The Knative Operator is probably a good fit to reduce the maintenance overhead. However, I haven't yet had time to look into it. The Istio <-> Knative <-> KFServing interplay is very fragile and took a couple weeks to get working properly (which also hasn't been upstreamed yet), so implementing the Knative Operator would need some special attention and testing.
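For anyone exploring the operator route discussed above, the usual shape is: install the Knative Operator, then declare a KnativeServing custom resource that the operator reconciles. A minimal hedged sketch (the API version shown is the v1alpha1 one current around 2021; newer operator releases use operator.knative.dev/v1beta1, and the domain value is a placeholder):

```yaml
# Sketch: KnativeServing CR managed by the Knative Operator. The operator
# installs and upgrades the Serving components to match this spec.
apiVersion: operator.knative.dev/v1alpha1
kind: KnativeServing
metadata:
  name: knative-serving
  namespace: knative-serving
spec:
  config:
    domain:
      example.com: ""   # placeholder serving domain
```

Since the operator owns the component manifests, this could also reduce the ignoreDifferences churn, but as noted it would need careful testing against the Istio/KFServing interplay.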
I'm at an early-stage startup so my availability is super patchy - I wouldn't want to start something and leave it hanging halfway. I will poke around at the KFServing/Knative parts and see what's going on - no promises I can take this on but I will always do what I can!
Update - also running into this issue: https://github.com/kserve/kserve/issues/848
Update - I think I've whittled it down to issues that can be addressed by ignoreDifferences in the ArgoCD Application CRD. I'll open a draft PR to see if that's the best way to address these issues or if there's a better way to fix them upstream/in other areas.
Update - ignoreDifferences is done, I'm currently validating and will submit PRs. Sorry for the long lead time!
We're setting up Kubeflow (argoflow-aws) from scratch, including the infrastructure and hit some stumbling blocks along the way. Wanted to document them all here (for now) and address as needed with PRs etc.
I realize that #84 exists, happy to merge into there but I'm not sure that issue deals with the specific 0.1.6 tag. That might be part of my issue as well since some things are more up-to-date on the master branch.
Current issues (can be triaged and split into separate issues or merged into existing issues)
❌ OPEN ISSUES
These are mainly based off of broken functionality or application statuses in ArgoCD
- knative
- mpi-operator (https://github.com/kubeflow/mpi-operator)
- aws-eks-resources (ignoreDifferences)

✅ SOLVED ISSUES
[✅ SOLVED] oauth2-proxy (kubeflow_oidc_cookie_secret output variable)
[✅ SOLVED] pipelines (values in setup.conf must NOT be quoted)
[✅ SOLVED] aws-load-balancer-controller
[✅ SOLVED] Central Dashboard

Logs:
Initializing Kubernetes configuration
Unable to fetch Application information: 404 page not found
"aws" is not a supported platform for Metrics
Using Profiles service at http://profiles-kfam.kubeflow:8081/kfam
Server listening on port http://localhost:8082 (in production mode)
2021-09-14T02:39:12.655692792Z Unable to fetch Application information: 404 page not found