argoflow / argoflow-aws

Argoflow-AWS has been superseded by deployKF
GNU Affero General Public License v3.0

Error when deploying Kubeflow (and its components) #84

Open mtszkw opened 3 years ago

mtszkw commented 3 years ago

Hi @DavidSpek @karlschriek and others,

I just forked the argoflow-aws repo, configured it, and deployed it to my AWS account. I wanted to use a pretty basic configuration (in kustomization.yaml), i.e. no external domain, auth, etc. I managed to get argoflow up and running, but I cannot see any Kubeflow-related pods or services (I was particularly looking for the ingress-gateway as the Kubeflow UI). Could you help me? What am I missing?

Kustomization.yaml:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  ## Common
  - argocd-applications/argocd.yaml
  - argocd-applications/istio-operator.yaml
  - argocd-applications/istio.yaml    
  #- argocd-applications/knative.yaml
  # - argocd-applications/sealed-secrets.yaml
  # - argocd-applications/oauth2-proxy.yaml

  # Pick *one* of the following applications 
  - argocd-applications/cert-manager-self-signing.yaml
  # - argocd-applications/cert-manager-dns-01.yaml

  # Pick *one* of the following applications:  
  - argocd-applications/oidc-auth-on-cluster-dex.yaml
  #- argocd-applications/oidc-auth-on-cluster-keycloak.yaml
  # - argocd-applications/oidc-auth-external.yaml

  ## Kubeflow  
  - argocd-applications/central-dashboard.yaml
  - argocd-applications/profile-controller_access-management.yaml
  - argocd-applications/kubeflow-namespace.yaml
  - argocd-applications/kubeflow-profiles.yaml
  - argocd-applications/kubeflow-roles.yaml
  - argocd-applications/pipelines-base.yaml
  - argocd-applications/pipelines-iam-user.yaml
  #- argocd-applications/pipelines-iam-roles-for-service-accuonts.yaml
  # #- argocd-applications/pipelines-kube2iam.yaml
  # - argocd-applications/katib.yaml
  # - argocd-applications/kfserving.yaml
  # - argocd-applications/pod-defaults.yaml
  - argocd-applications/jupyter-web-app.yaml
  - argocd-applications/notebook-controller.yaml
  # - argocd-applications/tensorboard-controller.yaml
  # - argocd-applications/tensorboards-web-app.yaml
  # - argocd-applications/volumes-web-app.yaml
  # - argocd-applications/tensorflow-operator.yaml
  # - argocd-applications/pytorch-operator.yaml
  # - argocd-applications/mpi-operator.yaml
  # - argocd-applications/mxnet-operator.yaml
  # - argocd-applications/xgboost-operator.yaml

  ## System
  # - argocd-applications/aws-node-termination-handler.yaml
  - argocd-applications/cluster-autoscaler.yaml
  - argocd-applications/aws-load-balancer-controller.yaml
  # - argocd-applications/external-dns.yaml
  - argocd-applications/external-secrets.yaml

  ## Contrib
  - argocd-applications/mlflow.yaml
  # - argocd-applications/experimental-pvcviewer-controller.yaml
  # - argocd-applications/experimental-volumes-web-app.yaml

Output:

user@X:/mnt/c/Users/abc/argoflow-aws$ kubectl get svc --all-namespaces
NAMESPACE     NAME                    TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
argocd        argocd-dex-server       ClusterIP   10.100.95.93     <none>        5556/TCP,5557/TCP,5558/TCP   106s
argocd        argocd-metrics          ClusterIP   10.100.157.213   <none>        8082/TCP                     106s
argocd        argocd-redis            ClusterIP   10.100.83.103    <none>        6379/TCP                     106s
argocd        argocd-repo-server      ClusterIP   10.100.104.180   <none>        8081/TCP,8084/TCP            106s
argocd        argocd-server           ClusterIP   10.100.238.19    <none>        80/TCP,443/TCP               105s
argocd        argocd-server-metrics   ClusterIP   10.100.254.179   <none>        8083/TCP                     105s
default       kubernetes              ClusterIP   10.100.0.1       <none>        443/TCP                      10m
kube-system   external-secrets        ClusterIP   10.100.63.100    <none>        3001/TCP                     2m5s
kube-system   kube-dns                ClusterIP   10.100.0.10      <none>        53/UDP,53/TCP                10m
user@X:/mnt/c/Users/abc/argoflow-aws$ kubectl get pod --all-namespaces
NAMESPACE     NAME                                  READY   STATUS    RESTARTS   AGE
argocd        argocd-application-controller-0       1/1     Running   0          111s
argocd        argocd-dex-server-745d7f54c4-lkxlb    1/1     Running   0          112s
argocd        argocd-redis-5f4d559f65-th7x2         1/1     Running   0          112s
argocd        argocd-repo-server-7b59685f8c-lrq64   1/1     Running   0          112s
argocd        argocd-server-65d7bc9fc8-gqsd7        1/1     Running   0          111s
kube-system   aws-node-4xl6g                        1/1     Running   0          6m39s
kube-system   aws-node-8z97h                        1/1     Running   0          6m43s
kube-system   coredns-59b69b4849-jxdq9              1/1     Running   0          10m
kube-system   coredns-59b69b4849-p699j              1/1     Running   0          10m
kube-system   external-secrets-6cbb666466-pxb66     1/1     Running   0          2m12s
kube-system   kube-proxy-89jqh                      1/1     Running   0          6m39s
kube-system   kube-proxy-v88wd                      1/1     Running   0          6m43s

davidspek commented 3 years ago

Did you apply the kubeflow.yaml file? It looks like all that was deployed is Argo CD.

mtszkw commented 3 years ago

Yes, I did:

./setup_repo.sh examples/setup.conf
kustomize build distribution/external-secrets/ | kubectl apply -f -
kustomize build distribution/argocd/ | kubectl apply -f -
kubectl apply -f distribution/kubeflow.yaml

I just noticed that I used an invalid git repo URL (without the .git suffix). I don't know whether this was the issue, but I am now setting everything up from scratch.

davidspek commented 3 years ago

The best way to debug this kind of thing is usually the Argo CD UI. Given that nothing was deployed, you'll probably see an error on the Kubeflow application in the UI. Personally I don't use the .git suffix for the repo and I haven't noticed any problem with that.

mtszkw commented 3 years ago

I could see one error in the Kubeflow app saying that the git repo secret was not found.

davidspek commented 3 years ago

That probably blocked the deployment. I need to run an errand at the moment, but I can look at what is causing this afterwards.

mtszkw commented 3 years ago

Thanks a lot, I'll be waiting. The exact message was: secret "git-repo-secret" not found. I can see that git-repo-secret is indeed used in secret.yaml and configmap-patch.yaml, but honestly I don't know how it should be used properly.

davidspek commented 3 years ago

What you can do is manually apply the applications you want to deploy from the argocd-applications directory. The kubeflow.yaml file is mainly meant as a convenient way to deploy everything at once. This way you don't have to wait for me to continue with your work.

mtszkw commented 3 years ago

I redeployed and can still see ComparisonError: secret "git-repo-secret" not found. How (and why) should I set up this secret correctly for a public git repo? On top of that, I am unable to sync or edit my app configuration in the Argo UI, because:

Unable to load data: Request has been terminated Possible causes: the network is offline, Origin is not allowed by Access-Control-Allow-Origin, the page is being unloaded, etc.

Which might make sense as I run this deployment only on my localhost, so I guess this can all be offline.

davidspek commented 3 years ago

I've seen that Argo UI error come up a few times, and refreshing usually solves it. Are you using the port-forward method from the Argo CD documentation to access the UI? Were you able to deploy the individual application specs?

mtszkw commented 3 years ago

I am using port-forward to access the UI. I managed to deploy individual apps manually and they show up in the Applications tab, but they have the same problem as kubeflow: Status Healthy, Sync Unknown, git-repo-secret error.

CodeBooster97 commented 3 years ago

Hello @mtszkw

The git-repo-secret holds the username/password that Argo CD needs to connect to GitHub and access the manifests. You should have one external-secrets pod running in the kube-system namespace. It pulls the secret from AWS Secrets Manager and puts it inside the cluster. Try checking the logs from this pod. If the secrets are not there, Argo CD is unable to fetch the manifests.

mtszkw commented 3 years ago

Hi @GetOn4. This pod is up and running with no suspicious events, although one thing I noticed in the pod logs is:

 Environment:
      AWS_ROLE_ARN:                  <<__role_arn.external_secrets__>>

I think this one was missing in config and could not be replaced properly (this PR seems to fix it).
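
A quick way to check whether any placeholders were left unreplaced is a recursive grep over the rendered manifests. The following is a minimal sketch, assuming every placeholder has the `<<__name__>>` form seen above and that the manifests live under distribution/; the function name is hypothetical and not part of the repo:

```shell
# Print every line that still contains a "<<__name__>>" placeholder.
# $1: directory to scan (assumed default: "distribution").
# Output format is grep's usual "file:line:content"; the exit status is
# non-zero when nothing is found.
find_unreplaced_placeholders() {
  grep -rnE '<<__[A-Za-z0-9_.-]+__>>' "${1:-distribution}"
}
```

Running `find_unreplaced_placeholders distribution` from the repo root after the setup script should list any value that still needs to be set in the config file.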

CodeBooster97 commented 3 years ago

If the pod is up and running without errors, it should get the secrets from AWS Secrets Manager and create them in the cluster. I ran into some permission errors when setting this up.

You can check it: kubectl get secret git-repo-secret -n argocd

mtszkw commented 3 years ago

kubectl get secret git-repo-secret -n argocd

Error from server (NotFound): secrets "git-repo-secret" not found

Which is exactly the error I see in Argo UI for each application that is running. I see other secrets though:

argocd            argocd-application-controller-token-vr577        kubernetes.io/service-account-token   3      63m
argocd            argocd-dex-server-token-swf8k                    kubernetes.io/service-account-token   3      63m
argocd            argocd-initial-admin-secret                      Opaque                                1      62m
argocd            argocd-redis-token-nz2gj                         kubernetes.io/service-account-token   3      63m
argocd            argocd-secret                                    Opaque                                5      63m
argocd            argocd-server-token-rh5nf                        kubernetes.io/service-account-token   3      63m
argocd            default-token-kwhgr                              kubernetes.io/service-account-token   3      63m
...

CodeBooster97 commented 3 years ago

You won't see an error there, since the deployment itself is fine. You should see some errors in the external-secrets pod.

The file argoflow-aws/distribution/argocd/secret.yaml describes the secrets. Further deployments only work once Argo CD has these secrets.

mtszkw commented 3 years ago

The file in argoflow-aws/distribution/argocd/secret.yaml describes the secrets.

So in this file, again, roleArn was not replaced successfully. Is that the cause of the problem I am having?

roleArn: <<__role_arn.external_secrets.argocd__>>

CodeBooster97 commented 3 years ago

That is causing problems. You should set it. Are you sure the deployment of Argo CD is fine? Can you access the dashboard? Also, do you have the external-secret pod running in kube-system? kubectl get pods -n kube-system

mtszkw commented 3 years ago

Yes, the external-secret pod is running in kube-system and has no error events (https://github.com/argoflow/argoflow-aws/issues/84#issuecomment-850291729). ArgoCD seems fine: I can access the dashboard and see applications running, but: https://github.com/argoflow/argoflow-aws/issues/84#issuecomment-850281558

I will now set role_arn.external_secrets.argocd, update the environment and see what happens.

CodeBooster97 commented 3 years ago

That's weird. You should have the secrets if no errors are shown.

mtszkw commented 3 years ago

Ok @GetOn4, after re-running everything:

but:

Application conditions ComparisonError secret "git-repo-secret" not found

CodeBooster97 commented 3 years ago

Could you please show me the logs from the external-secret pod?

mtszkw commented 3 years ago

Oh, I was only looking at the event history and forgot about the logs. That makes more sense now.

{"level":50,"message_time":"2021-05-28T10:51:50.114Z","pid":18,"hostname":"external-secrets-6cbb666466-zvkxs","payload":{"message":"Missing credentials in config, if using AWS_CONFIG_FILE, set AWS_SDK_LOAD_CONFIG=1","code":"CredentialsError","time":"2021-05-28T10:51:50.113Z","requestId":"fb050f26-d742-4b55-b765-d8bc65adefd6","statusCode":403,"retryable":false,"retryDelay":0.2847250004510915,"originalError":{"message":"Could not load credentials from ChainableTemporaryCredentials","code":"CredentialsError","time":"2021-05-28T10:51:50.113Z","requestId":"fb050f26-d742-4b55-b765-d8bc65adefd6","statusCode":403,"retryable":false,"retryDelay":0.2847250004510915,"originalError":{"message":"User: arn:aws:sts::XXXX:assumed-role/ecas_argoflow_test2021052810242172600000000c/i-046453dc839a8d422 is not authorized to perform: sts:AssumeRole on resource: arn:aws:iam::XXXX:role/ecas_argoflow_test-admin","code":"AccessDenied","time":"2021-05-28T10:51:50.113Z","requestId":"fb050f26-d742-4b55-b765-d8bc65adefd6","statusCode":403,"retryable":false,"retryDelay":0.2847250004510915}}},"msg":"failure while polling the secret argocd/git-repo-secret"}
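
The useful part of a log line like this is the innermost "message" (here, the sts:AssumeRole denial), which is easy to miss inside the nested originalError objects. The following is a rough sketch for pulling it out without extra tooling, assuming the innermost error is always the last "message" field in the line and that messages contain no escaped quotes; the function name is hypothetical:

```shell
# Read one JSON log line on stdin and print the last "message" value,
# which in the nested originalError layout above is the root cause.
# Keys such as "message_time" and "msg" are not matched by the pattern.
root_cause() {
  grep -o '"message":"[^"]*"' \
    | tail -n 1 \
    | sed -e 's/^"message":"//' -e 's/"$//'
}
```

For example, a single line from the pod's logs could be piped in with something like `kubectl logs -n kube-system <external-secrets-pod> | tail -n 1 | root_cause`, where the pod name is whatever `kubectl get pods -n kube-system` reports.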

karlschriek commented 3 years ago

@mtszkw if you are using "Option 2" as described in the README, please see the updated instructions here: https://github.com/argoflow/argoflow-aws/pull/87

The policy to allow the external-secret IRSA role to assume the roles for each specific secret was missing.
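
For illustration only, a policy of that kind generally grants sts:AssumeRole on the per-secret role ARNs. The sketch below prints such a policy document; it is not the content of the linked PR, and the function name and any ARNs passed to it are hypothetical placeholders:

```shell
# Print an IAM policy document that allows sts:AssumeRole on each role
# ARN given as an argument (one ARN per secret-specific role).
assume_role_policy() {
  # Build a comma-separated list of quoted ARNs for the JSON array.
  resources=""
  for arn in "$@"; do
    resources="${resources:+$resources, }\"$arn\""
  done
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "sts:AssumeRole",
      "Resource": [$resources]
    }
  ]
}
EOF
}
```

The output can be saved to a file and attached to the IRSA role, for instance with `aws iam put-role-policy --role-name <irsa-role> --policy-name <name> --policy-document file://policy.json`. The updated README instructions remain the authoritative reference.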

davidspek commented 3 years ago

@mtszkw Not sure if you are still having this problem, but I believe removing this section will remove the need for a secret when using a public repository.

karlschriek commented 3 years ago

I guess we should probably write the base ArgoCD spec for public repos and then add an overlay that requires credentials for a private one.

mtszkw commented 3 years ago

@DavidSpek @karlschriek I have paused this project for a moment and will probably come back to it after the weekend.