RasaHQ / rasa-x-helm

Rasa Enterprise Helm chart for deploying on Kubernetes (K8s) and OpenShift.
Apache License 2.0

rasa-x and event-service livenessProbe and readinessProbe failed #211

Closed pierluigilenoci closed 3 years ago

pierluigilenoci commented 3 years ago

Hello, we installed Rasa X with this chart and this values.yaml file:

app:
  existingUrl: https://[REDACTED]/webhook
  install: false
global:
  postgresql:
    existingSecret: [REDACTED]
    postgresqlUsername: [REDACTED]
images:
  pullPolicy: IfNotPresent
ingress:
  enabled: false
networkPolicy:
  enabled: true
nginx:
  enabled: false
postgresql:
  existingHost: [REDACTED]
  existingSecretKey: [REDACTED]
  install: false
rabbitmq:
  enabled: true
  existingHost: [REDACTED]
  existingPasswordSecretKey: [REDACTED]
  install: false
  rabbitmq:
    existingErlangSecret: [REDACTED]
    existingPasswordSecret: [REDACTED]
rasaSecret: [REDACTED]
rasax:
  persistence:
    existingClaim: [REDACTED]
redis:
  existingHost: [REDACTED]
  existingSecret: [REDACTED]
  install: false
securityContext:
  fsGroup: 1000
  runAsNonRoot: true
  runAsUser: 1000

We have a standalone RabbitMQ and Redis instance running in the same namespace.

All pods work fine except rasa-x and event-service, which are in CrashLoopBackOff with these error messages:

Readiness probe failed: Get "http://10.4.6.139:5002/": dial tcp 10.4.6.139:5002: connect: connection refused
Liveness probe failed: Get "http://10.4.6.139:5002/": dial tcp 10.4.6.139:5002: connect: connection refused
Readiness probe failed: Get "http://10.4.6.92:5673/health": dial tcp 10.4.6.92:5673: connect: connection refused
Liveness probe failed: Get "http://10.4.6.92:5673/health": dial tcp 10.4.6.92:5673: connect: connection refused

Logs from rasa-x pod:

Starting Rasa X server... 🚀
[2021-06-17 13:59:39 +0000] [9] [INFO] Goin' Fast @ http://0.0.0.0:5002

No logs from event-service.

Pods:

rasa-x-db-migration-service-0                               1/1     Running   1          3m24s
rasa-x-duckling-d6f66bcd7-s6fnv                             1/1     Running   0          3m24s
rasa-x-event-service-5b65488799-pdcmw                       0/1     Running   4          3m24s
rasa-x-rabbitmq-0                                           1/1     Running   0          23h
rasa-x-rasa-production-76fbfb856f-8tcqb                     1/1     Running   1          3m24s
rasa-x-rasa-worker-99c8f4bf8-j8ldf                          1/1     Running   1          3m24s
rasa-x-rasa-x-7c554994-dftpv                                0/1     Running   1          3m24s
rasa-x-redis-master-0                                       2/2     Running   0          23h

rasa-x events:

Events:
  Type     Reason                  Age                     From                        Message
  ----     ------                  ----                    ----                        -------
  Normal   Scheduled               4m34s                   default-scheduler           Successfully assigned [REDACTED]/rasa-x-rasa-x-7c554994-c5qq8 to [REDACTED]
  Normal   SuccessfulAttachVolume  4m18s                   attachdetach-controller     AttachVolume.Attach succeeded for volume "[REDACTED]"
  Normal   Created                 4m15s                   kubelet                     Created container rasa-x
  Normal   Started                 4m15s                   kubelet                     Started container rasa-x
  Warning  Unhealthy               2m34s (x10 over 4m4s)   kubelet                     Readiness probe failed: Get "http://10.4.6.140:5002/": dial tcp 10.4.6.140:5002: connect: connection refused
  Normal   Pulled                  2m28s (x2 over 4m15s)   kubelet                     Container image "rasa/rasa-x:0.40.1" already present on machine
  Warning  Unhealthy               2m28s (x10 over 3m58s)  kubelet                     Liveness probe failed: Get "http://10.4.6.140:5002/": dial tcp 10.4.6.140:5002: connect: connection refused
  Normal   Killing                 2m28s                   kubelet                     Container rasa-x failed liveness probe, will be restarted
  Normal   SecretRotationComplete  84s (x2 over 3m28s)     csi-secrets-store-rotation  successfully rotated K8s secret [REDACTED]

event-service events:

Events:
  Type     Reason                  Age               From                        Message
  ----     ------                  ----              ----                        -------
  Normal   Scheduled               76s               default-scheduler           Successfully assigned [REDACTED]/rasa-x-event-service-5b65488799-g8zjf to [REDACTED]
  Normal   Pulled                  75s               kubelet                     Container image "alpine:3.12.3" already present on machine
  Normal   Created                 75s               kubelet                     Created container init-db
  Normal   Started                 75s               kubelet                     Started container init-db
  Normal   SecretRotationComplete  9s                csi-secrets-store-rotation  successfully rotated K8s secret [REDACTED]
  Warning  Unhealthy               9s (x5 over 59s)  kubelet                     Readiness probe failed: Get "http://10.4.6.151:5673/health": dial tcp 10.4.6.151:5673: connect: connection refused
  Normal   Created                 8s (x3 over 74s)  kubelet                     Created container rasa-x
  Normal   Started                 8s (x3 over 74s)  kubelet                     Started container rasa-x
  Warning  Unhealthy               8s (x6 over 58s)  kubelet                     Liveness probe failed: Get "http://10.4.6.151:5673/health": dial tcp 10.4.6.151:5673: connect: connection refused
  Normal   Killing                 8s (x2 over 38s)  kubelet                     Container rasa-x failed liveness probe, will be restarted

How can we solve this?

pierluigilenoci commented 3 years ago

Note: if I disable the network policy for the chart, rasa-x is able to run.
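
For reference, disabling the chart's network policies amounts to the following in values.yaml (the same networkPolicy.enabled key shown in the values above):

networkPolicy:
  enabled: false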

sara-tagger commented 3 years ago

Thanks for the issue, @JustinaPetr will get back to you about it soon!

You may find help in the docs and the forum, too 🤗
pierluigilenoci commented 3 years ago

@JustinaPetr is there any news?

pierluigilenoci commented 3 years ago

@tmbo could you please take a look?

tczekajlo commented 3 years ago

It looks like kube-probe doesn't have access to the pods because of the network policy. Try adding the CIDR/IP address that kube-probe uses to communicate with the pod network.

See https://github.com/RasaHQ/rasa-x-helm/blob/main/charts/rasa-x/values.yaml#L754

pierluigilenoci commented 3 years ago

@tczekajlo I tried, still not working.

Values:

networkPolicy:
  enabled: true
  nodeCIDR:
    - ipBlock:
        cidr: 10.4.0.0/16

NP:

---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  annotations:
    meta.helm.sh/release-name: rasa-x
    meta.helm.sh/release-namespace: [REDACTED]
  labels:
    app.kubernetes.io/managed-by: Helm
  name: ingress-egress-from-kubelet-to-event-service
  namespace: [REDACTED]
spec:
  egress:
  - to:
    - ipBlock:
        cidr: 10.4.0.0/16
  ingress:
  - from:
    - ipBlock:
        cidr: 10.4.0.0/16
    ports:
    - port: 5673
      protocol: TCP
  podSelector:
    matchLabels:
      app.kubernetes.io/component: event-service
  policyTypes:
  - Ingress
  - Egress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  annotations:
    meta.helm.sh/release-name: rasa-x
    meta.helm.sh/release-namespace: [REDACTED]
  labels:
    app.kubernetes.io/managed-by: Helm
  name: ingress-egress-from-kubelet-to-rasa-production
  namespace: [REDACTED]
spec:
  egress:
  - to:
    - ipBlock:
        cidr: 10.4.0.0/16
  ingress:
  - from:
    - ipBlock:
        cidr: 10.4.0.0/16
    ports:
    - port: 5005
      protocol: TCP
  podSelector:
    matchLabels:
      app.kubernetes.io/component: rasa-production
  policyTypes:
  - Ingress
  - Egress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  annotations:
    meta.helm.sh/release-name: rasa-x
    meta.helm.sh/release-namespace: [REDACTED]
  labels:
    app.kubernetes.io/managed-by: Helm
  name: ingress-egress-from-kubelet-to-rasa-worker
  namespace: [REDACTED]
spec:
  egress:
  - to:
    - ipBlock:
        cidr: 10.4.0.0/16
  ingress:
  - from:
    - ipBlock:
        cidr: 10.4.0.0/16
    ports:
    - port: 5005
      protocol: TCP
  podSelector:
    matchLabels:
      app.kubernetes.io/component: rasa-worker
  policyTypes:
  - Ingress
  - Egress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  annotations:
    meta.helm.sh/release-name: rasa-x
    meta.helm.sh/release-namespace: [REDACTED]
  labels:
    app.kubernetes.io/managed-by: Helm
  name: ingress-egress-from-kubelet-to-rasa-x
  namespace: [REDACTED]
spec:
  egress:
  - to:
    - ipBlock:
        cidr: 10.4.0.0/16
  ingress:
  - from:
    - ipBlock:
        cidr: 10.4.0.0/16
    ports:
    - port: 5002
      protocol: TCP
  podSelector:
    matchLabels:
      app.kubernetes.io/component: rasa-x
  policyTypes:
  - Ingress
  - Egress

K8s events:

[REDACTED]       3m50s       Warning   Unhealthy                pod/rasa-x-duckling-d6f66bcd7-p6dl8                                   Readiness probe failed: Get "http://10.4.0.125:8000/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[REDACTED]       4m8s        Warning   Unhealthy                pod/rasa-x-rasa-production-76fbfb856f-sw4xk                           Liveness probe failed: Get "http://10.4.1.12:5005/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[REDACTED]       3m44s       Warning   Unhealthy                pod/rasa-x-duckling-d6f66bcd7-p6dl8                                   Liveness probe failed: Get "http://10.4.0.125:8000/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[REDACTED]       3m57s       Warning   Unhealthy                pod/rasa-x-rasa-worker-99c8f4bf8-gq6kc                                Liveness probe failed: Get "http://10.4.1.16:5005/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[REDACTED]       38s         Warning   Unhealthy                pod/rasa-x-rasa-production-76fbfb856f-zkhql                           Liveness probe failed: Get "http://10.4.0.146:5005/": dial tcp 10.4.0.146:5005: connect: connection refused
[REDACTED]       37s         Warning   Unhealthy                pod/rasa-x-rasa-worker-99c8f4bf8-znpfq                                Liveness probe failed: Get "http://10.4.0.125:5005/": dial tcp 10.4.0.125:5005: connect: connection refused
[REDACTED]       76s         Warning   Unhealthy                pod/rasa-x-event-service-5b65488799-hj8cn                             Liveness probe failed: Get "http://10.4.0.155:5673/health": dial tcp 10.4.0.155:5673: connect: connection refused
[REDACTED]       76s         Warning   Unhealthy                pod/rasa-x-event-service-5b65488799-hj8cn                             Readiness probe failed: Get "http://10.4.0.155:5673/health": dial tcp 10.4.0.155:5673: connect: connection refused
[REDACTED]       14s         Warning   Unhealthy                pod/rasa-x-rasa-x-7c554994-mtgcn                                      Readiness probe failed: Get "http://10.4.6.155:5002/": dial tcp 10.4.6.155:5002: connect: connection refused
[REDACTED]       13s         Warning   Unhealthy                pod/rasa-x-rasa-x-7c554994-mtgcn                                      Liveness probe failed: Get "http://10.4.6.155:5002/": dial tcp 10.4.6.155:5002: connect: connection refused
[REDACTED]       7s          Warning   Unhealthy                pod/rasa-x-db-migration-service-0                                     Readiness probe failed: HTTP probe failed with statuscode: 500
[REDACTED]       5s          Warning   Unhealthy                pod/rasa-x-db-migration-service-0                                     Liveness probe failed: HTTP probe failed with statuscode: 500

Any other suggestions?

pierluigilenoci commented 3 years ago

@tmbo I really need help figuring out how to proceed, could you please take a look?

pierluigilenoci commented 3 years ago

@virtualroot @tczekajlo @melindaloubser1 @HotThoughts @mvielkind could someone give us some hints on how to solve this?

tczekajlo commented 3 years ago

@pierluigilenoci What CNI do you use in the cluster where you have the issue? Additionally, are you sure that the source address of the requests coming from kube-probe is within the 10.4.0.0/16 CIDR?

pierluigilenoci commented 3 years ago

@tczekajlo the cluster is a hosted AKS instance (so Azure CNI with Calico for network policies, fully managed by Microsoft).

I am sure because it is the CIDR of the cluster's VNET, and we have some network policies (for example for kube-dns) that work perfectly.

---
kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: kube-system.allow-kube-dns-from-vnet
  namespace: kube-system
spec:
  podSelector:
    matchLabels:
      k8s-app: kube-dns
  ingress:
    - ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
      from:
        - ipBlock:
            # CIDR of the cluster VNET
            cidr: 10.4.0.0/16
  policyTypes:
    - Ingress
tczekajlo commented 3 years ago

@pierluigilenoci The latest version of the Helm chart (2.0.1) includes several previously missing network policies; please check whether the latest version solves your issue.

nyejon commented 3 years ago

@tczekajlo Could this have something to do with the fact that we are using an external Redis instance, and that in the various deployment.yaml files there is:

https://github.com/RasaHQ/rasa-x-helm/blob/a1c529cf503468611a758c20b6fd92d4e3ce40e7/charts/rasa-x/templates/event-service-deployment.yaml#L88

With "if redis.install"

So if redis is not installed, the redis password is not provided to the various deployments?

For RabbitMQ, the "enabled" flag is used instead.
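
The condition in question looks roughly like this in the templates (a paraphrased sketch only; the secret name and key are placeholders, not the chart's exact helpers):

{{- if .Values.redis.install }}
- name: REDIS_PASSWORD
  valueFrom:
    secretKeyRef:
      name: redis-password-secret   # placeholder, not the chart's actual helper
      key: redis-password           # placeholder
{{- end }}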

pierluigilenoci commented 3 years ago

@pierluigilenoci The latest version of the Helm chart (2.0.1) includes several previously missing network policies; please check whether the latest version solves your issue.

I tried version 2.0.1 and this is the result:

[REDACTED]       3m1s        Warning   Unhealthy                pod/rasa-x-rasa-worker-8cf4cb55f-vpk2g                                Liveness probe failed: Get "http://10.4.6.164:5005/": dial tcp 10.4.6.164:5005: connect: connection refused
[REDACTED]       4m46s       Warning   Unhealthy                pod/rasa-x-event-service-5b57fc875d-tm4tp                             Liveness probe failed: Get "http://10.4.5.121:5673/health": dial tcp 10.4.5.121:5673: connect: connection refused
[REDACTED]       2m53s       Warning   Unhealthy                pod/rasa-x-rasa-production-89d6b6b7d-bnzbt                            Liveness probe failed: Get "http://10.4.6.153:5005/": dial tcp 10.4.6.153:5005: connect: connection refused
[REDACTED]       5m1s        Warning   Unhealthy                pod/rasa-x-event-service-5b57fc875d-tm4tp                             Readiness probe failed: Get "http://10.4.5.121:5673/health": dial tcp 10.4.5.121:5673: connect: connection refused
[REDACTED]       3m48s       Warning   Unhealthy                pod/rasa-x-rasa-x-84f75644c4-scv4r                                    Liveness probe failed: Get "http://10.4.6.171:5002/": dial tcp 10.4.6.171:5002: connect: connection refused
[REDACTED]       3m51s       Warning   Unhealthy                pod/rasa-x-rasa-x-84f75644c4-scv4r                                    Readiness probe failed: Get "http://10.4.6.171:5002/": dial tcp 10.4.6.171:5002: connect: connection refused
[REDACTED]       3m21s       Warning   Unhealthy                pod/rasa-x-db-migration-service-0                                     Liveness probe failed: HTTP probe failed with statuscode: 500
[REDACTED]       67s         Warning   Unhealthy                pod/rasa-x-db-migration-service-0                                     Readiness probe failed: HTTP probe failed with statuscode: 500
[REDACTED]       60s         Warning   Unhealthy                pod/rasa-x-rasa-worker-8cf4cb55f-vpk2g                                Liveness probe failed: Get "http://10.4.6.164:5005/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[REDACTED]       39s         Warning   Unhealthy                pod/rasa-x-duckling-d6f66bcd7-2d4cr                                   Liveness probe failed: Get "http://10.4.6.121:8000/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[REDACTED]       36s         Warning   Unhealthy                pod/rasa-x-duckling-d6f66bcd7-2d4cr                                   Readiness probe failed: Get "http://10.4.6.121:8000/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[REDACTED]       52s         Warning   Unhealthy                pod/rasa-x-rasa-production-89d6b6b7d-bnzbt                            Liveness probe failed: Get "http://10.4.6.153:5005/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

In my first comment, you can see the configuration we are using to deploy Rasa X.

pierluigilenoci commented 3 years ago

@tczekajlo to give you the full picture, these are the logs of the pods:

Pod: rasa-x

Starting Rasa X server... 🚀
[2021-07-13 12:53:55 +0000] [9] [INFO] Goin' Fast @ http://0.0.0.0:5002

Pod: db-migration

Starting the database migration service (http)... 🚀
[2021-07-13 12:56:59 +0000] [6] [INFO] Goin' Fast @ http://0.0.0.0:8000
INFO:__main__:Starting the database migration service
[2021-07-13 12:56:59 +0000] [6] [INFO] Starting worker [6]
[2021-07-13 12:57:09 +0000] - (sanic.access)[INFO][10.4.6.120:44408]: GET http://10.4.6.165:8000/health  200 56
[2021-07-13 12:57:11 +0000] - (sanic.access)[INFO][10.4.6.120:44456]: GET http://10.4.6.165:8000/health  200 56

Pod: duckling

Listening on http://0.0.0.0:8000

Pod: event-service, rasa-worker and rasa-production

No log at all!
pierluigilenoci commented 3 years ago

I tried another experiment: I removed the livenessProbe and the readinessProbe from the pod deployments. The pods are now not killed by Kubernetes, but they crash on their own.

NAME                                                        READY   STATUS    RESTARTS   AGE
rasa-x-db-migration-service-0                               1/1     Running   4          11m
rasa-x-duckling-d6f66bcd7-xtv5b                             1/1     Running   0          11m
rasa-x-event-service-5979684db-v9xrl                        1/1     Running   0          9m21s
rasa-x-rabbitmq-0                                           1/1     Running   0          26d
rasa-x-rasa-production-89d6b6b7d-s8tg7                      1/1     Running   4          11m
rasa-x-rasa-worker-8cf4cb55f-d52tw                          1/1     Running   5          11m
rasa-x-rasa-x-579fb6dcfb-t8qsp                              1/1     Running   0          9m51s
rasa-x-redis-master-0                                       2/2     Running   0          6h19m

Note: Redis and RabbitMQ are installed separately. Below are the logs of the pods that restarted.

Pod: db-migration-service

Starting the database migration service (http)... 🚀
[2021-07-13 13:53:45 +0000] [6] [INFO] Goin' Fast @ http://0.0.0.0:8000
INFO:__main__:Starting the database migration service
[2021-07-13 13:53:45 +0000] [6] [INFO] Starting worker [6]
[2021-07-13 13:53:56 +0000] - (sanic.access)[INFO][10.4.6.120:49072]: GET http://10.4.6.141:8000/health  200 56
[2021-07-13 13:54:02 +0000] - (sanic.access)[INFO][10.4.6.120:49218]: GET http://10.4.6.141:8000/health  200 56
[2021-07-13 13:54:06 +0000] - (sanic.access)[INFO][10.4.6.120:49296]: GET http://10.4.6.141:8000/health  200 56
[2021-07-13 13:54:12 +0000] - (sanic.access)[INFO][10.4.6.120:49462]: GET http://10.4.6.141:8000/health  200 56
[2021-07-13 13:54:16 +0000] - (sanic.access)[INFO][10.4.6.120:49532]: GET http://10.4.6.141:8000/health  200 56
[2021-07-13 13:54:22 +0000] - (sanic.access)[INFO][10.4.6.120:49940]: GET http://10.4.6.141:8000/health  200 56
[2021-07-13 13:54:26 +0000] - (sanic.access)[INFO][10.4.6.120:50166]: GET http://10.4.6.141:8000/health  200 56
[2021-07-13 13:54:32 +0000] - (sanic.access)[INFO][10.4.6.120:50488]: GET http://10.4.6.141:8000/health  200 56
[2021-07-13 13:54:36 +0000] - (sanic.access)[INFO][10.4.6.120:50558]: GET http://10.4.6.141:8000/health  200 56

Pod: rasa-x

Starting Rasa X server... 🚀
[2021-07-13 13:53:06 +0000] [9] [INFO] Goin' Fast @ http://0.0.0.0:5002
WARNING:rasax.community.database.utils:Unable to get database revision heads.
WARNING:rasax.community.database.utils:Unable to get database revision heads.
WARNING:rasax.community.database.utils:Unable to get database revision heads.
WARNING:rasax.community.database.utils:Unable to get database revision heads.
WARNING:rasax.community.database.utils:Unable to get database revision heads.
[2021-07-13 13:55:16 +0000] - (sanic.access)[INFO][10.4.1.189:34978]: GET http://rasa-x-rasa-x.[REDACTED].svc:5002/api/config?token=[REDACTED]  503 40
[2021-07-13 13:55:16 +0000] [21] [INFO] Starting worker [21]
[2021-07-13 13:55:16 +0000] [22] [INFO] Starting worker [22]
[2021-07-13 13:55:16 +0000] [24] [INFO] Starting worker [24]
[2021-07-13 13:55:16 +0000] [20] [INFO] Starting worker [20]

Pod: event-service, rasa-worker and rasa-production

No log at all!
tczekajlo commented 3 years ago

Could this have something to do with the fact that we are using an external Redis instance, and that in the various

Currently, the network policies in the rasa-x-helm chart don't support external services. If you use external services such as Redis or RabbitMQ and you want to use network policies, you have to create and add them on your own.
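
As an illustration of what such a policy could look like, here is a minimal sketch of an egress rule allowing pods in the release namespace to reach an external Redis on the conventional port 6379; the name, namespace, selector, and CIDR are placeholders to adapt to your environment:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-external-redis-egress
  namespace: rasa-x-namespace        # placeholder
spec:
  podSelector: {}                    # or narrow this to the pods that need Redis
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 10.0.0.0/8         # placeholder: the range where your Redis host lives
      ports:
        - port: 6379
          protocol: TCP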

Also, you can set the debugMode parameter to true; then you should see more information in the logs.
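
For reference, that would look roughly like this in values.yaml (assuming debugMode is a top-level chart value):

debugMode: true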

pierluigilenoci commented 3 years ago

@tczekajlo given that the chart, in theory, supports the use of an external Redis and RabbitMQ, when will this problem with the network policies be solved?

It would be enough to allow this label to be configured in some way: https://github.com/RasaHQ/rasa-x-helm/blob/1dd6ad168c8e7ce6dd7a3513fcaf939ff584ddac/charts/rasa-x/templates/network-policy.yaml#L284-L285

To work around the problem, I added labels to our Redis deployment to match the label in your policies.
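
In practice the workaround amounts to adding the label that the chart's policy selects on to the external Redis pod template, roughly like this; the label key and value below are placeholders and must be taken from the network-policy.yaml linked above:

spec:
  template:
    metadata:
      labels:
        app.kubernetes.io/component: redis   # placeholder: copy the label from network-policy.yaml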

I also tried the new version (2.0.2) of the chart, and it continues to give problems.

Pods:

rasa-x-db-migration-service-0                               1/1     Running            5          13m
rasa-x-duckling-d6f66bcd7-xx4dm                             1/1     Running            0          13m
rasa-x-event-service-5b57fc875d-qq7fl                       0/1     Running            5          3m41s
rasa-x-rabbitmq-0                                           1/1     Running            0          28d
rasa-x-rasa-production-788f44854f-fb24x                     0/1     CrashLoopBackOff   12         46m
rasa-x-rasa-worker-7787cf478b-dsxhg                         0/1     CrashLoopBackOff   12         46m
rasa-x-rasa-x-5c786bdb6d-4glgb                              0/1     CrashLoopBackOff   13         46m
rasa-x-redis-master-0                                       2/2     Running            0          27m

Events:

[REDACTED]       3m43s       Warning   Unhealthy                pod/rasa-x-duckling-d6f66bcd7-xtv5b                                   Liveness probe failed: Get "http://10.4.4.111:8000/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[REDACTED]       3m40s       Warning   Unhealthy                pod/rasa-x-duckling-d6f66bcd7-xtv5b                                   Readiness probe failed: Get "http://10.4.4.111:8000/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[REDACTED]       3m53s       Warning   Unhealthy                pod/rasa-x-duckling-d6f66bcd7-xtv5b                                   Liveness probe failed: Get "http://10.4.4.111:8000/": dial tcp 10.4.4.111:8000: i/o timeout (Client.Timeout exceeded while awaiting headers)
[REDACTED]       33s         Warning   Unhealthy                pod/rasa-x-rasa-worker-7787cf478b-dsxhg                               Liveness probe failed: Get "http://10.4.5.173:5005/": dial tcp 10.4.5.173:5005: connect: connection refused
[REDACTED]       72s         Warning   Unhealthy                pod/rasa-x-event-service-5b57fc875d-txh59                             Liveness probe failed: Get "http://10.4.1.19:5673/health": dial tcp 10.4.1.19:5673: connect: connection refused
[REDACTED]       81s         Warning   Unhealthy                pod/rasa-x-event-service-5b57fc875d-txh59                             Readiness probe failed: Get "http://10.4.1.19:5673/health": dial tcp 10.4.1.19:5673: connect: connection refused
[REDACTED]       28s         Warning   Unhealthy                pod/rasa-x-rasa-production-788f44854f-fb24x                           Liveness probe failed: Get "http://10.4.1.16:5005/": dial tcp 10.4.1.16:5005: connect: connection refused
[REDACTED]       23s         Warning   Unhealthy                pod/rasa-x-rasa-x-5c786bdb6d-4glgb                                    Liveness probe failed: Get "http://10.4.6.152:5002/": dial tcp 10.4.6.152:5002: connect: connection refused
[REDACTED]       32s         Warning   Unhealthy                pod/rasa-x-rasa-x-5c786bdb6d-4glgb                                    Readiness probe failed: Get "http://10.4.6.152:5002/": dial tcp 10.4.6.152:5002: connect: connection refused
[REDACTED]       2s          Warning   Unhealthy                pod/rasa-x-db-migration-service-0                                     Readiness probe failed: HTTP probe failed with statuscode: 500
[REDACTED]       0s          Warning   Unhealthy                pod/rasa-x-db-migration-service-0                                     Liveness probe failed: HTTP probe failed with statuscode: 500

One problem that seems evident is that, if there is an external installation of Redis, the password is not passed due to these conditions:

https://github.com/RasaHQ/rasa-x-helm/blob/1dd6ad168c8e7ce6dd7a3513fcaf939ff584ddac/charts/rasa-x/templates/rasa-deployments.yaml#L123-L129

https://github.com/RasaHQ/rasa-x-helm/blob/1dd6ad168c8e7ce6dd7a3513fcaf939ff584ddac/charts/rasa-x/templates/event-service-deployment.yaml#L88-L94

https://github.com/RasaHQ/rasa-x-helm/blob/1dd6ad168c8e7ce6dd7a3513fcaf939ff584ddac/charts/rasa-x/templates/rasa-x-deployment.yaml#L102-L108

I manually added the environment variable to the deployment.

- name: REDIS_PASSWORD
  valueFrom:
    secretKeyRef:
      key: REDIS_PASSWORD
      name: [REDACTED]

And I removed the readiness and liveness probes to get the pods running.

With these changes there has been a positive evolution, although it still does not work 100%.

At this point db-migration-service started complaining about not being able to connect to postgres.

Process ForkProcess-1:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py", line 2336, in _wrap_pool_connect
    return fn()
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/pool/base.py", line 304, in unique_connection
    return _ConnectionFairy._checkout(self)
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/pool/base.py", line 778, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/pool/base.py", line 495, in checkout
    rec = pool._do_get()
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/pool/impl.py", line 140, in _do_get
    self._dec_overflow()
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
    compat.raise_(
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/util/compat.py", line 182, in raise_
    raise exception
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/pool/impl.py", line 137, in _do_get
    return self._create_connection()
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/pool/base.py", line 309, in _create_connection
    return _ConnectionRecord(self)
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/pool/base.py", line 440, in __init__
    self.__connect(first_connect_check=True)
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/pool/base.py", line 661, in __connect
    pool.logger.debug("Error on connect(): %s", e)
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
    compat.raise_(
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/util/compat.py", line 182, in raise_
    raise exception
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/pool/base.py", line 656, in __connect
    connection = pool._invoke_creator(self)
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/strategies.py", line 114, in connect
    return dialect.connect(*cargs, **cparams)
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/default.py", line 508, in connect
    return self.dbapi.connect(*cargs, **cparams)
  File "/usr/local/lib/python3.8/dist-packages/psycopg2/__init__.py", line 127, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: could not connect to server: Connection timed out
    Is the server running on host "[REDACTED].postgres.database.azure.com" (104.40.169.187) and accepting
    TCP/IP connections on port 5432?

So I had to manually create an egress policy for it to work. This is because your allow-dns-access policy blocks all egress from all pods in the namespace except for port 53 and whatever the specific egress policies allow.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-postgres-access
  namespace: [REDACTED]
spec:
  egress:
    - ports:
      - port: 5432
        protocol: UDP
      - port: 5432
        protocol: TCP
  podSelector: {}
  policyTypes:
    - Egress

After that db-migration-service completed the process.

Pod: db-migration-service

INFO:__main__:The database migration has finished. DB revision: ['652500998f3e']
[2021-07-15 11:19:18 +0000] - (sanic.access)[INFO][10.4.0.122:39392]: GET http://10.4.1.16:8000/health  200 56
[IDENTICAL LINES CUT]
[2021-07-15 11:56:10 +0000] - (sanic.access)[INFO][10.4.0.122:43312]: GET http://10.4.1.16:8000/health  200 56

The event-service pod is still in CrashLoopBackOff.

Pod: event-service

Check for database migrations completed.
INFO:__main__:Starting event service (standalone: True).
INFO:rasax.community.services.event_consumers.event_consumer:Started Sanic liveness endpoint at port '5673'.
[2021-07-15 14:05:04 +0000] [19] [INFO] Goin' Fast @ http://0.0.0.0:5673
[2021-07-15 14:05:04 +0000] [19] [INFO] Starting worker [19]

Pod: duckling

Listening on http://0.0.0.0:8000

Then RabbitMQ (custom installation) started complaining because it was not able to connect to the K8s API, so I had to create another network policy:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-access
  namespace: [REDACTED]
spec:
  egress:
    - ports:
      - port: 443
        protocol: UDP
      - port: 443
        protocol: TCP
  podSelector: {}
  policyTypes:
    - Egress

After that, RabbitMQ started to work properly, and so did rasa-production, rasa-x, and rasa-worker.

rasa-x-db-migration-service-0                               1/1     Running   0          173m
rasa-x-duckling-d6f66bcd7-xx4dm                             1/1     Running   0          4h10m
rasa-x-event-service-94d585499-4cl66                        1/1     Running   0          5m41s
rasa-x-rabbitmq-0                                           1/1     Running   0          10m
rasa-x-rasa-production-64d8589788-p4kth                     1/1     Running   0          5m55s
rasa-x-rasa-worker-5867678fd9-f6c22                         1/1     Running   0          5m21s
rasa-x-rasa-x-758d9f5f58-h9vnk                              1/1     Running   0          3h52m
rasa-x-redis-master-0                                       2/2     Running   0          4h24m

Now everything is working correctly (with my workarounds), at least judging from the logs. We will now test the application and report back with any further findings.

To conclude, from our point of view: two network policies are missing, the rasa, event-service, and rasa-x deployment templates must be corrected for the wrong condition described above, and the correct label for opening traffic to Redis is missing.

pierluigilenoci commented 3 years ago

@tczekajlo a small update related to the tests we performed. We found that an additional network policy is missing, because the pods can't make SSH connections to GitHub to download repositories.

---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ssh-access
  namespace: [REDACTED]
spec:
  egress:
    - ports:
        - port: 22
          protocol: UDP
        - port: 22
          protocol: TCP
  podSelector: {}
  policyTypes:
    - Egress
pierluigilenoci commented 3 years ago

@tmbo @tczekajlo @JustinaPetr I would love to have feedback. ❤️

tczekajlo commented 3 years ago

@pierluigilenoci I'll be able to take a look at it, probably next week.

pierluigilenoci commented 3 years ago

@tczekajlo did you find the time to take a look at it?

pierluigilenoci commented 3 years ago

@tczekajlo any update on this?

pierluigilenoci commented 3 years ago

@tczekajlo it's been 76 days since I opened the issue and 42 days since you said you'd take a look at it. Is there any news?

pierluigilenoci commented 3 years ago

Maybe @RASADSA or @rgstephens can help out?

RASADSA commented 3 years ago

@pierluigilenoci if I understand correctly, you run everything on AKS with the Calico CNI. The network policies provided with the Rasa X Helm chart are meant as a blueprint for mapping out your own network policies. On certain setups they work out of the box, but not on all. Since there are a lot of CNIs in the Kubernetes field, and many of them handle network policies differently, it is impossible for us to cover every case and debug it remotely. Fully managed K8s cloud providers in particular are often a black box from the CNI visibility point of view. I recommend disabling the Rasa X network policies and walking through the official Azure AKS network policy documentation https://docs.microsoft.com/en-us/azure/aks/use-network-policies to create your own attuned network policies.

pierluigilenoci commented 3 years ago

@RASADSA if you read all my comments on the issue, you'll see that I have explicitly documented all the problems with the current chart. Making a PR to correct the problem should be simple and, above all, "generalizable". None of the fixes are specific to the cloud implementation.

RASADSA commented 3 years ago

@pierluigilenoci I recommend waiting for @tczekajlo's return, since I'm not into that topic and I don't want to dismiss your effort.

In general, every Kubernetes administrator is accountable for their network policies, not the Helm chart creator.

https://kubernetes.io/docs/concepts/services-networking/network-policies/

Network policies are implemented by the network plugin. To use network policies, you must be using a networking solution which supports NetworkPolicy. Creating a NetworkPolicy resource without a controller that implements it will have no effect.

https://rancher.com/blog/2019/2019-03-21-comparing-kubernetes-cni-providers-flannel-calico-canal-and-weave/

Kubernetesโ€™ adoption of the CNI standard allows for many different network solutions to exist within the same ecosystem. The diversity of options available means that most users will be able to find a CNI plugin that suits their current needs and deployment environment, while also providing solutions when their circumstances change. Operating requirements vary immensely between organisations, so having a number of mature solutions with different levels of complexity and feature richness helps Kubernetes satisfy unique requirements while still offering a fairly consistent user experience.

In my experience, it makes a huge difference for network policies whether you run them on AWS / AKS / GCE or bare metal. Depending on what kind of security concept you follow, it's hard to generalize network policies for all the different variations of CNI implementations.

Take Calico CNI as an example: https://docs.projectcalico.org/getting-started/kubernetes/

It can be cloud managed or self managed, and so on; on top of that there are VXLAN, BGP, NodePort traffic, ClusterIP, and LoadBalancer / external services.

A lot of people are already using this Helm chart, and we should be very careful about what we change at the network policy level, since changes would just be rolled out on the next Helm chart upgrade run.

pierluigilenoci commented 3 years ago

@RASADSA I understand the concern about introducing changes to the NPs but, as is customary, just release the chart with an appropriate version bump, a note about the breaking changes, and, if you really want to be conservative, a migration guide.

As I wrote, my additions to the NPs are safe and unrelated to the specific implementation. They only open additional ports, starting from a condition of "all traffic blocked to and from the namespace".

My idea is that a Helm chart, apart from configuration/customization of the values, should work without external intervention.

And anyway, I can also understand that you don't care about solving the problem because it doesn't affect you directly.

pierluigilenoci commented 3 years ago

@RASADSA @tczekajlo any update?

RASADSA commented 3 years ago

@pierluigilenoci after a longer internal discussion, we will not extend the network policies and will close the issue. We don't have the capacity to support network policies for multiple CNIs and make sure that they always work. On the topic of your own network policies out of the box:

pierluigilenoci commented 3 years ago

@RASADSA I accept your choice but do not agree with it.

I am quite disappointed but certainly not surprised. For me, an open-source project should be managed differently. But these are obviously opinions, and everyone has their own.

nyejon commented 3 years ago

@RASADSA should I open a separate issue for the Redis install issue then?

That was also not addressed, as pointed out by @pierluigilenoci in https://github.com/RasaHQ/rasa-x-helm/issues/211#issuecomment-880736816

pierluigilenoci commented 2 years ago

@RASADSA so... in the end, it looks like I was right. 😉

Ref: https://github.com/RasaHQ/rasa-x-helm/pull/275 https://github.com/RasaHQ/rasa-x-helm/pull/282