pierluigilenoci closed this issue 3 years ago.
Note: if I disable the network policy for the chart, rasa-x is able to run.
@justinaPetr is there any news?
@tmbo could you please take a look?
It looks like kube-probe doesn't have access to the pods because of the network policy. Try adding the CIDR/IP address that kube-probe uses to communicate with the pod network.
See https://github.com/RasaHQ/rasa-x-helm/blob/main/charts/rasa-x/values.yaml#L754
@tczekajlo I tried, still not working.
Values:
networkPolicy:
  enabled: true
  nodeCIDR:
    - ipBlock:
        cidr: 10.4.0.0/16
NP:
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  annotations:
    meta.helm.sh/release-name: rasa-x
    meta.helm.sh/release-namespace: [REDACTED]
  labels:
    app.kubernetes.io/managed-by: Helm
  name: ingress-egress-from-kubelet-to-event-service
  namespace: [REDACTED]
spec:
  egress:
  - to:
    - ipBlock:
        cidr: 10.4.0.0/16
  ingress:
  - from:
    - ipBlock:
        cidr: 10.4.0.0/16
    ports:
    - port: 5673
      protocol: TCP
  podSelector:
    matchLabels:
      app.kubernetes.io/component: event-service
  policyTypes:
  - Ingress
  - Egress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  annotations:
    meta.helm.sh/release-name: rasa-x
    meta.helm.sh/release-namespace: [REDACTED]
  labels:
    app.kubernetes.io/managed-by: Helm
  name: ingress-egress-from-kubelet-to-rasa-production
  namespace: [REDACTED]
spec:
  egress:
  - to:
    - ipBlock:
        cidr: 10.4.0.0/16
  ingress:
  - from:
    - ipBlock:
        cidr: 10.4.0.0/16
    ports:
    - port: 5005
      protocol: TCP
  podSelector:
    matchLabels:
      app.kubernetes.io/component: rasa-production
  policyTypes:
  - Ingress
  - Egress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  annotations:
    meta.helm.sh/release-name: rasa-x
    meta.helm.sh/release-namespace: [REDACTED]
  labels:
    app.kubernetes.io/managed-by: Helm
  name: ingress-egress-from-kubelet-to-rasa-worker
  namespace: [REDACTED]
spec:
  egress:
  - to:
    - ipBlock:
        cidr: 10.4.0.0/16
  ingress:
  - from:
    - ipBlock:
        cidr: 10.4.0.0/16
    ports:
    - port: 5005
      protocol: TCP
  podSelector:
    matchLabels:
      app.kubernetes.io/component: rasa-worker
  policyTypes:
  - Ingress
  - Egress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  annotations:
    meta.helm.sh/release-name: rasa-x
    meta.helm.sh/release-namespace: [REDACTED]
  labels:
    app.kubernetes.io/managed-by: Helm
  name: ingress-egress-from-kubelet-to-rasa-x
  namespace: [REDACTED]
spec:
  egress:
  - to:
    - ipBlock:
        cidr: 10.4.0.0/16
  ingress:
  - from:
    - ipBlock:
        cidr: 10.4.0.0/16
    ports:
    - port: 5002
      protocol: TCP
  podSelector:
    matchLabels:
      app.kubernetes.io/component: rasa-x
  policyTypes:
  - Ingress
  - Egress
K8s events:
[REDACTED] 3m50s Warning Unhealthy pod/rasa-x-duckling-d6f66bcd7-p6dl8 Readiness probe failed: Get "http://10.4.0.125:8000/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[REDACTED] 4m8s Warning Unhealthy pod/rasa-x-rasa-production-76fbfb856f-sw4xk Liveness probe failed: Get "http://10.4.1.12:5005/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[REDACTED] 3m44s Warning Unhealthy pod/rasa-x-duckling-d6f66bcd7-p6dl8 Liveness probe failed: Get "http://10.4.0.125:8000/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[REDACTED] 3m57s Warning Unhealthy pod/rasa-x-rasa-worker-99c8f4bf8-gq6kc Liveness probe failed: Get "http://10.4.1.16:5005/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[REDACTED] 38s Warning Unhealthy pod/rasa-x-rasa-production-76fbfb856f-zkhql Liveness probe failed: Get "http://10.4.0.146:5005/": dial tcp 10.4.0.146:5005: connect: connection refused
[REDACTED] 37s Warning Unhealthy pod/rasa-x-rasa-worker-99c8f4bf8-znpfq Liveness probe failed: Get "http://10.4.0.125:5005/": dial tcp 10.4.0.125:5005: connect: connection refused
[REDACTED] 76s Warning Unhealthy pod/rasa-x-event-service-5b65488799-hj8cn Liveness probe failed: Get "http://10.4.0.155:5673/health": dial tcp 10.4.0.155:5673: connect: connection refused
[REDACTED] 76s Warning Unhealthy pod/rasa-x-event-service-5b65488799-hj8cn Readiness probe failed: Get "http://10.4.0.155:5673/health": dial tcp 10.4.0.155:5673: connect: connection refused
[REDACTED] 14s Warning Unhealthy pod/rasa-x-rasa-x-7c554994-mtgcn Readiness probe failed: Get "http://10.4.6.155:5002/": dial tcp 10.4.6.155:5002: connect: connection refused
[REDACTED] 13s Warning Unhealthy pod/rasa-x-rasa-x-7c554994-mtgcn Liveness probe failed: Get "http://10.4.6.155:5002/": dial tcp 10.4.6.155:5002: connect: connection refused
[REDACTED] 7s Warning Unhealthy pod/rasa-x-db-migration-service-0 Readiness probe failed: HTTP probe failed with statuscode: 500
[REDACTED] 5s Warning Unhealthy pod/rasa-x-db-migration-service-0 Liveness probe failed: HTTP probe failed with statuscode: 500
Any other suggestions?
@tmbo I really need help figuring out how to proceed; could you please take a look?
@virtualroot @tczekajlo @melindaloubser1 @HotThoughts @mvielkind could someone give us some hints on how to solve this?
@pierluigilenoci What CNI do you use in the cluster where you have the issue? Additionally, are you sure that the source address for requests that come from kube-probe is within the 10.4.0.0/16 CIDR?
@tczekajlo the cluster is a hosted AKS instance (so Azure CNI with Calico for network policies, fully managed by Microsoft).
I am sure because it is the CIDR of the VNET of the cluster and we have some network policies (for example for kube-dns) that work perfectly.
---
kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: kube-system.allow-kube-dns-from-vnet
  namespace: kube-system
spec:
  podSelector:
    matchLabels:
      k8s-app: kube-dns
  ingress:
  - ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
    from:
    - ipBlock:
        # CIDR of the cluster VNET
        cidr: 10.4.0.0/16
  policyTypes:
  - Ingress
@pierluigilenoci The latest version of the helm chart (2.0.1) includes several missing network policies; please check whether it solves your issues.
@tczekajlo Could this have something to do with the fact that we are using an external Redis instance? The various deployment.yaml files contain an "if redis.install" condition, so if Redis is not installed by the chart, the Redis password is not provided to the various deployments. For RabbitMQ, the "enabled" flag is used instead.
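To illustrate what I mean, a condition of this shape in a deployment template would have this effect (a sketch only; the names and the secret key here are assumptions, not the chart's exact code):

# Hypothetical sketch: the Redis password env var is only rendered when the
# chart installs Redis itself, so with an external Redis (redis.install=false)
# the password is never injected into the deployment.
{{- if .Values.redis.install }}
- name: REDIS_PASSWORD
  valueFrom:
    secretKeyRef:
      name: {{ .Release.Name }}-redis  # assumed secret name
      key: redis-password              # assumed key
{{- end }}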
> @pierluigilenoci The latest version of the helm chart (2.0.1) includes several missing network policies; please check whether it solves your issues.
I tried version 2.0.1 and this is the result:
[REDACTED] 3m1s Warning Unhealthy pod/rasa-x-rasa-worker-8cf4cb55f-vpk2g Liveness probe failed: Get "http://10.4.6.164:5005/": dial tcp 10.4.6.164:5005: connect: connection refused
[REDACTED] 4m46s Warning Unhealthy pod/rasa-x-event-service-5b57fc875d-tm4tp Liveness probe failed: Get "http://10.4.5.121:5673/health": dial tcp 10.4.5.121:5673: connect: connection refused
[REDACTED] 2m53s Warning Unhealthy pod/rasa-x-rasa-production-89d6b6b7d-bnzbt Liveness probe failed: Get "http://10.4.6.153:5005/": dial tcp 10.4.6.153:5005: connect: connection refused
[REDACTED] 5m1s Warning Unhealthy pod/rasa-x-event-service-5b57fc875d-tm4tp Readiness probe failed: Get "http://10.4.5.121:5673/health": dial tcp 10.4.5.121:5673: connect: connection refused
[REDACTED] 3m48s Warning Unhealthy pod/rasa-x-rasa-x-84f75644c4-scv4r Liveness probe failed: Get "http://10.4.6.171:5002/": dial tcp 10.4.6.171:5002: connect: connection refused
[REDACTED] 3m51s Warning Unhealthy pod/rasa-x-rasa-x-84f75644c4-scv4r Readiness probe failed: Get "http://10.4.6.171:5002/": dial tcp 10.4.6.171:5002: connect: connection refused
[REDACTED] 3m21s Warning Unhealthy pod/rasa-x-db-migration-service-0 Liveness probe failed: HTTP probe failed with statuscode: 500
[REDACTED] 67s Warning Unhealthy pod/rasa-x-db-migration-service-0 Readiness probe failed: HTTP probe failed with statuscode: 500
[REDACTED] 60s Warning Unhealthy pod/rasa-x-rasa-worker-8cf4cb55f-vpk2g Liveness probe failed: Get "http://10.4.6.164:5005/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[REDACTED] 39s Warning Unhealthy pod/rasa-x-duckling-d6f66bcd7-2d4cr Liveness probe failed: Get "http://10.4.6.121:8000/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[REDACTED] 36s Warning Unhealthy pod/rasa-x-duckling-d6f66bcd7-2d4cr Readiness probe failed: Get "http://10.4.6.121:8000/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[REDACTED] 52s Warning Unhealthy pod/rasa-x-rasa-production-89d6b6b7d-bnzbt Liveness probe failed: Get "http://10.4.6.153:5005/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
In my first comment, you can see the configuration we are using to deploy Rasa-X
@tczekajlo to give you the full picture, these are the logs of the pods:
Pod: rasa-x
Starting Rasa X server... 🚀
[2021-07-13 12:53:55 +0000] [9] [INFO] Goin' Fast @ http://0.0.0.0:5002
Pod: db-migration
Starting the database migration service (http)... 🚀
[2021-07-13 12:56:59 +0000] [6] [INFO] Goin' Fast @ http://0.0.0.0:8000
INFO:__main__:Starting the database migration service
[2021-07-13 12:56:59 +0000] [6] [INFO] Starting worker [6]
[2021-07-13 12:57:09 +0000] - (sanic.access)[INFO][10.4.6.120:44408]: GET http://10.4.6.165:8000/health 200 56
[2021-07-13 12:57:11 +0000] - (sanic.access)[INFO][10.4.6.120:44456]: GET http://10.4.6.165:8000/health 200 56
Pod: duckling
Listening on http://0.0.0.0:8000
Pods event-service, rasa-worker, and rasa-production: no logs at all!
I tried another experiment: I removed the livenessProbe and the readinessProbe from the pod deployments. The pods are now not killed by Kubernetes, but they crash on their own.
NAME READY STATUS RESTARTS AGE
rasa-x-db-migration-service-0 1/1 Running 4 11m
rasa-x-duckling-d6f66bcd7-xtv5b 1/1 Running 0 11m
rasa-x-event-service-5979684db-v9xrl 1/1 Running 0 9m21s
rasa-x-rabbitmq-0 1/1 Running 0 26d
rasa-x-rasa-production-89d6b6b7d-s8tg7 1/1 Running 4 11m
rasa-x-rasa-worker-8cf4cb55f-d52tw 1/1 Running 5 11m
rasa-x-rasa-x-579fb6dcfb-t8qsp 1/1 Running 0 9m51s
rasa-x-redis-master-0 2/2 Running 0 6h19m
Note: Redis and RabbitMQ are installed separately. At the bottom are the logs of the pods that are restarting.
Pod: db-migration-service
Starting the database migration service (http)... 🚀
[2021-07-13 13:53:45 +0000] [6] [INFO] Goin' Fast @ http://0.0.0.0:8000
INFO:__main__:Starting the database migration service
[2021-07-13 13:53:45 +0000] [6] [INFO] Starting worker [6]
[2021-07-13 13:53:56 +0000] - (sanic.access)[INFO][10.4.6.120:49072]: GET http://10.4.6.141:8000/health 200 56
[2021-07-13 13:54:02 +0000] - (sanic.access)[INFO][10.4.6.120:49218]: GET http://10.4.6.141:8000/health 200 56
[2021-07-13 13:54:06 +0000] - (sanic.access)[INFO][10.4.6.120:49296]: GET http://10.4.6.141:8000/health 200 56
[2021-07-13 13:54:12 +0000] - (sanic.access)[INFO][10.4.6.120:49462]: GET http://10.4.6.141:8000/health 200 56
[2021-07-13 13:54:16 +0000] - (sanic.access)[INFO][10.4.6.120:49532]: GET http://10.4.6.141:8000/health 200 56
[2021-07-13 13:54:22 +0000] - (sanic.access)[INFO][10.4.6.120:49940]: GET http://10.4.6.141:8000/health 200 56
[2021-07-13 13:54:26 +0000] - (sanic.access)[INFO][10.4.6.120:50166]: GET http://10.4.6.141:8000/health 200 56
[2021-07-13 13:54:32 +0000] - (sanic.access)[INFO][10.4.6.120:50488]: GET http://10.4.6.141:8000/health 200 56
[2021-07-13 13:54:36 +0000] - (sanic.access)[INFO][10.4.6.120:50558]: GET http://10.4.6.141:8000/health 200 56
Pod: rasa-x
Starting Rasa X server... 🚀
[2021-07-13 13:53:06 +0000] [9] [INFO] Goin' Fast @ http://0.0.0.0:5002
WARNING:rasax.community.database.utils:Unable to get database revision heads.
WARNING:rasax.community.database.utils:Unable to get database revision heads.
WARNING:rasax.community.database.utils:Unable to get database revision heads.
WARNING:rasax.community.database.utils:Unable to get database revision heads.
WARNING:rasax.community.database.utils:Unable to get database revision heads.
[2021-07-13 13:55:16 +0000] - (sanic.access)[INFO][10.4.1.189:34978]: GET http://rasa-x-rasa-x.[REDACTED].svc:5002/api/config?token=[REDACTED] 503 40
[2021-07-13 13:55:16 +0000] [21] [INFO] Starting worker [21]
[2021-07-13 13:55:16 +0000] [22] [INFO] Starting worker [22]
[2021-07-13 13:55:16 +0000] [24] [INFO] Starting worker [24]
[2021-07-13 13:55:16 +0000] [20] [INFO] Starting worker [20]
Pods event-service, rasa-worker, and rasa-production: no logs at all!
> Could this have something to do with the fact that we are using an external Redis instance?
Currently, the network policies in the rasa-x-helm chart don't support external services. If you use external services such as Redis, RabbitMQ, and so on and you want to use network policies, you have to create them and add them on your own.
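For example, a self-managed egress policy for an external Redis could look like this (a minimal sketch, assuming the default Redis port 6379; the name and namespace are placeholders for your own setup):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-external-redis-egress  # placeholder name
  namespace: [REDACTED]
spec:
  # applies to all pods in the namespace; narrow the selector if you prefer
  podSelector: {}
  egress:
  - ports:
    - port: 6379
      protocol: TCP
  policyTypes:
  - Egress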
Also, you can set the debugMode parameter to true; then you should see more information in the logs.
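For reference, that is a single chart value, e.g.:

# values.yaml
debugMode: true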
@tczekajlo given that the chart, in theory, supports the use of external Redis and RabbitMQ, when will this problem with the network policies be solved?
It would be enough to allow to configure in some way this label: https://github.com/RasaHQ/rasa-x-helm/blob/1dd6ad168c8e7ce6dd7a3513fcaf939ff584ddac/charts/rasa-x/templates/network-policy.yaml#L284-L285
To work around the problem, I added labels to our Redis deployment to match the labels your policies select on.
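Concretely, that meant adding a label of the kind the chart's policies match to the Redis pod template (shown here as an assumed example; the exact key/value comes from the network-policy template linked above):

# Assumed example: label on the external Redis pods so the chart's
# network policies select them.
metadata:
  labels:
    app.kubernetes.io/component: redis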
So I tried the new version (2.0.2) of the chart and it continues to give problems.
Pods:
rasa-x-db-migration-service-0 1/1 Running 5 13m
rasa-x-duckling-d6f66bcd7-xx4dm 1/1 Running 0 13m
rasa-x-event-service-5b57fc875d-qq7fl 0/1 Running 5 3m41s
rasa-x-rabbitmq-0 1/1 Running 0 28d
rasa-x-rasa-production-788f44854f-fb24x 0/1 CrashLoopBackOff 12 46m
rasa-x-rasa-worker-7787cf478b-dsxhg 0/1 CrashLoopBackOff 12 46m
rasa-x-rasa-x-5c786bdb6d-4glgb 0/1 CrashLoopBackOff 13 46m
rasa-x-redis-master-0 2/2 Running 0 27m
Events:
[REDACTED] 3m43s Warning Unhealthy pod/rasa-x-duckling-d6f66bcd7-xtv5b Liveness probe failed: Get "http://10.4.4.111:8000/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[REDACTED] 3m40s Warning Unhealthy pod/rasa-x-duckling-d6f66bcd7-xtv5b Readiness probe failed: Get "http://10.4.4.111:8000/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[REDACTED] 3m53s Warning Unhealthy pod/rasa-x-duckling-d6f66bcd7-xtv5b Liveness probe failed: Get "http://10.4.4.111:8000/": dial tcp 10.4.4.111:8000: i/o timeout (Client.Timeout exceeded while awaiting headers)
[REDACTED] 33s Warning Unhealthy pod/rasa-x-rasa-worker-7787cf478b-dsxhg Liveness probe failed: Get "http://10.4.5.173:5005/": dial tcp 10.4.5.173:5005: connect: connection refused
[REDACTED] 72s Warning Unhealthy pod/rasa-x-event-service-5b57fc875d-txh59 Liveness probe failed: Get "http://10.4.1.19:5673/health": dial tcp 10.4.1.19:5673: connect: connection refused
[REDACTED] 81s Warning Unhealthy pod/rasa-x-event-service-5b57fc875d-txh59 Readiness probe failed: Get "http://10.4.1.19:5673/health": dial tcp 10.4.1.19:5673: connect: connection refused
[REDACTED] 28s Warning Unhealthy pod/rasa-x-rasa-production-788f44854f-fb24x Liveness probe failed: Get "http://10.4.1.16:5005/": dial tcp 10.4.1.16:5005: connect: connection refused
[REDACTED] 23s Warning Unhealthy pod/rasa-x-rasa-x-5c786bdb6d-4glgb Liveness probe failed: Get "http://10.4.6.152:5002/": dial tcp 10.4.6.152:5002: connect: connection refused
[REDACTED] 32s Warning Unhealthy pod/rasa-x-rasa-x-5c786bdb6d-4glgb Readiness probe failed: Get "http://10.4.6.152:5002/": dial tcp 10.4.6.152:5002: connect: connection refused
[REDACTED] 2s Warning Unhealthy pod/rasa-x-db-migration-service-0 Readiness probe failed: HTTP probe failed with statuscode: 500
[REDACTED] 0s Warning Unhealthy pod/rasa-x-db-migration-service-0 Liveness probe failed: HTTP probe failed with statuscode: 500
One problem that seems evident: with an external Redis installation, the password is not passed because of those conditions. I manually added the environment variable to the deployment:
- name: REDIS_PASSWORD
  valueFrom:
    secretKeyRef:
      key: REDIS_PASSWORD
      name: [REDACTED]
And I removed the readiness and liveness probes to get the pods running. With these changes there has been a positive evolution, even though it still does not work 100%.
At this point db-migration-service started complaining about not being able to connect to postgres.
Process ForkProcess-1:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py", line 2336, in _wrap_pool_connect
    return fn()
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/pool/base.py", line 304, in unique_connection
    return _ConnectionFairy._checkout(self)
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/pool/base.py", line 778, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/pool/base.py", line 495, in checkout
    rec = pool._do_get()
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/pool/impl.py", line 140, in _do_get
    self._dec_overflow()
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
    compat.raise_(
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/util/compat.py", line 182, in raise_
    raise exception
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/pool/impl.py", line 137, in _do_get
    return self._create_connection()
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/pool/base.py", line 309, in _create_connection
    return _ConnectionRecord(self)
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/pool/base.py", line 440, in __init__
    self.__connect(first_connect_check=True)
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/pool/base.py", line 661, in __connect
    pool.logger.debug("Error on connect(): %s", e)
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
    compat.raise_(
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/util/compat.py", line 182, in raise_
    raise exception
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/pool/base.py", line 656, in __connect
    connection = pool._invoke_creator(self)
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/strategies.py", line 114, in connect
    return dialect.connect(*cargs, **cparams)
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/default.py", line 508, in connect
    return self.dbapi.connect(*cargs, **cparams)
  File "/usr/local/lib/python3.8/dist-packages/psycopg2/__init__.py", line 127, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: could not connect to server: Connection timed out
    Is the server running on host "[REDACTED].postgres.database.azure.com" (104.40.169.187) and accepting
    TCP/IP connections on port 5432?
So I had to manually create an egress policy for it to work. This is because your allow-dns-access policy blocks all connections from all pods in the namespace, except for port 53 and whatever specific egress policies allow.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-postgres-access
  namespace: [REDACTED]
spec:
  egress:
  - ports:
    - port: 5432
      protocol: UDP
    - port: 5432
      protocol: TCP
  podSelector: {}
  policyTypes:
  - Egress
After that, db-migration-service completed the process.
Pod: db-migration-service
INFO:__main__:The database migration has finished. DB revision: ['652500998f3e']
[2021-07-15 11:19:18 +0000] - (sanic.access)[INFO][10.4.0.122:39392]: GET http://10.4.1.16:8000/health 200 56
[IDENTICAL LINES CUT]
[2021-07-15 11:56:10 +0000] - (sanic.access)[INFO][10.4.0.122:43312]: GET http://10.4.1.16:8000/health 200 56
The event-service pod is still in CrashLoopBackOff.
Pod: event-service
Check for database migrations completed.
INFO:__main__:Starting event service (standalone: True).
INFO:rasax.community.services.event_consumers.event_consumer:Started Sanic liveness endpoint at port '5673'.
[2021-07-15 14:05:04 +0000] [19] [INFO] Goin' Fast @ http://0.0.0.0:5673
[2021-07-15 14:05:04 +0000] [19] [INFO] Starting worker [19]
Pod: duckling
Listening on http://0.0.0.0:8000
Then RabbitMQ (custom installation) started complaining because it was not able to connect to the K8s API, so I had to create another NP:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-access
  namespace: [REDACTED]
spec:
  egress:
  - ports:
    - port: 443
      protocol: UDP
    - port: 443
      protocol: TCP
  podSelector: {}
  policyTypes:
  - Egress
After that, RabbitMQ started to work properly, and so did rasa-production, rasa-x, and rasa-worker.
rasa-x-db-migration-service-0 1/1 Running 0 173m
rasa-x-duckling-d6f66bcd7-xx4dm 1/1 Running 0 4h10m
rasa-x-event-service-94d585499-4cl66 1/1 Running 0 5m41s
rasa-x-rabbitmq-0 1/1 Running 0 10m
rasa-x-rasa-production-64d8589788-p4kth 1/1 Running 0 5m55s
rasa-x-rasa-worker-5867678fd9-f6c22 1/1 Running 0 5m21s
rasa-x-rasa-x-758d9f5f58-h9vnk 1/1 Running 0 3h52m
rasa-x-redis-master-0 2/2 Running 0 4h24m
Now everything is working correctly (with my workaround), at least so it seems from the logs. We will then try to test the application and leave room for further reports.
To conclude, from our point of view: two network policies are missing, the rasa, event-service, and rasa-x templates must be corrected for that wrong condition, and the correct label is missing to open traffic to Redis.
@tczekajlo a small update related to the tests performed. We found that an additional network policy is missing because pods can't make SSH calls to GitHub to download repositories.
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ssh-access
  namespace: [REDACTED]
spec:
  egress:
  - ports:
    - port: 22
      protocol: UDP
    - port: 22
      protocol: TCP
  podSelector: {}
  policyTypes:
  - Egress
@tmbo @tczekajlo @JustinaPetr I would love to have feedback. ❤️
@pierluigilenoci I'll be able to take a look at it, probably next week.
@tczekajlo did you find the time to take a look at it?
@tczekajlo any update on this?
@tczekajlo it's been 76 days since I opened the issue and 42 days since you said you'd take a look at it. Is there any news?
Maybe @RASADSA or @rgstephens can help out?
@pierluigilenoci if I understand correctly, you run everything under AKS with the Calico CNI. The network policies provided in the Rasa X Helm chart are meant as a blueprint for mapping out your own network policies. On certain setups they work out of the box, but not on all. Since there are a lot of CNIs in the Kubernetes field, and many of them handle network policies differently, it's impossible for us to cover every case and debug it remotely. Fully managed Kubernetes cloud providers in particular are often a black box from a CNI visibility point of view. I recommend disabling the Rasa X network policies and walking through the official Azure AKS network policy documentation (https://docs.microsoft.com/en-us/azure/aks/use-network-policies) to create your own attuned network policies.
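For completeness, disabling the chart-provided policies is a one-line change in the values (the same flag shown enabled earlier in this thread):

networkPolicy:
  enabled: false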
@RASADSA if you read all my comments on the issue, I have explicitly documented all the problems with the current chart. Making a PR to correct them should be simple and, above all, "generalizable": none of the fixes are specific to a particular cloud implementation.
@pierluigilenoci I recommend waiting for @tczekajlo's return, since I'm not into that topic and I don't want to dismiss your effort.
In general, every Kubernetes administrator is accountable for their own network policies, not the Helm chart creator.
https://kubernetes.io/docs/concepts/services-networking/network-policies/
Network policies are implemented by the network plugin. To use network policies, you must be using a networking solution which supports NetworkPolicy. Creating a NetworkPolicy resource without a controller that implements it will have no effect.
Kubernetes' adoption of the CNI standard allows for many different network solutions to exist within the same ecosystem. The diversity of options available means that most users will be able to find a CNI plugin that suits their current needs and deployment environment, while also providing solutions when their circumstances change. Operating requirements vary immensely between organisations, so having a number of mature solutions with different levels of complexity and feature richness helps Kubernetes satisfy unique requirements while still offering a fairly consistent user experience.
In my experience it makes a huge difference for network policies whether you run them on AWS, AKS, GCE, or bare metal. Depending on what kind of security concept you follow, it's hard to generalize network policies across all the different variations of CNI implementations.
Take the Calico CNI as an example (https://docs.projectcalico.org/getting-started/kubernetes/): it can be cloud managed or self managed, etc., and on top of that there are VXLAN, BGP, NodePort traffic, ClusterIP, and LoadBalancer / external services.
A lot of people are already using this Helm chart, and we should be very careful about what we change at the network policy level, since any change would be rolled out on the next Helm chart upgrade run.
@RASADSA I understand the concern about introducing changes to the NPs but, as is customary, just release the chart with an appropriate version bump, a note about breaking changes, and, if you really want to be conservative, a migration guide.
As I wrote, my additions to the NPs are safe and unrelated to any particular implementation. They only open more ports, starting from a baseline of "all traffic blocked to and from the namespace".
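For reference, the baseline I'm describing is the standard default-deny shape (a generic sketch, not taken from the chart):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all  # generic example name
  namespace: [REDACTED]
spec:
  # empty selector: applies to every pod in the namespace
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress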
My idea is that a Helm chart, apart from the configuration/customization of its values, should work without external intervention.
And anyway, I can also understand that you don't care about solving the problem because it doesn't affect you directly.
@RASADSA @tczekajlo any update?
@pierluigilenoci after a longer internal discussion we will not extend the network policies, and we are closing the issue. We don't have the capacity to support network policies for multiple CNIs and make sure that they always work. On the topic of your own network policies out of the box:
@RASADSA I accept your choice but do not agree with it.
I am quite disappointed but certainly not surprised. For me, an open-source project should be managed differently. But these are obviously opinions, and everyone has their own.
@RASADSA should I open a separate issue for the Redis install issue then?
That was also not addressed as pointed out by @pierluigilenoci in https://github.com/RasaHQ/rasa-x-helm/issues/211#issuecomment-880736816
@RASADSA so... in the end, it looks like I was right.
Ref: https://github.com/RasaHQ/rasa-x-helm/pull/275 https://github.com/RasaHQ/rasa-x-helm/pull/282
Hello, we installed Rasa X with this chart and this values.yaml file:
We have standalone RabbitMQ and Redis instances running in the same namespace.
All pods work fine except rasa-x and event-service, which are in CrashLoopBackOff with these error messages:
Logs from the rasa-x pod:
No logs from event-service.
Pods:
rasa-x events:
event-service events:
How to solve it?