MS3Inc / tavros

A modern and modular integration platform composed of best-of-breed open-source components.
Apache License 2.0

Heartbeat monitoring broken for camel web service 0.2.7+ #103

Open rlratcliffe opened 11 months ago

rlratcliffe commented 11 months ago

Reference to related issue in charts repo.

If users upgrade to 0.2.7 of the Tavros Helm chart, heartbeat monitoring will be broken. The way the virtual port was re-exposed caused duplicate listener issues, which in turn caused a lot of problems with pod uptime, so it needs to be removed. However, since heartbeat is inside the elastic-system namespace, it is outside of the prod and sandbox meshes. Heartbeat needs to be able to access the actuator port of the service (example: http://api-test-camel-web-service.dev.svc.cluster.local:8080/actuator/health/liveness) within both the prod and sandbox meshes.
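
To make the requirement concrete, here is a rough Python sketch of the kind of check Heartbeat has to be able to perform from inside a mesh. It is an illustration only, not how Heartbeat itself is implemented: `liveness_url`, `is_up`, and `probe` are hypothetical helper names, and the URL layout, port 8080, and the `{"status":"UP"}` body are taken from the example in this issue.

```python
import json
import urllib.request


def liveness_url(service: str, namespace: str, port: int = 8080) -> str:
    """Build the in-cluster liveness URL, following the example in this issue."""
    return (
        f"http://{service}.{namespace}.svc.cluster.local:{port}"
        "/actuator/health/liveness"
    )


def is_up(body: str) -> bool:
    """Return True when the body is a JSON object whose status is 'UP'."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("status") == "UP"


def probe(url: str, timeout: float = 5.0) -> bool:
    """Fetch the URL and report liveness.

    OSError-based failures (URLError, refused or dropped connections)
    are treated as "down".
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return is_up(response.read().decode("utf-8", errors="replace"))
    except OSError:
        return False
```

Running `probe(liveness_url("api-test-camel-web-service", "dev"))` from a pod outside the sandbox mesh would be expected to return `False` while this issue is present.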

Some ideas from the folks at Kong/Kuma:

rlratcliffe commented 11 months ago

Acceptance tests:

Given an API and a curl shell deployed in each namespace (dev, test, prod):

| Namespace | Test | Expected result | Actual result | PASS/FAIL |
| --- | --- | --- | --- | --- |
| PROD shell | curl 'http://api-test-camel-web-service.prod.svc.cluster.local:8080/actuator/health/liveness' | {"status":"UP"} | | |
| PROD shell | curl 'http://api-test-camel-web-service.dev.svc.cluster.local:8080/actuator/health/liveness'<br>curl 'http://api-test-camel-web-service.test.svc.cluster.local:8080/actuator/health/liveness' | Empty reply from server | | |
| DEV shell | curl 'http://api-test-camel-web-service.prod.svc.cluster.local:8080/actuator/health/liveness' | Empty reply from server | | |
| DEV shell | curl 'http://api-test-camel-web-service.dev.svc.cluster.local:8080/actuator/health/liveness'<br>curl 'http://api-test-camel-web-service.test.svc.cluster.local:8080/actuator/health/liveness' | {"status":"UP"} | | |
| TEST shell | curl 'http://api-test-camel-web-service.prod.svc.cluster.local:8080/actuator/health/liveness' | Empty reply from server | | |
| TEST shell | curl 'http://api-test-camel-web-service.dev.svc.cluster.local:8080/actuator/health/liveness'<br>curl 'http://api-test-camel-web-service.test.svc.cluster.local:8080/actuator/health/liveness' | {"status":"UP"} | | |
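
The expected results in the curl tests above follow one rule: a request succeeds only when the shell and the API sit in the same mesh. A minimal Python sketch of that rule is below. It assumes dev and test both belong to the sandbox mesh and prod has its own mesh, which is inferred from the expected results rather than stated outright; `MESH_OF`, `expected_result`, and `acceptance_matrix` are hypothetical names.

```python
# Assumption inferred from the expected results: dev and test share the
# sandbox mesh, while prod is in its own mesh.
MESH_OF = {"prod": "prod", "dev": "sandbox", "test": "sandbox"}


def expected_result(source_ns: str, target_ns: str) -> str:
    """Expected curl outcome from a shell in source_ns to the API in target_ns."""
    if MESH_OF[source_ns] == MESH_OF[target_ns]:
        return '{"status":"UP"}'
    return "Empty reply from server"


def acceptance_matrix() -> list:
    """All (source, target, expected) combinations across the three namespaces."""
    return [(s, t, expected_result(s, t)) for s in MESH_OF for t in MESH_OF]
```

Printing `acceptance_matrix()` gives the nine source/target combinations, which can be compared against the actual curl output when filling in the PASS/FAIL column.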

&

| Test | Expected result | Actual result | PASS/FAIL |
| --- | --- | --- | --- |
| Log into Kibana -> Observability -> Uptime | Uptime monitoring works for prod pods | | |
| Log into Kibana -> Observability -> Uptime | Uptime monitoring works for dev pods | | |
| Log into Kibana -> Observability -> Uptime | Uptime monitoring works for test pods | | |

jam01 commented 10 months ago

Hey @rlratcliffe I believe the kong DPs are already configured to be gateways. Perhaps there's a good way to create a route/service that proxies the request to the probes (wonder if the host can be dynamic). Then possibly make that service only internal to the cluster, maybe IP whitelisting...?

The other solution may also be possible, though I don't exactly remember how/where the heartbeat component runs

rlratcliffe commented 10 months ago

hey @jam01 decided to go with a totally different solution for now, as it seemed easier/safer to configure this way, which is to create 3 different heartbeat instances:

this way heartbeat stays inside the same namespace and each instance looks only at the pods in its own namespace, so there are no conflicts between the instances. they don't seem to take up too many resources, although I only have 1 API in my test cluster. I've done a lot of tests in my personal cluster and it seems ok. the person I talked to in the kuma slack thought it was an ok approach. I'll create a PR at some point, although #102 would need to be merged first.
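
a rough Python sketch of the "one instance per namespace" idea, for illustration only. the actual instance definitions aren't listed in this comment, so the `heartbeat-<namespace>` names, the `NAMESPACES` tuple, and `plan_instances` are all hypothetical; the only thing taken from the thread is that each instance watches the liveness URLs of a single namespace.

```python
# Assumption: one instance per namespace (dev, test, prod).
NAMESPACES = ("dev", "test", "prod")


def plan_instances(services):
    """Map each hypothetical instance name to the liveness URLs it watches.

    `services` is an iterable of (service_name, namespace) pairs.
    """
    plan = {f"heartbeat-{ns}": [] for ns in NAMESPACES}
    for name, ns in services:
        if ns not in NAMESPACES:
            raise ValueError(f"unknown namespace: {ns}")
        # Each URL goes only to the instance for its own namespace.
        plan[f"heartbeat-{ns}"].append(
            f"http://{name}.{ns}.svc.cluster.local:8080/actuator/health/liveness"
        )
    return plan
```

because every URL lands in exactly one instance's list, no two instances ever monitor the same pod, which is the "no conflicts" property described above.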

rlratcliffe commented 10 months ago

I will make both 0.2.7 and 0.2.8 chart releases pre-releases with notes related to this issue in the meantime.

jam01 commented 10 months ago

Don't quite remember the distinction between dev and test... But yeah, if the resources taken by the sidecars are not significant then no worries, though it may become significant if there are thousands of pods.

Though if the deployments work as sidecars, that means they're deployed in namespaces other than elastic... Which means that daemonsets in the sandbox and production namespaces could also work somehow.

Either way, you obviously already have a functional solution :)

rlratcliffe commented 10 months ago

thanks for chiming in :)

might not be an important distinction, but my understanding is the heartbeat instances are still in elastic-system. it's similar, I think, to how in the kong namespace there isn't a mesh defined, but the prod and sandbox releases each have sidecars, and so each release can communicate with the necessary prod/dev/test namespaces just by defining the mesh per sidecar.

jam01 commented 10 months ago

Hmm what I'm getting is that the HB CR is in the elastic ns, but that the sidecars are injected in each app's pod. Is that correct? If that's correct then that means the HB agent can be outside of the elastic ns and send the beats to the collector in the elastic ns

-------- Original Message -------- On Oct 16, 2023, 8:53 PM, Rob Ratcliffe wrote:

thanks for chiming in :)

might not be an important distinction but just for reference, my understanding is the heartbeat instances are still in elastic-system. it's similar, I think, to how in the kong namespace there isn't a mesh defined, but prod and sandbox releases each have sidecars and so each release can communicate with the necessary prod/dev/test namespaces. just by defining the mesh per sidecar.


rlratcliffe commented 10 months ago

decided to stick with the stated plan for now.

PR ready: https://github.com/MS3Inc/tavros/pull/104