Heartbeat monitoring broken for camel web service 0.2.7+

rlratcliffe commented 1 year ago

Reference to related issue in charts repo.

If users upgrade to 0.2.7 of the Tavros helm chart, heartbeat monitoring will break. The way the virtual port was re-exposed caused duplicate listener issues that badly hurt pod uptime, so it needs to be removed. However, since heartbeat runs in the elastic-system namespace, it is outside of the prod and sandbox meshes. Heartbeat needs to be able to reach the actuator port of the service (example: http://api-test-camel-web-service.dev.svc.cluster.local:8080/actuator/health/liveness) within both the prod and sandbox meshes.

Some ideas from the folks at Kong/Kuma:

gateway for each mesh
deploy heartbeat inside of each mesh and then somehow pull in the results back into the elastic-system namespace (maybe as ExternalService?)

rlratcliffe commented 1 year ago

Acceptance tests:

Given an API and a curl shell deployed in each namespace (dev, test, prod):

Namespace	Test	Expected result
PROD shell	curl 'http://api-test-camel-web-service.prod.svc.cluster.local:8080/actuator/health/liveness'	{"status":"UP"}
PROD shell	curl 'http://api-test-camel-web-service.dev.svc.cluster.local:8080/actuator/health/liveness'	Empty reply from server
PROD shell	curl 'http://api-test-camel-web-service.test.svc.cluster.local:8080/actuator/health/liveness'	Empty reply from server
DEV shell	curl 'http://api-test-camel-web-service.prod.svc.cluster.local:8080/actuator/health/liveness'	Empty reply from server
DEV shell	curl 'http://api-test-camel-web-service.dev.svc.cluster.local:8080/actuator/health/liveness'	{"status":"UP"}
DEV shell	curl 'http://api-test-camel-web-service.test.svc.cluster.local:8080/actuator/health/liveness'	{"status":"UP"}
TEST shell	curl 'http://api-test-camel-web-service.prod.svc.cluster.local:8080/actuator/health/liveness'	Empty reply from server
TEST shell	curl 'http://api-test-camel-web-service.dev.svc.cluster.local:8080/actuator/health/liveness'	{"status":"UP"}
TEST shell	curl 'http://api-test-camel-web-service.test.svc.cluster.local:8080/actuator/health/liveness'	{"status":"UP"}

&

Test	Expected result	Actual result	PASS/FAIL
Log into Kibana -> Observability -> Uptime	Uptime monitoring works for prod pods
Log into Kibana -> Observability -> Uptime	Uptime monitoring works for dev pods
Log into Kibana -> Observability -> Uptime	Uptime monitoring works for test pods
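
For anyone who wants to run the curl matrix from the first table without pasting each command, here is a rough Python sketch. It assumes dev and test share the sandbox mesh and prod sits alone in the prod mesh (as the expected results imply), and it treats any failure to get {"status":"UP"} as "not reachable", so it cannot tell "Empty reply from server" apart from other errors. The function names are made up for illustration.

```python
import http.client
import urllib.request

NAMESPACES = ("dev", "test", "prod")

# Assumption taken from the expected results: dev and test share one mesh.
MESH_OF = {"dev": "sandbox", "test": "sandbox", "prod": "prod"}


def liveness_url(namespace, service="api-test-camel-web-service", port=8080):
    """Build the actuator liveness URL used in the acceptance table."""
    return (f"http://{service}.{namespace}.svc.cluster.local:{port}"
            "/actuator/health/liveness")


def expected_reachable(shell_ns, target_ns):
    """A shell should only reach services that live in its own mesh."""
    return MESH_OF[shell_ns] == MESH_OF[target_ns]


def probe(url, timeout=5.0):
    """Return True only if the endpoint answers with status UP.

    Connection errors, empty replies and HTTP errors all count as False.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return b'"status":"UP"' in resp.read()
    except (OSError, http.client.HTTPException):
        return False


def check_from(shell_ns):
    """Probe every namespace from the given shell and compare to the table."""
    rows = []
    for target_ns in NAMESPACES:
        url = liveness_url(target_ns)
        want = expected_reachable(shell_ns, target_ns)
        got = probe(url)
        rows.append((url, want, got, "PASS" if want == got else "FAIL"))
    return rows
```

Calling check_from("prod") from the prod curl shell (and likewise "dev" and "test" from theirs) gives one PASS/FAIL row per target namespace.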

jam01 commented 1 year ago

Hey @rlratcliffe I believe the kong DPs are already configured to be gateways. Perhaps there's a good way to create a route/service that proxies the request to the probes (wonder if the host can be dynamic). Then possibly make that service only internal to the cluster, maybe IP whitelisting...?

The other solution may also be possible, though I don't exactly remember how/where the heartbeat component runs

rlratcliffe commented 1 year ago

hey @jam01 decided to go with a totally different solution for now, as it seemed easier/safer to configure this way, which is to create 3 different heartbeat instances:

dev instance with a sidecar with the sandbox mesh
test instance with a sidecar with the sandbox mesh
prod instance with a sidecar with the prod mesh

this way heartbeat stays inside of the same namespace and each instance looks only at the pods in its own target namespace, so there are no conflicts between the instances. they don't seem to take up too many resources, although I only have 1 API in my test cluster. I've done a lot of tests in my personal cluster and it seems ok. the person I talked to in the kuma slack thought it was an ok approach. I'll create a PR at some point, although #102 would need to be merged first.
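
To make the layout concrete, here is a small Python sketch of the three-instance plan. It is only an illustration, not what the PR does: the instance names are hypothetical, and it assumes the sidecar's mesh is selected with Kuma's `kuma.io/mesh` annotation on the pod template (how injection itself gets enabled depends on the Kuma version and chart setup).

```python
from dataclasses import dataclass

# All instances stay in this namespace, per the plan above.
DEPLOY_NAMESPACE = "elastic-system"


@dataclass(frozen=True)
class HeartbeatInstance:
    name: str              # hypothetical instance name
    watched_namespace: str  # the only namespace this instance monitors
    mesh: str              # mesh that the instance's sidecar joins


INSTANCES = [
    HeartbeatInstance("heartbeat-dev", "dev", "sandbox"),
    HeartbeatInstance("heartbeat-test", "test", "sandbox"),
    HeartbeatInstance("heartbeat-prod", "prod", "prod"),
]


def pod_annotations(instance):
    """Annotations for the instance's pod template (assumed mechanism)."""
    return {"kuma.io/mesh": instance.mesh}


def describe(instance):
    """One-line summary of where an instance runs and what it watches."""
    return (f"{instance.name}: runs in {DEPLOY_NAMESPACE}, "
            f"watches {instance.watched_namespace}, "
            f"sidecar in {instance.mesh} mesh")
```

Since each instance watches exactly one namespace, no two instances report on the same pods.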

rlratcliffe commented 1 year ago

I will make both 0.2.7 and 0.2.8 chart releases pre-releases with notes related to this issue in the meantime.

jam01 commented 1 year ago

Don't quite remember the distinction between dev and test... But yeah, if the resources taken by the sidecars are not significant then no worries, though they may become significant if there are thousands of pods.

Though if the deployments work as sidecars, that means they're deployed in namespaces other than elastic... Which means that daemonsets in the sandbox and production namespaces could also work somehow.

Either way, you obviously already have a functional solution :)

rlratcliffe commented 1 year ago

thanks for chiming in :)

might not be an important distinction, but my understanding is the heartbeat instances are still in elastic-system. it's similar, I think, to how the kong namespace doesn't have a mesh defined, but the prod and sandbox releases each have sidecars, so each release can communicate with the necessary prod/dev/test namespaces just by defining the mesh per sidecar.

jam01 commented 1 year ago

Hmm what I'm getting is that the HB CR is in the elastic ns, but that the sidecars are injected in each app's pod. Is that correct? If that's correct then that means the HB agent can be outside of the elastic ns and send the beats to the collector in the elastic ns

rlratcliffe commented 1 year ago

decided to stick with the stated plan for now.

PR ready: https://github.com/MS3Inc/tavros/pull/104

MS3Inc / tavros

Heartbeat monitoring broken for camel web service 0.2.7+ #103