featurehub-io / featurehub

FeatureHub - cloud native feature flags, A/B testing and remote configuration service. Real-time streaming feature updates. Provided with Java, JavaScript, React, Python, Go, .Net, Ruby, Android, Swift and Flutter SDKs.
https://www.featurehub.io

Service Not Returning Features After Minikube Cluster Restart #1057

Closed igabba closed 1 year ago

igabba commented 1 year ago

Describe the bug

I am running everything in Minikube and installed it using Helm. I've noticed that every time I restart the cluster, either manually or due to a laptop restart, the service doesn't return any results when I attempt to retrieve the features.

I tried several troubleshooting steps, including restarting the Dacha pods, but with no success. However, I found a workaround: updating some permissions on the service account permissions tab. After doing so, the service started working again. I suspect that this action caused the cache to be repopulated, or something similar.

Interestingly, I don't encounter this issue when using Docker Compose (all-separate-postgres). I've been able to restart it multiple times without any problems.

Which area does this issue belong to?

To Reproduce

Steps to reproduce the behavior:

  1. Install using Helm in Minikube
  2. Configure an app, permissions, etc.
  3. Go to http://{cluster-ip}/features/{api-key}; the list of features is there
  4. Stop the k8s cluster
  5. Start the k8s cluster
  6. Go to http://{cluster-ip}/features/{api-key}; it only shows the events ack and bye
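The difference between steps 3 and 6 can be checked mechanically. The following is a minimal sketch, not part of any FeatureHub SDK: it parses a Server-Sent Events body and reports whether a feature payload arrived. The event name `features` and the sample bodies are assumptions made for illustration; only `ack` and `bye` are named in this report.

```python
def parse_sse(text):
    """Split a Server-Sent Events body into (event_name, data) pairs."""
    events = []
    event, data = None, []
    for line in text.splitlines():
        if line == "":
            # A blank line terminates the current event.
            if event is not None or data:
                events.append((event or "message", "\n".join(data)))
            event, data = None, []
        elif line.startswith(":"):
            continue  # SSE comment line, ignored
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            # Strips leading whitespace after the field name.
            data.append(line[len("data:"):].lstrip())
    if event is not None or data:
        events.append((event or "message", "\n".join(data)))
    return events


def has_feature_payload(text, payload_event="features"):
    """True if the stream contains the (assumed) feature payload event."""
    return any(name == payload_event for name, _ in parse_sse(text))


# Hypothetical response bodies, made up for this sketch.
HEALTHY = "event: ack\ndata: {}\n\nevent: features\ndata: []\n\nevent: bye\ndata: {}\n\n"
BROKEN = "event: ack\ndata: {}\n\nevent: bye\ndata: {}\n\n"
```

With these samples, `has_feature_payload(HEALTHY)` is true and `has_feature_payload(BROKEN)` is false, which mirrors the before and after restart behaviour described in the steps.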

Expected behavior

The features should be there after a cluster restart.

Versions

rvowles commented 1 year ago

which helm install are you using?

igabba commented 1 year ago

Hi, I followed these instructions: https://github.com/featurehub-io/featurehub-helm/tree/main

rvowles commented 1 year ago

Heya - those instructions use Kind which is what we use for testing. When you have the cluster up, and do a

kubectl get all -n <featurehub-namespace> 

what do you get and when you restart the cluster, doing the same, what do you get?

I am not familiar with Minikube, I'm sorry, but this may help us diagnose what has happened with the services.

igabba commented 1 year ago

I get the same.

This one is before restart.

featurehub-without-restart

This one is post cluster restart.

featurehub-post-restart

After that I did a rollout restart of Dacha. Dacha restarted but, again, when I try to get features I only get

image

Let me know if any logs would help. Thanks

rvowles commented 1 year ago

My theory is that the request from Dacha -> MR is timing out, so could you drop the Dacha logs from that restart in here? There is a 12-second timeout (it's in the Helm chart) that can be changed to a higher value if that's the case.

igabba commented 1 year ago

I can see this error in dacha pod:

{"@timestamp":"2023-09-15T21:29:52.468+0000","message":"Failed jersey request","priority":"ERROR","path":"io.featurehub.jersey.config.LocalExceptionMapper","thread":"grizzly-http-server-1","stack_trace":"jakarta.ws.rs.NotFoundException: HTTP 404 Not Found\n\tio.featurehub.dacha2.resource.DachaApiKeyResource.getApiKeyDetails(DachaApiKeyResource.kt:16) ~[dacha2-1.1-SNAPSHOT.jar:?]\n\tjdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]\n\tjdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) ~[?:?]\n\tjdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) ~[?:?]\n\tjava.lang.reflect.Method.invoke(Unknown Source) ~[?:?]\n\torg.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:52) ~[jersey-server-3.1.1.jar:?]\n\torg.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:134) ~[jersey-server-3.1.1.jar:?]\n\torg.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:177) ~[jersey-server-3.1.1.jar:?]\n\torg.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$TypeOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:219) ~[jersey-server-3.1.1.jar:?]\n\torg.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:81) ~[jersey-server-3.1.1.jar:?]\n\torg.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:478) ~[jersey-server-3.1.1.jar:?]\n\torg.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:400) ~[jersey-server-3.1.1.jar:?]\n\torg.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:81) ~[jersey-server-3.1.1.jar:?]\n\torg.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:261) 
[jersey-server-3.1.1.jar:?]\n\torg.glassfish.jersey.internal.Errors$1.call(Errors.java:248) [jersey-common-3.1.1.jar:?]\n\torg.glassfish.jersey.internal.Errors$1.call(Errors.java:244) [jersey-common-3.1.1.jar:?]\n\torg.glassfish.jersey.internal.Errors.process(Errors.java:292) [jersey-common-3.1.1.jar:?]\n\torg.glassfish.jersey.internal.Errors.process(Errors.java:274) [jersey-common-3.1.1.jar:?]\n\torg.glassfish.jersey.internal.Errors.process(Errors.java:244) [jersey-common-3.1.1.jar:?]\n\torg.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:265) [jersey-common-3.1.1.jar:?]\n\torg.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:240) [jersey-server-3.1.1.jar:?]\n\torg.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:697) [jersey-server-3.1.1.jar:?]\n\torg.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpContainer.service(GrizzlyHttpContainer.java:367) [jersey-container-grizzly2-http-3.1.1.jar:?]\n\torg.glassfish.grizzly.http.server.HttpHandler$1.run(HttpHandler.java:190) [grizzly-http-server-4.0.0.jar:4.0.0]\n\torg.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.doWork(AbstractThreadPool.java:535) [grizzly-framework-4.0.0.jar:4.0.0]\n\torg.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.run(AbstractThreadPool.java:515) [grizzly-framework-4.0.0.jar:4.0.0]\n\tjava.lang.Thread.run(Unknown Source) [?:?]","host":"featurehub-dacha-848ffd48f-9l2ht"}

Hope it helps

rvowles commented 1 year ago

We're going to need to turn on trace-level logging to figure this out. In the values.yaml file in the Helm chart, there is a section around line 68 that has an XML comment; if you remove it, i.e.:

<!--
... logging detail
-->

Get rid of the <!-- and --> and also add in the line

<AsyncLogger name="io.featurehub.dacha2" level="trace"/>

Just make sure it is properly indented; I should really change that to be a file import. Then do the same thing: you should see REST traffic incoming from Edge, then outgoing to MR, and then an appropriate response. When you hit it when it's fresh, it will do this. You don't need to undeploy; it has a 30-second timer to rescan the log configuration.

Try just bouncing the Dacha instance rather than restarting the whole cluster first and see if it recurs; if not, follow up by bouncing the cluster.

When it looks for an environment id + service account that it doesn't know about, it will ask MR and then cache it from then on. The implication is that MR is saying it doesn't exist but that seems incongruous.
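The lookup behaviour described above can be pictured with a small sketch. This is not Dacha's actual code: the class name, the `ask_mr` callback, and the returned dictionary are all hypothetical, chosen only to illustrate "ask MR on a miss, then cache".

```python
from typing import Callable, Dict, Optional, Tuple

# (environment_id, service_account_key); both names are illustrative.
Key = Tuple[str, str]


class ApiKeyCache:
    """Read-through cache: unknown keys are fetched once from MR."""

    def __init__(self, ask_mr: Callable[[str, str], Optional[dict]]):
        # ask_mr stands in for the REST call to the Management Repository.
        # It returns None when MR cannot supply the key.
        self._ask_mr = ask_mr
        self._cache: Dict[Key, dict] = {}

    def lookup(self, env_id: str, service_key: str) -> Optional[dict]:
        key = (env_id, service_key)
        if key in self._cache:
            return self._cache[key]  # served from cache, MR not contacted
        details = self._ask_mr(env_id, service_key)
        if details is not None:
            self._cache[key] = details  # cache only successful answers
        # A None result is what a caller would turn into an HTTP 404.
        return details
```

Under this model, a freshly restarted cache is empty, so every first request depends on the call to MR succeeding. If that call fails or returns nothing, the caller sees a 404, which matches the `NotFoundException` in the Dacha log above.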

Thanks for persevering in this!

igabba commented 1 year ago

Well, I've modified these lines and I get some additional info. I'll attach logs (only the relevant parts) from Dacha and Edge.

edge-logs.txt trace-dacha2.txt

igabba commented 1 year ago

Well, I realized that if I add port 8701 in management-repository/service.yaml, everything works.

image

Is there something I'm missing in the installation steps?

rvowles commented 1 year ago

It's part of the definition of the Management Repository deployment.yaml file:

image

rvowles commented 1 year ago

It is here: https://github.com/featurehub-io/featurehub-helm/blob/main/helm/featurehub/templates/management-repository/service.yaml#L15

I expect that's because Prometheus is not enabled, ergo it doesn't make that port available on the service, which is causing the problem. Seems weird it worked in the first place for you then. I'd better correct that.
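One way to confirm whether a port is actually reachable from another pod is a plain TCP probe. The sketch below uses only the Python standard library. The host name in the usage note is a placeholder, not the chart's real service name; port 8701 is the one mentioned in this thread.

```python
import socket


def port_is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

Run from a pod in the same namespace, `port_is_reachable("<management-repository-service>", 8701)` returning false would be consistent with the service not exposing that port.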

igabba commented 1 year ago

Hello. Indeed, I saw that configuration in service.yaml, but when I enabled Prometheus it gave me an error because I don't have it installed. Therefore I added another port by hand and it worked. If I haven't misunderstood the architecture, it may have worked for me at first because Dacha receives notifications from NATS directly, right? However, once I restart the cluster, it will try to go to the management-repository to populate the cache, and I didn't have the port enabled there. Could that be it?

rvowles commented 1 year ago

Yeah, that makes sense - that's likely what it is!

rvowles commented 1 year ago

I've released 4.0.4 now and that should fix it. Just waiting for ArtifactHub to pick it up.

rvowles commented 1 year ago

OK, 1.6.3 has been released and the chart is on this version, so we should be all good :-)