Readiness reported as success when FeatureHub is down

amandam11 commented 1 year ago

Describe the bug If my client code successfully manages to connect to FeatureHub at startup, then readiness is reported as Ready. I have added both a timer to regular check readiness, and also registered a ReadinessListener. However, if I stop featurehub then the client still always report that the readyness is true.

Which area does this issue belong to?

[X] SDK
[ ] SDK examples
[ ] Documentation
[ ] Other

To Reproduce Steps to reproduce the behavior:

Create a client using EdgeFeatureConfig, add a readiness listener, and also a timer that regularly calls config.readiness. We have trace mapped to a console.verbose for this test, so we can see what messages we get on connection failure.

The readyness listener has a simple log statement to report the readyness, eg.. console.log('Raw listener readyness is now ' + readiness);

The timer also reports readyness, and simply has:

const newReadiness = config.readyness;
console.log('Readyness is now ' + newReadiness);

Wait for client to report that the featurehub listener is ready, and we therefore see logging of:

LOG Raw listener readyness is now NotReady
LOG listening at ,http://featurehub:8085/features/<key>
VERBOSE received ack,[object Object]
VERBOSE received features,[object Object]
LOG Raw listener readyness is now Ready
LOG Readyness is now Ready

Whereby the VERBOSE log is from the fhLog trace, and the readynesslistener is reporting the NotReady and Ready, and the timer then goes off and sees ready.

Leave FeatureHub up for a while and we see logs showing the regular stale connection check and readyness reported as Ready

LOG Readyness is now Ready
VERBOSE received bye,[object Object]
VERBOSE  refreshing connection in case of staleness,[object Object]
LOG Readyness is now Ready
VERBOSE received ack,[object Object]
LOG Readyness is now Ready

Stop FeatureHub, and we see from logs that the readyness listener is never called, and readyness is still reported as true by the timer that is regularly going off.


VERBOSE received bye,[object Object]
VERBOSE refreshing connection in case of staleness,[object Object]
LOG Readyness is now Ready
LOG Readyness is now Ready
VERBOSE refreshing connection in case of staleness,[object Object]
LOG Readyness is now Ready
LOG Readiness is now Ready
VERBOSE refreshing connection in case of staleness,[object Object]

It never seems to discover that FeatureHub is down, as FeatureHub can be left down for a long time, and still we see this cycle.

6. When FeatureHub is restarted we see the connection ack from the trace and readyness still reported as true by the timer.

VERBOSE received ack,[object Object] VERBOSE received features,[object Object] LOG Readyness is now Ready


**Expected behavior**
When FeatureHub is not running then we expect that the readinesslistener is called with NotReady, and that the readiness flag if called would report true.

**Screenshots**
If applicable, add screenshots to help explain your problem.

**Versions**
- FeatureHub version [1.6.0]
- SDK version [1.3.0]
- OS: [Docker image featurehub/party-server, Docker version 23.0.6, Ubuntu 20.04]
- Browser [Chrome]

You can get the version of the FeatureHub container by running ```docker ps``` command.
Please include the OS and what version of the OS and Docker you're running.

**Additional context**
Putting in some extra debug then there is different detailed in the `refreshing connection in case of staleness` in the case of when featurehub is down.

When it is the standard refresh then the args on the refresh are:

[{"type":"error"}]


but when featurehub is down we see:

[{"type":"error","message":"getaddrinfo EAI_AGAIN featurehub"}]


where featurehub is the name of the dockercontainer that I was running the featurehub party/server in.

rvowles commented 1 year ago

HI Amanda, questions like these are good because then I can add extra documentation to the "How the SDK works" docs :-)

The docs for repository readiness are here: https://docs.featurehub.io/featurehub/1.7.0/sdks-development.html#_repository_readiness

Because the repository is a cache, which depending on how the connector connects and updates - may be out of date for a period of time, its important that it stays Ready once it has received its first set of features. So once it moves from Not Ready -> Ready, it shouldn't ever (except in one circumstance) move back to Not Ready. It would only change to 'Failed".

The situation whereby it will return to not ready is when you are using a server evaluated key and you change the context and "build" - then the repository does not reflect in any real sense the right feature values, so it swaps back to NotReady.

Its quite important particularly for node servers not to stop being ready otherwise if your FeatureHub install goes down, then all your server apps would go down as well, and if your FeatureHub install is down, its not updating any features so its not really a problem there is no connection.

amandam11 commented 1 year ago

HI Amanda, questions like these are good because then I can add extra documentation to the "How the SDK works" docs :-)

The docs for repository readiness are here: https://docs.featurehub.io/featurehub/1.7.0/sdks-development.html#_repository_readiness

Because the repository is a cache, which depending on how the connector connects and updates - may be out of date for a period of time, its important that it stays Ready once it has received its first set of features. So once it moves from Not Ready -> Ready, it shouldn't ever (except in one circumstance) move back to Not Ready. It would only change to 'Failed".

The situation whereby it will return to not ready is when you are using a server evaluated key and you change the context and "build" - then the repository does not reflect in any real sense the right feature values, so it swaps back to NotReady.

Its quite important particularly for node servers not to stop being ready otherwise if your FeatureHub install goes down, then all your server apps would go down as well, and if your FeatureHub install is down, its not updating any features so its not really a problem there is no connection.

Thank you for your explanation, and quick response. Sorry I missed that in the docs. I misunderstood from reading the SDK README (https://github.com/featurehub-io/featurehub-javascript-sdk#what-is-the-deal-with-readyness), so a link from the README to the page you suggested above might make it clearer for others to avoid confusion. It would be nice if from the logs we could tell that we were using cached features. For example if there was a network issue that meant that we couldn't reach FeatureHub, but FeatureHub was up, at the moment we wouldn't realise this without having featurehub trace on. Or a way for us to test to see if we have an active connection, that we can then log. So that in case of a network issue we can see that.

rvowles commented 1 year ago

If I exposed the Edge connector as a queryable status for a prometheus metric, would that be more useful? Or should we do both? I am thinking of looking for an env var that would trigger logging connection state?

rvowles commented 1 year ago

Oh, and by the way, you didn't miss it in the docs :-) I wrote it to clarify how the readiness works essentially as a FAQ and for any new SDKs, we keep adding to the SDKs docs as we build new SDKs.

Would maybe a different FHLog level for Edge connector issues be more suitable then you can control it? Suggestions are welcome!

amandam11 commented 1 year ago

Oh, and by the way, you didn't miss it in the docs :-) I wrote it to clarify how the readiness works essentially as a FAQ and for any new SDKs, we keep adding to the SDKs docs as we build new SDKs.

Great addition to the docs then -> it's nice and clear!

What we've ended up doing is having to pick out the logs whereby we see "refreshing connection in case of staleness" AND it has error.message - as that seems to be the case whereby we've actually got a problem connecting, compared to the "refreshing connection in case of staleness" which just has {"type":"error"} as we didn't want to also log all the standard refresh connection messages. Then to know we've got it back then we need to find the ack messages, but again to not log all the standard stale connection re-establishment, we couldn't distinguish between re-establishment due to just the standard drop/connect, and a re-establishment based on actually having a problem with the connection. So we ended up having to keep track of whether we'd previously had an error connection, so keeping state in our abstraction layer.

We ended up, deciding logging was enough - so some easier way of picking up the logs would make it easier, so if those were at a new log-level that would make it easier. Ideally if it was easier to find out if we were recovering a previously failed connection, compared to a just refresh from staleness would be great. But I'm not sure if actually there really isn't any difference, as you only notice the failure because you are refreshing the connection - so essentially the logic really is the same in both cases.

rvowles commented 1 year ago

Ok, I'll have a think about how I can better surface this as SSE is often quite tricky to ascertain if you got an actual error or the connection dropped because of staleness as you say. Maybe a threshold of failure? like if you get a reconnect failure X times in a row we should start logging it?

chkp-roica commented 5 months ago

Hey, Somewhat related to this, is it possible to identify an issue after connectivity was established? For example, say there's a networking issue to FeatureHub server and not the server itself. (BTW, I'm using the node.js SDK).

I ran a local node.js app that successfully connected to FH server, and then I just disconnected my LAN cable to see if some kind of error will appear, but didn't get anything in the logs nor did the readiness status change.

I think it's important because the FH server might still be up&running allowing users to still change flags but the app itself doesn't get updates.

rvowles commented 5 months ago

Yes, it is probably worthwhile to add a callback when there is any kind of connectivity issue. There are some changes coming down the line around how SSE and polling are handled as well that have been trialled in a couple of other SDKs so we will be sure to add this capability in.

featurehub-io / featurehub-javascript-sdk

Readiness reported as success when FeatureHub is down #160