Open jdpgrailsdev opened 1 year ago
@timroes FE questions:
@davinchia The UI hits this endpoint every 20s, and if it fails 3 times it will simply show a "server not reachable" toast notification at the bottom. It does not pause queries or limit anything else. I'm honestly not sure that's still something we should be doing at all. The user will notice the server is not reachable anyway once their next actual request fails, and seeing that banner usually won't prevent them from continuing to use the app and running into other erroneous requests. I don't, however, have the background on why this check was built in the first place, or whether there's a reason it is still needed.
Examples of usage of Micronaut caching:
https://github.com/airbytehq/airbyte-platform-internal/blob/master/oss/airbyte-config/init/src/main/java/io/airbyte/config/init/RemoteDefinitionsProvider.java#L40
https://github.com/airbytehq/airbyte-platform-internal/blob/master/oss/airbyte-config/init/src/main/java/io/airbyte/config/init/RemoteDefinitionsProvider.java#L113
More details here
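For anyone without access to those internal files, here is a minimal sketch of Micronaut's declarative caching pattern they illustrate; the class, method, and cache names below are made up for illustration, not taken from RemoteDefinitionsProvider:

```java
import io.micronaut.cache.annotation.CacheConfig;
import io.micronaut.cache.annotation.Cacheable;
import jakarta.inject.Singleton;

// Illustrative only - names do not come from the linked internal code.
@Singleton
@CacheConfig("remote-definitions")
public class CachedDefinitionsClient {

  // First call per definitionId performs the expensive lookup; repeat calls
  // within the cache's TTL are served from the in-memory cache.
  @Cacheable
  public String getDefinition(final String definitionId) {
    return fetchFromRemote(definitionId);
  }

  private String fetchFromRemote(final String definitionId) {
    // Placeholder for the expensive remote/DB lookup being cached.
    return "definition-for-" + definitionId;
  }
}
```

The cache itself (backend, TTL, size) is configured separately, e.g. with the Caffeine provider via micronaut.caches.remote-definitions.expire-after-write in application.yml.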
@cgardens do you remember why the webapp checks the health endpoint? Is it so OSS users know when to restart a 'stuck' OSS server?
Considering getting rid of this as it creates non-trivial DB load. Gathering context now.
We just merged a PR that removes the health check in the FE for cloud: https://github.com/airbytehq/airbyte-platform-internal/pull/6694 Once that is deployed I hope we'll see way less load on this API.
My context:
Yes, the OSS PR was partially what sparked this conversation.
Some history - Justen's last comment describes something we ran into a year ago on Cloud. We had a connection leak and the server's connection pool was entirely used up, leaving it unresponsive to the UI and jobs. This caused a fairly big P0. Back then, the health endpoint did not check the database and only returned a 200 once the server had started up. The simplest fix was to make the health endpoint include the database. We have not touched it since.
Although I agree with the general principle of readiness checks not including downstream dependencies, I don't think it applies as much here. That argument is about preventing downtime amplification across the system and is more relevant in a large microservices architecture. We only have one layer, so this effect is minimal. Further, the server does require the database as a hard dependency as currently architected. We could make it return a 503 instead of being removed from service; however, I feel there is little value there.
Agree on the point of separating out the readiness/liveness endpoints. I would say that healthy = readiness, so I would push for them to be equivalent.
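To make that split concrete, here is a rough sketch of separate liveness and readiness endpoints where only readiness touches the database. The controller, paths, and DSLContext wiring are hypothetical; the actual endpoint in airbyte-server may differ.

```java
import io.micronaut.http.HttpResponse;
import io.micronaut.http.HttpStatus;
import io.micronaut.http.annotation.Controller;
import io.micronaut.http.annotation.Get;
import org.jooq.DSLContext;

// Hypothetical sketch of the readiness/liveness separation discussed above.
@Controller("/api/v1/health")
public class SplitHealthController {

  private final DSLContext dslContext;

  public SplitHealthController(final DSLContext dslContext) {
    this.dslContext = dslContext;
  }

  // Liveness: the process is up and serving HTTP; no downstream checks.
  @Get("/liveness")
  public HttpResponse<String> liveness() {
    return HttpResponse.ok("alive");
  }

  // Readiness (= "healthy" in the comment above): includes the database,
  // since the server cannot do useful work without it.
  @Get("/readiness")
  public HttpResponse<String> readiness() {
    try {
      dslContext.execute("SELECT 1");
      return HttpResponse.ok("ready");
    } catch (final Exception e) {
      return HttpResponse.<String>status(HttpStatus.SERVICE_UNAVAILABLE).body("database not reachable");
    }
  }
}
```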
Hey folks! Author of airbytehq/airbyte-platform#221 here
Thank you @jdpgrailsdev and @davinchia for giving this some attention. I've personally had outages get worse because I was visiting the Console to find out what was going on and the constant stream of health checks from the UI slowed the DB down even further. I think it's a great idea to re-evaluate whether these checks are necessary and/or speed them up if so.
With regards to the OSS PR (i.e. what should be tested in each probe) - I think it might be helpful to put aside what the documented best practices are, and even what it means for Airbyte to be "ready", and instead focus on what the outcomes will be when probe failures happen. I hope the following can spark some discussion around whether these probes, as currently configured, will have the desired effect when they start to fail.
readinessProbe

- When a pod is NotReady, it is removed from the Service as a target - i.e. it will no longer receive traffic from a loadbalancer.
- Should all pods with a readinessProbe in a deployment become NotReady, the entire Service won't respond to requests at all - since it will have no members.
- If the database becomes unavailable, all airbyte-server instances will fail their readinessProbe checks, because the database will be down or latent for all airbyte-server instances at the same time.
- When instances fail their readinessProbes they will stop receiving traffic (but remain running).
- Once all airbyte-server instances are down, the API will stop responding.
- airbyte-webapp also relies on the airbyte-server's health checks via /api/v1/health - airbyte-webapp will fail its readinessProbe checks.
- Once all airbyte-webapp instances are down, the console is down -- there will be no opportunity to load any more html/scripts/etc...

Is my analysis above correct? (I can't see the internal PRs linked above, so maybe some of this is wrong.)
If so, are the above outcomes desirable? I'm still of the opinion that a dependent service shouldn't necessarily take itself offline when a dependency is unavailable, since in doing so it gives up the ability to return useful error messages and to apply back-pressure to downstream clients. I'd much rather an API call return something like 503 - Service Unavailable: Database not reachable than get a 502 - Bad Gateway from the LB and have to figure out what went wrong.
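One way to express that preference in Micronaut terms is an exception handler that keeps the pod in service but maps database failures on normal API calls to an explicit 503. This is a sketch of the idea only - the exception type and wiring are assumptions, not how airbyte-server currently behaves.

```java
import io.micronaut.http.HttpRequest;
import io.micronaut.http.HttpResponse;
import io.micronaut.http.HttpStatus;
import io.micronaut.http.annotation.Produces;
import io.micronaut.http.server.exceptions.ExceptionHandler;
import jakarta.inject.Singleton;
import org.jooq.exception.DataAccessException;

// Hypothetical: rather than the pod going NotReady (and the LB answering
// 502 - Bad Gateway), API calls that hit a broken database connection are
// translated into a 503 with a human-readable cause.
@Produces
@Singleton
public class DatabaseUnavailableHandler
    implements ExceptionHandler<DataAccessException, HttpResponse<String>> {

  @Override
  public HttpResponse<String> handle(final HttpRequest request, final DataAccessException exception) {
    return HttpResponse.<String>status(HttpStatus.SERVICE_UNAVAILABLE)
        .body("Service Unavailable: database not reachable");
  }
}
```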
Tell us about the problem you're trying to solve
The health check endpoint exposed by the server API runs a simple SELECT 1 query to determine whether the database connection is healthy. This endpoint is used both by K8s for liveness/readiness checks and by the UI. Each open page in the UI makes this call, resulting in thousands of requests per minute. This puts strain on the database connection pools used by the server application and could lead to request queueing for non-health-check requests that interact with the database.

Describe the solution you'd like

Memoize the result of the SELECT 1 call for a period of time that is shorter than the liveness/readiness check interval, reducing the number of calls to the database (a rough sketch of one possible approach is included at the end of this issue).

Acceptance Criteria
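A minimal sketch of the memoization idea described in the solution section above, assuming jOOQ's DSLContext is what currently executes the query (class and method names are hypothetical). The TTL must stay below the Kubernetes probe interval so a dead database is still detected within one probe cycle.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.atomic.AtomicReference;
import org.jooq.DSLContext;

// Hypothetical: memoize the SELECT 1 result so that probes and UI polling
// within the TTL window share a single database round-trip.
public class MemoizedHealthCheck {

  // Example TTL; keep it shorter than the probe's periodSeconds (e.g. 20s).
  private static final Duration TTL = Duration.ofSeconds(10);

  private final DSLContext dslContext;
  private final AtomicReference<CachedResult> cache = new AtomicReference<>();

  public MemoizedHealthCheck(final DSLContext dslContext) {
    this.dslContext = dslContext;
  }

  public boolean isDatabaseHealthy() {
    final CachedResult cached = cache.get();
    if (cached != null && cached.checkedAt().plus(TTL).isAfter(Instant.now())) {
      return cached.healthy(); // served from memory, no database call
    }
    boolean healthy;
    try {
      dslContext.execute("SELECT 1");
      healthy = true;
    } catch (final Exception e) {
      healthy = false;
    }
    cache.set(new CachedResult(healthy, Instant.now()));
    return healthy;
  }

  private record CachedResult(boolean healthy, Instant checkedAt) {}
}
```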