airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com

Memoize Database Call in Health Check #25916

Open jdpgrailsdev opened 1 year ago

jdpgrailsdev commented 1 year ago

Tell us about the problem you're trying to solve

The health check endpoint exposed by the server API runs a simple SELECT 1 query to determine whether the database connection is healthy. This endpoint is used by Kubernetes for liveness/readiness probes, but also by the UI. Each open page in the UI makes this call, resulting in thousands of requests per minute. This puts strain on the database connection pools used by the server application and could lead to request queueing for non-health-check requests that interact with the database.

Describe the solution you’d like

Memoize the result of the SELECT 1 call for a period shorter than the liveness/readiness check interval to reduce the number of calls to the database.
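
As a rough illustration (not a final design), the memoization could look something like the sketch below, assuming Micronaut's cache support is available in the server module; the class name and the "health-check" cache name are made up for illustration, not the actual Airbyte classes:

```java
import io.micronaut.cache.annotation.CacheConfig;
import io.micronaut.cache.annotation.Cacheable;
import jakarta.inject.Singleton;
import org.jooq.DSLContext;

// Hypothetical wrapper around the existing SELECT 1 check; the real class
// names in airbyte-server may differ. The "health-check" cache would be
// configured with an expire-after-write shorter than the probe interval.
@Singleton
@CacheConfig("health-check")
public class MemoizedHealthCheckService {

  private final DSLContext dslContext;

  public MemoizedHealthCheckService(final DSLContext dslContext) {
    this.dslContext = dslContext;
  }

  // A no-arg @Cacheable method caches a single entry, so at most one real
  // database round-trip happens per expiry window, regardless of how many
  // health requests arrive.
  @Cacheable
  public boolean isDatabaseHealthy() {
    try {
      dslContext.fetch("SELECT 1");
      return true;
    } catch (final Exception e) {
      return false;
    }
  }
}
```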

Acceptance Criteria

davinchia commented 1 year ago

@timroes FE questions:

timroes commented 1 year ago

@davinchia The UI hits this endpoint every 20s, and if it fails 3 times in a row it simply shows a "server not reachable" toast notification at the bottom. It does not pause queries or limit anything else. I'm honestly not sure that's something we should still be doing at all. The user will notice the server is not reachable anyway once their next actual request fails, and seeing that banner usually won't stop them from continuing to use the app and running into other erroneous requests. I don't have the background information on why this check was built in the first place, though, or whether there's a reason it needs to happen.

jdpgrailsdev commented 1 year ago

Examples of usage of Micronaut caching:

https://github.com/airbytehq/airbyte-platform-internal/blob/master/oss/airbyte-config/init/src/main/java/io/airbyte/config/init/RemoteDefinitionsProvider.java#L40
https://github.com/airbytehq/airbyte-platform-internal/blob/master/oss/airbyte-config/init/src/main/java/io/airbyte/config/init/RemoteDefinitionsProvider.java#L113

More details here
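
For completeness, the actual expiry is driven by cache configuration rather than the @Cacheable annotation itself. Below is a hedged sketch of the relevant properties, shown programmatically (they would normally live in application.yml, and assume the Caffeine cache module is on the classpath); the cache name and the 5s TTL are illustrative assumptions chosen to stay under a typical 10s probe interval:

```java
import io.micronaut.context.ApplicationContext;
import java.util.Map;

// Illustrative only: the same keys would normally be set in application.yml.
// Keeping expire-after-write below the liveness/readiness probe interval
// bounds the check to roughly one real SELECT 1 per window per instance.
public class HealthCacheConfigExample {
  public static void main(final String[] args) {
    try (ApplicationContext ctx = ApplicationContext.run(Map.<String, Object>of(
        "micronaut.caches.health-check.expire-after-write", "5s",
        "micronaut.caches.health-check.maximum-size", "1"))) {
      // Beans resolved from ctx would observe the memoized health check.
    }
  }
}
```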

davinchia commented 1 year ago

@cgardens do you remember why the webapp checks the health endpoint? Is it so OSS users know when to restart a 'stuck' OSS server?

Considering getting rid of this as it creates non-trivial DB load. Gathering context now.

timroes commented 1 year ago

We just merged a PR that removes the health check in the FE for Cloud: https://github.com/airbytehq/airbyte-platform-internal/pull/6694
Once that is deployed, I hope we'll see far less load on this API.

cgardens commented 1 year ago

My context:

davinchia commented 1 year ago

Yes, the OSS PR was partially what sparked this conversation.

Some history - Justen's last comment describes something we ran into a year ago on Cloud. We had a connection leak, and the server's connection pool was entirely used up, leaving it unresponsive to the UI and jobs. This caused a fairly big P0. Back then, the health endpoint did not check the database and only returned a 200 once the server had started up. The simplest fix was to make the health endpoint include the database. We have not touched this since.

Although I agree with the general principle of readiness not including downstream dependencies, I don't think it applies as much to us here. The argument there is to prevent downtime amplification across the system, and it is more relevant in a large microservices architecture. We only have one layer, so this effect is minimal. Further, the server does require the database as a hard dependency as currently architected. We could make it return a 503 instead of being removed from service; however, I feel there is little value there.

Agree on the point of separating out the readiness/liveness endpoints. I would say that healthy = readiness, so I would push for them to be equivalent.

justenwalker commented 1 year ago

Hey folks! Author of airbytehq/airbyte-platform#221 here

Thank you @jdpgrailsdev and @davinchia for giving this some attention. I've personally had outages get worse because I was visiting the Console to find out what was going on, and the constant stream of health checks from the UI slowed the DB down even further. I think it's a great idea to re-evaluate whether these checks are necessary and/or speed them up if so.

With regards to the OSS PR (i.e. what should be tested in each probe), I think it might be helpful to put aside what the documented best practices are, and even what it means for Airbyte to be "ready", and instead focus on what the outcomes will be when probe failures happen. I hope the following can spark some discussion around whether these probes, as currently configured, will have the desired effect when they start to fail.

Readiness (Health) Check Observations

  1. If a pod's readinessProbe is NotReady, it is removed from the Service as a target - i.e. it will no longer receive traffic from a loadbalancer.
  2. If all pods' readinessProbes in a deployment become NotReady, the entire Service won't respond to requests at all, since it will have no members.
  3. In the case of database unavailability or high latency, all airbyte-server instances will fail their readinessProbe checks, because the database will be down or latent for all airbyte-server instances at the same time.

Readiness (Health) Check Outcomes

  1. Since the airbyte-server instances fail their readinessProbes, they will stop receiving traffic (but remain running).
  2. Since all airbyte-server instances are down, the API will stop responding.
  3. Since airbyte-webapp also relies on the airbyte-server's health checks via /api/v1/health - airbyte-webapp will fail its readinessProbe checks
  4. Since airbyte-webapp instances are down, the console is down -- there will be no opportunity to load any more html/scripts/etc...
  5. New visitors to the site will receive a blank page or an error of some kind (and not a nice error, probably a 502)
  6. Existing visitors will likely have a very strange experience, limited to whatever the current page and already-loaded JS code can do without a responding API; a refresh will wipe that out (turning them into "new" visitors).

Questions

Is my analysis above correct? (I can't see the internal PRs linked above, so maybe some of this is wrong.) If so, are the above outcomes desirable? I'm still of the opinion that a dependent service shouldn't necessarily take itself offline if a dependency is unavailable, since in doing so it gives up the ability to return useful error messages and apply back-pressure to downstream clients. I'd much rather an API call return something like 503 - Service Unavailable: Database not reachable than get a 502 - Bad Gateway from the LB and have to figure out what went wrong.
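
To make that last point concrete, here is a rough sketch of the "degrade instead of disappear" idea, reusing the hypothetical MemoizedHealthCheckService from the sketch earlier in this thread (this is not Airbyte's actual controller or response schema):

```java
import io.micronaut.http.HttpResponse;
import io.micronaut.http.HttpStatus;
import io.micronaut.http.annotation.Controller;
import io.micronaut.http.annotation.Get;

// Hypothetical sketch only: the real airbyte-server health endpoint and its
// response schema differ. The point is that the process keeps answering and
// reports the broken dependency itself instead of being pulled from the LB
// and leaving clients with a 502.
@Controller("/api/v1/health")
public class HealthController {

  private final MemoizedHealthCheckService healthCheck;

  public HealthController(final MemoizedHealthCheckService healthCheck) {
    this.healthCheck = healthCheck;
  }

  @Get
  public HttpResponse<String> health() {
    return healthCheck.isDatabaseHealthy()
        ? HttpResponse.ok("ok")
        : HttpResponse.<String>status(HttpStatus.SERVICE_UNAVAILABLE)
            .body("Database not reachable");
  }
}
```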