TheRacetrack / racetrack

An opinionated framework for deploying, managing, and serving application workloads
https://theracetrack.github.io/racetrack/
Apache License 2.0

Cache PUB internal calls #213

Open iszulcdeepsense opened 1 year ago

iszulcdeepsense commented 1 year ago

On each Job call, the PUB component makes a request to Lifecycle to check whether the user is allowed to talk to the job and to get the job's internal address to forward the request to. Lifecycle, in turn, queries a database, which can perform badly under higher load.

To improve PUB's performance, these internal calls could be cached with a short time-to-live (e.g. 1 minute). The composite cache key consists of: user auth token, job name, job version, job endpoint.

The caching could be done either by PUB or by Lifecycle, though it makes more sense for PUB to do it.
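A minimal sketch of the proposed TTL cache, assuming a 60-second TTL; the function and parameter names (`cached_authorize`, `lifecycle_call`) are illustrative, not Racetrack's actual API:

```python
import time

CACHE_TTL_SECONDS = 60
_cache: dict = {}  # composite key -> (expiry timestamp, cached result)

def cached_authorize(user_token: str, job_name: str, job_version: str,
                     job_endpoint: str, lifecycle_call) -> dict:
    """Return the Lifecycle auth/routing result, reusing it for up to the TTL."""
    key = (user_token, job_name, job_version, job_endpoint)
    now = time.monotonic()
    entry = _cache.get(key)
    if entry is not None and entry[0] > now:
        return entry[1]  # still fresh: skip the Lifecycle (and DB) round trip
    result = lifecycle_call(user_token, job_name, job_version, job_endpoint)
    _cache[key] = (now + CACHE_TTL_SECONDS, result)
    return result
```

With this shape, a revoked token keeps working for up to one TTL window after revocation, which is exactly the kind of subtle security trade-off discussed below.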

JosefAssadERST commented 1 year ago

which can perform badly under higher load

Has this actually been the case?

I'm not against caching at all, but I typically prefer to try as many other things as possible first. Off the top of my head:

Once we start caching, we will pay a price later in more difficult troubleshooting, and it becomes easier to make mistakes that cause bugs or security issues (e.g. anything touching user accounts will also need to remember to update the cache).

iszulcdeepsense commented 1 year ago

Has this actually been the case?

To be honest, it's just my conjecture. It has been shown that a single request to Racetrack v2 can be slower than in v1 due to the external database. This idea is just a trick to mitigate that, but I don't think it's a problem right now.

At this point, let's leave it as it is. This issue could even be classified as a "premature optimization". Plus, I agree, this could make troubleshooting and testing more difficult.

JosefAssadERST commented 2 weeks ago

By the way, if this really is such a big performance problem, another way to do it without running into the risk of race conditions and fiddling with TTLs is for PUB to keep a copy of the auth table in memory; whenever there's a change to the auth table, Lifecycle pushes that change to PUB.

Better still is not doing this at all, but if there really is a demonstrable performance problem (and not just us being annoyed that there are more calls when there's no real price to pay for those resources), then this model is more robust.
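The push model above could be sketched roughly like this; the class and method names are hypothetical, not Racetrack's actual code, and the transport Lifecycle would use to push updates is left out:

```python
import threading

class AuthReplica:
    """PUB-side in-memory copy of the auth table, kept current by Lifecycle pushes."""

    def __init__(self):
        self._lock = threading.Lock()
        self._table: dict = {}  # (user token, job name) -> set of allowed endpoints

    def apply_update(self, key, endpoints):
        """Called on every Lifecycle push when the auth table changes."""
        with self._lock:
            if endpoints is None:
                self._table.pop(key, None)  # revocation: drop the entry immediately
            else:
                self._table[key] = set(endpoints)

    def is_allowed(self, user_token, job_name, endpoint) -> bool:
        """Local lookup on each Job call: no DB round trip, no TTL to tune."""
        with self._lock:
            return endpoint in self._table.get((user_token, job_name), ())
```

Unlike a TTL cache, a revocation takes effect as soon as the push arrives, so there is no stale-entry window; the remaining failure mode to handle is a missed push (e.g. PUB restarting), which would need a full resync from Lifecycle.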

iszulcdeepsense commented 2 weeks ago

I like this idea as an effective mitigation of unstable database connections. In case of a connection loss, PUB could withstand the outage for a short time (if that were the problem to solve) by using the in-memory copy. I think that improving availability, more than performance, would be the driving factor to make this issue happen one day.

anders314159 commented 2 weeks ago

It also sounds like a fun thing to implement.

JosefAssadERST commented 1 week ago

Bear in mind, I don't think you should do this unless the DB calls objectively are a problem.

I'm proposing a cache here. That has a high price in debuggability; getting a cache to work reliably is one of those things that seems simple enough but can actually trip you up in a lot of subtle ways. There are a lot of failure modes.

But if you HAVE to, this seems more reliable than caching DB calls.