Peddle commented 1 year ago

What is this?

Adds two fields to /__status__ endpoint: gpu_available: bool and sequence_number: int

gpu_available is true if the gpu lock is not locked. sequence_number increments by one every time a task acquires the GPU.

This PR also refactors how we're dealing with locks in potassium eliminating a few race conditions.

Why?

Using both of these, a load balancer can effectively determine the current state of a potassium deployment.

How did you test it works without regressions?

tests are passing + manual toy app is working as well (more thorough testing should be done with cover in prod before merging staging into main to ensure integration is safe)

If this is a new feature what may a critical error look like?

Things to consider to not repeat mistakes we've learned from many times

[ ] If critical errors fire do we ping team in some obvious way (e.g., slack)?
[ ] Are there debug logs + a way to see these logs if we need to debug?
[ ] Is this documented enough a dev could work on this code without getting stuck or having to ping you?

erik-dunteman commented 1 year ago

Overall:

status endpoint looks good, no qualms there
lock acquisition all good except the areas where we want to handle the background task done signal
Not sure we need to change from the previous queue design. It was intended to be certain to have an event per background task release, not affect the rest of the app (doesn't lock the global lock), and indicated as private to users.

erik-dunteman commented 1 year ago

did a call offline - approved contingent on changes and test clearance

bananaml / potassium

Feature/track gpu status #31

What is this?

Why?

How did you test it works without regressions?

If this is a new feature what may a critical error look like?

Things to consider to not repeat mistakes we've learned from many times