Striveworks / valor

Valor is a centralized evaluation store which makes it easy to measure, explore, and rank model performance.
https://striveworks.github.io/valor/
Other
38 stars 4 forks source link

BUG: API reset can cause evaluations to get stuck with `RUNNING` status. #541

Open czaloom opened 6 months ago

czaloom commented 6 months ago

valor version checks

Reproducible Example

Restart server mid-computation but retain the database.

Issue Description

Evaluations in Valor are performed using a combination of python code in the backend and some postgresql queries. If the API is restarted mid-computation a desynchronization occurs between the API and the database.

Expected Behavior

Any evaluation stuck in a RUNNING status after a timeout threshold should be set to the FAILED status.

ntlind commented 6 months ago

Moving this one to backlog. The best fix here might be implementing a message broker.