abraham-ai / eden

Eden converts your python function into a hosted endpoint with minimal changes to your existing code :mage_man:
GNU General Public License v3.0

add /health as a way to check overall system health/status #13

Closed (Mayukhdeb closed 2 years ago)

Mayukhdeb commented 3 years ago

/health could be a way to check the following metrics:

There may be other useful metrics that I am missing, but the idea is to use these to conditionally scale up or down in Kubernetes.

This is how it could be:

client.check_health()

would return:

{
    'num_queued': 3,
    'num_running': 4,
    'num_failed': 1,
    'num_complete': 9,
    'num_gpus': 4,
    'usage': {
        'cpu': 0.4,
        'mem': 0.3,
        'gpu': {
            0: 0.3,
            1: 0.5,
            2: 0.4,
            3: 0.2,
        },
    }
}
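
To make the scaling idea concrete, here is a minimal sketch of how a payload shaped like the one above could be turned into a scale up/down decision. The function name `scaling_decision` and both thresholds (`max_queue_per_gpu`, `idle_gpu_usage`) are hypothetical and not part of Eden; an actual Kubernetes setup would more likely feed these numbers into an autoscaler than call a function like this directly.

```python
from typing import Any, Dict

# Example payload mirroring the proposed /health response.
example_health: Dict[str, Any] = {
    "num_queued": 3,
    "num_running": 4,
    "num_failed": 1,
    "num_complete": 9,
    "num_gpus": 4,
    "usage": {
        "cpu": 0.4,
        "mem": 0.3,
        "gpu": {0: 0.3, 1: 0.5, 2: 0.4, 3: 0.2},
    },
}


def scaling_decision(
    health: Dict[str, Any],
    max_queue_per_gpu: float = 2.0,  # hypothetical threshold
    idle_gpu_usage: float = 0.1,     # hypothetical threshold
) -> str:
    """Return 'up', 'down', or 'hold' for a /health-style payload."""
    num_gpus = max(health["num_gpus"], 1)  # avoid division by zero
    queued = health["num_queued"]
    running = health["num_running"]

    gpu_usage = list(health["usage"]["gpu"].values())
    mean_gpu = sum(gpu_usage) / len(gpu_usage) if gpu_usage else 0.0

    # Scale up when the backlog per GPU grows too large.
    if queued / num_gpus > max_queue_per_gpu:
        return "up"
    # Scale down only when nothing is queued or running and the GPUs are idle.
    if queued == 0 and running == 0 and mean_gpu < idle_gpu_usage:
        return "down"
    return "hold"


print(scaling_decision(example_health))
```

With the example payload, the backlog is 3 jobs across 4 GPUs, so the sketch neither scales up nor down.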

cc: @genekogan @one1zero1one

one1zero1one commented 3 years ago

[edit] Streamlined the suggestion and moved an unnecessary rant to a different place.

Eden could benefit from a status endpoint that returns JSON, which the frontend can consume for extended functionality.

But for operational reasons, please also consider instrumenting with Prometheus.

This will automatically create a simple HTTP /metrics endpoint that can be consumed as-is or scraped by Prometheus, allowing basic time-series visualisation and further integrations with cloud-native products.
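
For a rough idea of what a scraper would see on /metrics, the sketch below renders a health payload in the Prometheus text exposition format using only the standard library. This is a hand-rolled illustration, not the official client library (which would normally generate this output) and not what Eden actually ships. The function name `render_metrics`, the `eden_` prefix, and every metric name are assumptions made for this example.

```python
from typing import Any, Dict, List

# Example payload mirroring the /health proposal earlier in this thread.
example_health: Dict[str, Any] = {
    "num_queued": 3,
    "num_running": 4,
    "num_failed": 1,
    "num_complete": 9,
    "num_gpus": 4,
    "usage": {
        "cpu": 0.4,
        "mem": 0.3,
        "gpu": {0: 0.3, 1: 0.5, 2: 0.4, 3: 0.2},
    },
}


def render_metrics(health: Dict[str, Any], prefix: str = "eden") -> str:
    """Render a health payload as Prometheus text-format gauges."""
    lines: List[str] = []

    # Job counts and GPU count become plain gauges (hypothetical names).
    for key in ("num_queued", "num_running", "num_failed",
                "num_complete", "num_gpus"):
        lines.append(f"# TYPE {prefix}_{key} gauge")
        lines.append(f"{prefix}_{key} {health[key]}")

    usage = health["usage"]
    for key in ("cpu", "mem"):
        lines.append(f"# TYPE {prefix}_{key}_usage gauge")
        lines.append(f"{prefix}_{key}_usage {usage[key]}")

    # Per-GPU usage is one metric with a "gpu" label per device.
    lines.append(f"# TYPE {prefix}_gpu_usage gauge")
    for index, value in sorted(usage["gpu"].items()):
        lines.append(f'{prefix}_gpu_usage{{gpu="{index}"}} {value}')

    return "\n".join(lines) + "\n"


print(render_metrics(example_health))
```

Using a label for the GPU index keeps the metric name stable when the number of GPUs changes, which tends to make dashboards and alert rules easier to write.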

Mayukhdeb commented 2 years ago

Update: we have exposed some of these metrics via Prometheus, as seen here. The catch is that they are exposed on /metrics and not /health. We can consider this issue fixed as of https://github.com/abraham-ai/eden/commit/c20af9b259f2c5b32d0f6856cfd158a253d03950