django / daphne

Django Channels HTTP/WebSocket server
BSD 3-Clause "New" or "Revised" License

Runtime Metrics API #118

Open matclayton opened 7 years ago

matclayton commented 7 years ago

It would be nice if Daphne could provide an api for runtime metrics for things like messages/active connections.

I imagine this could be built from two APIs. The first would be a basic HTTP API we could hit to get the current number of active connections (in our case WebSocket) for that Daphne instance.

The second would be a plugin/interface we could hook into to emit runtime events. We'd like to emit statsd events for connections opened/messages sent/closed etc., but I imagine there are other metric systems people use, so a generic stats interface/plugin system would be better than embedding statsd into Daphne directly.

andrewgodwin commented 7 years ago

Agreed, this seems like a sensible feature that's needed above and beyond the channel layer statistics interface. I've been a fan of, e.g., the HAProxy stats info for a while.

matclayton commented 7 years ago

Do you have any idea what a preferred API would look like for these? We're working on an ASGI layer which wraps the Redis one and emits statsd data at the moment, but we could divert resources to extending Channels if there was a high likelihood of it being integrated.

(FYI we're replacing an existing custom-built WebSocket system, which supports tens of thousands of connections, with Channels, and have been really impressed with the API. We're still trying to shake out a few production scaling bugs, but it's looking much better than before! One issue we're seeing is booting up workers while the channel is full: they seem to crash out immediately with ChannelFull errors.)


andrewgodwin commented 7 years ago

Well, the main question is whether it should live on the Daphne instance directly, on a separate path/port (probably easiest to implement and scale, but harder to aggregate), or be sent over the channel layer to a specified channel (which would make it easy to get the stats from all servers, but harder to guarantee delivery, as that channel may itself become full).

I think I'd prefer a special path served by the HTTP server inside Daphne itself that just returns a JSON document with the current stats, and then leave any aggregation up to the organisation implementing it (and e.g. cross-referencing with machine load graphs).
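As a rough illustration of that idea (the `/stats` path, field names, and `serve_stats` helper below are all hypothetical, not Daphne's actual API), a JSON stats endpoint served alongside normal traffic could look like:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters; a real server would update these
# from its connection-handling code.
STATS = {"open_connections": 0, "total_requests": 0}

class StatsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/stats":
            # Return the current stats as a JSON document.
            body = json.dumps(STATS).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, fmt, *args):
        pass  # keep the example quiet

def serve_stats(port=0):
    """Start the stats endpoint on a daemon thread; return the bound port."""
    httpd = HTTPServer(("127.0.0.1", port), StatsHandler)
    threading.Thread(target=httpd.serve_forever, daemon=True).start()
    return httpd.server_address[1]
```

A deployment would then scrape `GET /stats` from each instance and do any aggregation externally, as described above.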

The worker bug sounds annoying - if you could open a separate issue with a full traceback of that I can see if we can fix it. Workers shouldn't hit ChannelFull in general, and if they do, they're meant to wait.

proofit404 commented 7 years ago

I think the channel layer is the preferred solution.

If it becomes full, Daphne can send an updated version of the statistics next time. Statistics would only be lost when a Daphne process restarts before it has sent its current portion of data.

Also, we can send this data on soft shutdown.
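The retry behaviour described above (keep only the newest snapshot, and resend on the next tick if the channel was full) can be sketched as follows; `ChannelFull` and `StatsPublisher` here are illustrative stand-ins, not real asgi_redis classes:

```python
class ChannelFull(Exception):
    """Stand-in for the channel layer's ChannelFull error."""

class StatsPublisher:
    def __init__(self, send):
        self.send = send     # send(stats_dict); may raise ChannelFull
        self.pending = None  # latest unsent snapshot

    def publish(self, stats):
        # Always keep only the newest snapshot; an older unsent one is
        # superseded rather than queued.
        self.pending = stats
        self.flush()

    def flush(self):
        if self.pending is None:
            return
        try:
            self.send(self.pending)
        except ChannelFull:
            return  # channel full: retry on the next publish/flush
        self.pending = None
```

Calling `flush()` once more on soft shutdown would cover the last snapshot too.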

andrewgodwin commented 7 years ago

My problem with having it come over the channel layer is, how do you identify the hosts? This is a problem we faced setting up monitoring at work for some of our stuff; our systems can already interrogate things on a server directly, but it's harder to tie it back to a server from the channel layer.

proofit404 commented 7 years ago

Consul telemetry integration seems the easiest way to me, but it's too tightly tied to vendor software; not everyone will be happy with it.

andrewgodwin commented 7 years ago

Yes, I'm not tying into something that tightly. This is why I think an HTTP endpoint on Daphne itself makes the most sense for the short term: it's already an HTTP server, and this is a pretty standard way to report metrics for HTTP handlers.

sachinrekhi commented 7 years ago

@matclayton @andrewgodwin do you know if any progress has been made against having a runtime metrics API in daphne?

I definitely need this for my app to understand & improve scaling and performance. I'm specifically looking to track:

- # of active WebSocket connections
- # of messages waiting to be processed per channel
- average delay between Daphne receiving a message and it being handed off to a worker
- average response times for handling messages (both the time for the worker to process the message, and overall clock time from Daphne receiving the request to finishing processing)
- any capacity issues like ChannelFull errors
- a health check / heartbeat to ensure things are overall fine with Daphne (not exactly sure what I'd be diagnosing here)

What would be the best way to track and get these metrics? Is there enough information in redis to ascertain some of this information?

Or if additional information needs to be tracked to get at these stats during daphne processing, is capturing those stats in redis during message processing and then retrieving them on the monitor call a good approach?

I might be able to help contribute something with some guidance.

andrewgodwin commented 7 years ago

No progress has been made, partially because Daphne has been heavily rewritten for the upcoming 2.0 release, as the way applications run has changed substantially. HTTP/WebSocket handling is done in-process now, which means that half of the things you ask for no longer even exist as statistics; the only things we could serve would be the number of connections and a heartbeat of some kind.

sachinrekhi commented 7 years ago

OK, got it. I'm definitely looking for something on the 1.x series, given the Python 3 requirement of 2.0 will likely mean we can't adopt it in the near term. But it also probably doesn't make sense to try to contribute something here given the substantial upcoming changes.

I saw mentions of global_statistics() and channel_statistics(channel) in the docs. Have these been implemented and available today for the asgi_redis channel layer?

andrewgodwin commented 7 years ago

Those should work for asgi_redis, yes, we used them at work for a little while.

sachinrekhi commented 7 years ago

I was able to create a stats page leveraging channel_statistics(channel) for top inbound channels, which is great.

Any suggestions on ways to get at the # of active web socket connections?

andrewgodwin commented 6 years ago

The number of open connections is only stored within Daphne at the moment; you'd have to somehow get at the status dictionaries to see what it is, unless you do separate counting in connect/disconnect handlers.
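The separate-counting approach can be as simple as a shared thread-safe counter that your connect/disconnect handlers update; the names below are purely illustrative, not part of Daphne or Channels:

```python
import threading

class ConnectionCounter:
    """Thread-safe counter a connect/disconnect handler pair can update."""

    def __init__(self):
        self._lock = threading.Lock()
        self._count = 0

    def connected(self):
        # Call from the websocket.connect handler.
        with self._lock:
            self._count += 1

    def disconnected(self):
        # Call from the websocket.disconnect handler.
        with self._lock:
            self._count -= 1

    @property
    def count(self):
        with self._lock:
            return self._count

# Module-level instance a stats page could read from.
ACTIVE_WEBSOCKETS = ConnectionCounter()
```

Note this only counts connections the application layer saw; connections Daphne rejected or dropped before dispatch wouldn't show up.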

spielmannj commented 6 years ago

I am currently implementing something for this and will (hopefully) place a PR soon.

andrewfader commented 6 years ago

+1

tomchristie commented 6 years ago

My strong preference here would be an ASGI message type for server statistics/monitoring/performance that includes information such as:

A really nice win would be to have an ASGI app that collates these messages and presents an HTTP interface for inspecting them (ideally at both per-cluster and per-process levels).

The obvious benefit there would be to make sure that we're sharing work between daphne/uvicorn/hypercorn. Building that tooling at the ASGI level is also a nicer level of separation that I think would help wrt. maintainability and longevity of any tools we put together in this area.

(I'm not too sure how this interacts with channels and channels statistics, since I'm more motivated by the plain ASGI server case.)

andrewgodwin commented 6 years ago

Hm, I never thought of this as an ASGI thing as my use case would have been external monitoring and healthchecks, and even if you wrote an app that collected messages and presented an API it would still need some kind of persistence to keep the data around.

tomchristie commented 6 years ago

Sure, I was more thinking that whether you log metrics to standard logging, a metrics file, statsd, or whatever else, it'd be more adaptable if you push the stats out through ASGI, and then have a small app that deals with how to log/collate/present them.

(Plus it makes it easier to then build shared interfaces across daphne/uvicorn/hypercorn, and makes it easier to build metrics displays that present information from across a whole cluster, or allow digging into a single process.)

In any case, probably better to start with a proposed implementation (or clearly defined set of metrics you'd like to see) and take things from there.
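As a purely hypothetical starting point (none of these names exist in Daphne, uvicorn, hypercorn, or the ASGI spec), such a stats-emitting wrapper might look like:

```python
class StatsMiddleware:
    """ASGI 3 wrapper that counts scopes by type and forwards events to a
    pluggable emitter callable (e.g. something that writes to statsd).
    All names here are illustrative assumptions."""

    def __init__(self, app, emit):
        self.app = app
        self.emit = emit  # emit(metric_name: str, value: int)
        self.active = 0   # scopes currently in flight

    async def __call__(self, scope, receive, send):
        self.active += 1
        self.emit(f"asgi.{scope['type']}.started", 1)
        try:
            await self.app(scope, receive, send)
        finally:
            self.active -= 1
            self.emit(f"asgi.{scope['type']}.finished", 1)
```

The `emit` callback is where a statsd client, Prometheus counter, or plain log line would plug in, which keeps the wrapper itself server-agnostic.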

JohnDoee commented 6 years ago

I was testing out something like this:

Internet -> Daphne -> Metrics logger ASGI app -> Real ASGI app
                              |
                             \|/
             Plugin interface to handle metrics

And it struck me that I'd never actually log stats in Daphne, because I'd always run a reverse proxy in front and handle it like this:

Internet -> nginx -> Daphne -> Real ASGI app
              |
             \|/
        Log metrics here

The only thing lacking here would be any background tasks running in the Real ASGI app after the request is finished.

Is it common to expose Daphne directly to the internet?

andrewgodwin commented 6 years ago

Some people do, sure, but I imagine it's not everyone. This ticket isn't meant to be about background logging, anyway; it's meant to be a simple HTTP API that Daphne optionally serves, letting you see internal stats like current request numbers etc.

eselkin commented 5 years ago

Something for prometheus to scrape would be excellent. django-prometheus surrounds the middleware to collect stats. Is there a way to surround the URLRouter in a middleware that just collects stats?

thomasf commented 5 years ago

I have a setup for minimal Prometheus monitoring of uvicorn in an application.

It started with us having problems where Daphne locked up from time to time with no way to introspect what was going on inside it. In the process we switched to uvicorn and have not had any problems since (this could also be related to other upgrades). I don't remember if I had to patch something inside Daphne to get it running with this or not.

It provides Prometheus metrics for the number of active tasks and received requests, plus a simple /tasks page which dumps asyncio task info for all active tasks. There might be race conditions, since the asyncio loop is read from another thread (I'm not sure exactly what the rules are), but we have not noticed any problems with it in production yet.

It's more of a stepping stone than something feature-complete, but it allows us to monitor our Django Channels system for warning signs (growing asyncio tasks and/or open file descriptors) and request rates.

run_uvicorn.py (launcher wrapper)

# This file can be used with Daphne to launch with a metrics exporter and debugger HTTP resources.
import asyncio
import logging
import os

import django
from channels.routing import get_default_application

from foobar.asgi_server import httpserver
from foobar.asgi_server.applications import MetricsApp

if "PROMETHEUS_PORT_RANGE" in os.environ:
    del os.environ["PROMETHEUS_PORT_RANGE"]

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "foobar.settings")

django.setup()

logger = logging.getLogger(__name__)

httpserver.start_server(loop=asyncio.get_event_loop())

application = MetricsApp(get_default_application())

httpserver.py

import asyncio
import io
import logging
import socket
import threading
import time
from http.server import HTTPServer

import prometheus_client

from foobar.asgi_server import prom

logger = logging.getLogger(__name__)

def start_server(*, loop, addr="0.0.0.0", port_range=range(8001, 8065)):
    # note that asyncio has to have been initialized at this point so that
    # asyncio.get_event_loop() returns the correct loop.
    MetricsHandler.loop = loop
    for port in port_range:
        try:
            httpd = HTTPServer((addr, port), MetricsHandler)
        except (OSError, socket.error):
            continue  # Try next port
        thread = PrometheusEndpointServer(httpd)
        thread.daemon = True
        thread.start()
        logger.info("Exporting Prometheus /metrics/ on port %s" % port)
        return

class MetricsHandler(prometheus_client.MetricsHandler):
    loop = None
    next_update = None

    def do_GET(self):
        if self.next_update is None or time.monotonic() > self.next_update:
            prom.asyncio_tasks.set(len(asyncio.all_tasks(loop=self.loop)))
            self.next_update = time.monotonic() + 1
        if self.path == "/tasks":
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
            self.end_headers()
            for v in asyncio.all_tasks(loop=self.loop):
                done = v.done()
                cancelled = v.cancelled()
                tid = id(v)
                ss = io.StringIO()
                ss.write(f"---------------  id:{tid}  done:{done}  cancelled:{cancelled}\n\n")
                try:
                    exc = v.exception()
                    if exc is not None:
                        ss.write("-- Exception: \n")
                        print(exc, file=ss)
                        ss.write("\n")
                except asyncio.CancelledError:
                    pass
                except asyncio.InvalidStateError:
                    pass
                ss.write("-- Traceback: \n")
                v.print_stack(file=ss)
                self.wfile.write(ss.getvalue().encode())
                self.wfile.write("\n\n\n".encode())
        else:
            super().do_GET()

class PrometheusEndpointServer(threading.Thread):
    """A thread class that holds an HTTPServer and makes it serve_forever()."""

    def __init__(self, httpd, *args, **kwargs):
        self.httpd = httpd
        super().__init__(*args, **kwargs)

    def run(self):
        self.httpd.serve_forever()

prom.py

from prometheus_client import Counter, Gauge

req_recv = Counter("asgi_received_requests", "", ["type", "method"])
asyncio_tasks = Gauge("asyncio_active_tasks", "")

uvicorn_utils.py (not really needed IIRC)

import uvicorn.config
import uvicorn.loops.uvloop
import uvicorn.main

def patch():
    """Patch out uvicorn's asyncio loop setup."""
    uvicorn.config.LOOP_SETUPS["auto"] = "foobar.asgi_server.uvicorn_utils:_fake_setup"
    uvicorn.config.LOOP_SETUPS["uvloop"] = "foobar.asgi_server.uvicorn_utils:_fake_setup"

def _fake_setup():
    """No-op to prevent uvicorn from doing anything to the event loop."""
    pass

def uvloop_setup():
    """Run uvicorn's uvloop setup."""
    uvicorn.loops.uvloop.uvloop_setup()

def start():
    uvicorn.main.main()

applications.py

import logging

from foobar.asgi_server import prom

logger = logging.getLogger(__name__)

class MetricsApp:
    # An ASGI application (ASGI 2 two-callable style) that exports very
    # basic metrics using Prometheus.
    def __init__(self, parent):
        self.parent = parent

    def __call__(self, scope):
        prom.req_recv.labels(scope.get("type", ""), scope.get("method", "")).inc()
        # logger.error(f"{scope}")
        res = self.parent.__call__(scope)
        return res

Example from a staging environment. We also monitor the load balancer (traefik), so we can compare the number of open connections externally to asyncio active tasks and file descriptors (because we are looking for indications of the server locking up).

[screenshot: staging metrics dashboard]

jheld commented 4 years ago

I know my project could use something like this. Can we discuss the stability of the current design approaches?

Do the examples provided work well enough, and are they modifiable for others' use cases?

tapionx commented 3 years ago

I would be happy if Daphne included such a Prometheus exporter.