Open — matclayton opened this issue 7 years ago
Agreed, this seems like a sensible feature that's needed above and beyond the channel layer statistics interface. I've been a fan of the e.g. HaProxy stats info for a while.
Do you have any idea what a preferred API would look like for these? We're working on an ASGI layer which wraps the Redis one and emits statsd data at the moment, but could divert resources to extending Channels if there was a high likelihood of it being integrated.
(FYI, we're replacing an existing custom-built WebSocket system with Channels, which supports tens of thousands of connections, and have been really impressed with the API. We're still trying to shake out a few production scaling bugs, but it's looking much better than before! One issue we're seeing is booting up workers while the channel is full; they seem to crash out immediately with ChannelFull errors.)
Well, the main question is whether it should be on the Daphne instance directly on a separate path/port, which is probably easiest to implement and scale but harder to aggregate, or sent over the channel layer to a specified channel, which would make it easy to get the stats from all servers but harder to ensure it works (as that channel may itself become full).
I think I'd prefer a special path served by the HTTP server inside Daphne itself that just returns a JSON document with the current stats, and then leave any aggregation up to the organisation implementing it (and e.g. cross-referencing with machine load graphs).
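To make the proposal concrete, here is a rough sketch of what such a JSON stats document might look like. The path and all field names are invented for illustration; nothing here is an agreed or implemented schema.

```python
import json

# Hypothetical snapshot a Daphne instance might serve on a reserved
# stats path (e.g. something like /daphne-stats/ -- path and fields
# are invented for illustration, not an implemented API).
def render_stats(open_connections, requests_handled, uptime_seconds):
    snapshot = {
        "open_connections": open_connections,
        "requests_handled": requests_handled,
        "uptime_seconds": uptime_seconds,
    }
    return json.dumps(snapshot)

body = render_stats(open_connections=42, requests_handled=1234, uptime_seconds=360)
print(body)
```

Aggregation across machines would then be a matter of polling each instance's endpoint and cross-referencing, as suggested above.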
The worker bug sounds annoying - if you could open a separate issue with a full traceback of that I can see if we can fix it. Workers shouldn't hit ChannelFull in general, and if they do, they're meant to wait.
I think the channel layer is the preferred solution.
If it becomes full, Daphne can send an updated version of the statistics next time, so statistics will only be lost if we restart the Daphne process before it has sent the current batch of data.
Also, we can send this data on soft shutdown.
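The "send it again next time" idea works because cumulative counters are idempotent to report: each successful send supersedes the last, so a dropped message (e.g. on ChannelFull) only delays visibility rather than losing data. A minimal sketch of that pattern, with all names invented:

```python
# Sketch of "send cumulative stats, retry next tick": counters only
# ever grow, so if one report is dropped, the next successful report
# carries the same totals. Names are hypothetical, not a Daphne API.
class StatsReporter:
    def __init__(self, send):
        self.send = send  # callable that may raise if the channel is full
        self.counters = {"requests": 0, "connections": 0}

    def incr(self, name, amount=1):
        self.counters[name] += amount

    def report(self):
        try:
            self.send(dict(self.counters))
            return True
        except Exception:
            return False  # totals are kept; retry on the next tick

sent = []
reporter = StatsReporter(sent.append)
reporter.incr("requests")
reporter.report()
reporter.incr("requests")
reporter.report()
print(sent[-1]["requests"])  # 2 -- the cumulative total survives
```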
My problem with having it come over the channel layer is, how do you identify the hosts? This is a problem we faced setting up monitoring at work for some of our stuff; our systems can already interrogate things on a server directly, but it's harder to tie it back to a server from the channel layer.
Consul telemetry integration seems the easiest way to me. But it's too tightly tied to vendor software; not everyone will be happy with it.
Yes, I'm not tying into something that tightly. This is why I think an HTTP endpoint on Daphne itself makes the most sense for the short term - it's already an HTTP server, and this is a pretty standard way to report metrics for HTTP handlers.
@matclayton @andrewgodwin do you know if any progress has been made against having a runtime metrics API in daphne?
I definitely need this for my app to understand & improve scaling and performance. I'm specifically looking to track:
- # of active WebSocket connections
- # of messages waiting to be processed per channel
- average delay between Daphne receiving a message and it being handed off to a worker
- average response times for handling messages (both the time for the worker to process the message, and the overall clock time from Daphne receiving the request to finishing processing)
- any capacity issues like ChannelFull errors
- a health check / heartbeat to ensure things are overall fine with Daphne (not exactly sure what I'd be diagnosing here)
What would be the best way to track and get these metrics? Is there enough information in redis to ascertain some of this information?
Or if additional information needs to be tracked to get at these stats during daphne processing, is capturing those stats in redis during message processing and then retrieving them on the monitor call a good approach?
I might be able to help contribute something with some guidance.
No progress has been made, partially because Daphne has been heavily rewritten for the upcoming 2.0 release, as the way applications run has changed substantially. HTTP/WebSocket handling is done in-process now, which means that half of the things you ask for no longer even exist as statistics - the only things we could serve would be the number of connections and a heartbeat of some kind.
OK, got it. Definitely looking for something on the 1.0 series, given the Python 3 requirement on 2.0 will likely mean we can't adopt it in the near term. But it also probably doesn't make sense to try to contribute something here given the substantial upcoming changes.
I saw mentions of global_statistics() and channel_statistics(channel) in the docs. Have these been implemented and available today for the asgi_redis channel layer?
Those should work for asgi_redis, yes, we used them at work for a little while.
I was able to create a stats page leveraging channel_statistics(channel) for top inbound channels, which is great.
Any suggestions on ways to get at the # of active web socket connections?
The number of open connections is only stored within Daphne at the moment, you'd have to somehow get at the status dictionaries to see what it is. Unless you do separate counting in connect/disconnect handlers.
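The "separate counting in connect/disconnect handlers" approach mentioned above can be sketched as a small process-local gauge. In a real deployment you would increment a shared store (Redis, statsd) instead; all names here are invented for illustration.

```python
import threading

# Process-local gauge of open WebSocket connections, updated from
# connect/disconnect handlers. Hypothetical helper, not a Channels API.
class ConnectionGauge:
    def __init__(self):
        self._lock = threading.Lock()
        self._count = 0

    def connected(self):
        with self._lock:
            self._count += 1

    def disconnected(self):
        with self._lock:
            self._count -= 1

    @property
    def current(self):
        with self._lock:
            return self._count

gauge = ConnectionGauge()
# Call gauge.connected() in your WebSocket connect handler and
# gauge.disconnected() in the disconnect handler.
gauge.connected()
gauge.connected()
gauge.disconnected()
print(gauge.current)  # 1
```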
I am currently implementing something for this and will (hopefully) place a PR soon.
+1
My strong preference here would be an ASGI message type for server statistics/monitoring/performance that includes information such as:
A really nice win would be to have an ASGI app that collates these messages and presents an HTTP interface for inspecting them. (ideally both at per-cluster and per-process levels)
The obvious benefit there would be to make sure that we're sharing work between daphne/uvicorn/hypercorn. Building that tooling at the ASGI level is also a nicer level of separation that I think would help wrt. maintainability and longevity of any tools we put together in this area.
(I'm not too sure how this interacts with channels and channels statistics, since I'm more motivated by the plain ASGI server case.)
Hm, I never thought of this as an ASGI thing as my use case would have been external monitoring and healthchecks, and even if you wrote an app that collected messages and presented an API it would still need some kind of persistence to keep the data around.
Sure, I was more thinking that whether you log metrics to standard logging, a metrics file, statds, or whatever else, it'd be more adaptable if you push the stats out through ASGI, and then have a small app that deals with how to log/collate/present them.
(Plus it makes it easier to then build shared interfaces across daphne/uvicorn/hypercorn, and makes it easier to build metrics displays that present information from across a whole cluster, or allow digging into a single process.)
In any case, probably better to start with a proposed implementation (or clearly defined set of metrics you'd like to see) and take things from there.
I was testing out something like this:
Internet -> Daphne -> Metrics logger ASGI app -> Real ASGI app
                                |
                                v
              Plugin interface to handle metrics
And it struck me that I'd never actually log stats in Daphne because I'd always run a reverse proxy in front and handle it like this.
Internet -> nginx -> Daphne -> Real ASGI app
               |
               v
        Log metrics here
The only thing lacking here would be any background tasks running in the Real ASGI app after the request is finished.
Is it common to expose Daphne directly to the internet?
Some people do, sure, but I imagine it's not everyone. This ticket is not meant to be background logging, anyway, it's meant to be a simple HTTP API that Daphne serves optionally that lets you see internal stats like current request numbers etc.
Something for prometheus to scrape would be excellent. django-prometheus surrounds the middleware to collect stats. Is there a way to surround the URLRouter in a middleware that just collects stats?
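Wrapping the router in a stats-collecting ASGI callable is straightforward in principle. Below is a minimal sketch (ASGI 3 single-callable style; the class, the counter names, and the stand-in app are all invented), including rendering the counters in the Prometheus text exposition format by hand so the example stays dependency-free:

```python
import asyncio

# Hypothetical middleware that wraps an inner app (e.g. a URLRouter)
# and counts incoming scopes by type.
class StatsMiddleware:
    def __init__(self, inner):
        self.inner = inner
        self.scope_counts = {}

    async def __call__(self, scope, receive, send):
        kind = scope.get("type", "unknown")
        self.scope_counts[kind] = self.scope_counts.get(kind, 0) + 1
        await self.inner(scope, receive, send)

    def prometheus_text(self):
        # Render counters in the Prometheus text exposition format.
        lines = ["# TYPE asgi_scopes_total counter"]
        for kind, count in sorted(self.scope_counts.items()):
            lines.append(f'asgi_scopes_total{{type="{kind}"}} {count}')
        return "\n".join(lines) + "\n"

async def dummy_app(scope, receive, send):
    pass  # stand-in for the real URLRouter(...)

app = StatsMiddleware(dummy_app)
asyncio.run(app({"type": "http"}, None, None))
asyncio.run(app({"type": "websocket"}, None, None))
print(app.prometheus_text())
```

In practice you would use prometheus_client's Counter objects instead of a dict, and expose them via its HTTP handler for Prometheus to scrape.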
I have a setup for minimal prometheus monitoring of uvicorn in an application.
It started with us having problems where Daphne locked up from time to time, with no way to introspect what was going on inside it. In the process we switched to uvicorn and have not had any problems since (it could also be related to other upgrades). I don't remember if I had to patch something inside Daphne to get it running with it or not.
It provides prometheus metrics for number of active tasks and received requests and a simple /tasks page which dumps asyncio task info for all active tasks. There might be race conditions since the asyncio loop is read from another thread (not sure what exactly the rules are) but we have not noticed any problems with it in production yet.
It's more of a stepping stone than something feature-complete, but it allows us to monitor our Django Channels system for warning signs (growing asyncio tasks and/or open file descriptors) and request rates.
# This file can be used with Daphne to launch with metrics exporter and debugger HTTP resources.
import asyncio
import logging
import os

import django

from channels.routing import get_default_application
from foobar.asgi_server import httpserver
from foobar.asgi_server.applications import MetricsApp

if "PROMETHEUS_PORT_RANGE" in os.environ:
    del os.environ["PROMETHEUS_PORT_RANGE"]

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "foobar.settings")
django.setup()

logger = logging.getLogger(__name__)

httpserver.start_server(loop=asyncio.get_event_loop())
application = MetricsApp(get_default_application())
import asyncio
import io
import logging
import socket
import threading
import time
from http.server import HTTPServer

import prometheus_client

from foobar.asgi_server import prom

logger = logging.getLogger(__name__)


def start_server(*, loop, addr="0.0.0.0", port_range=range(8001, 8065)):
    # Note that asyncio has to have been initialized at this point so that
    # asyncio.get_event_loop() returns the correct loop.
    MetricsHandler.loop = loop
    for port in port_range:
        try:
            httpd = HTTPServer((addr, port), MetricsHandler)
        except (OSError, socket.error):
            continue  # Try the next port
        thread = PrometheusEndpointServer(httpd)
        thread.daemon = True
        thread.start()
        logger.info("Exporting Prometheus /metrics/ on port %s", port)
        return


class MetricsHandler(prometheus_client.MetricsHandler):
    loop = None
    next_update = None

    def do_GET(self):
        if self.next_update is None or time.monotonic() > self.next_update:
            prom.asyncio_tasks.set(len(asyncio.all_tasks(loop=self.loop)))
            self.next_update = time.monotonic() + 1
        if self.path == "/tasks":
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
            self.end_headers()
            for v in asyncio.all_tasks(loop=self.loop):
                done = v.done()
                cancelled = v.cancelled()
                tid = id(v)
                ss = io.StringIO()
                ss.write(f"--------------- id:{tid} done:{done} cancelled:{cancelled}\n\n")
                try:
                    exc = v.exception()
                    if exc is not None:
                        ss.write("-- Exception: \n")
                        print(exc, file=ss)
                        ss.write("\n")
                except asyncio.CancelledError:
                    pass
                except asyncio.InvalidStateError:
                    pass
                ss.write("-- Traceback: \n")
                v.print_stack(file=ss)
                self.wfile.write(ss.getvalue().encode())
                self.wfile.write("\n\n\n".encode())
        else:
            super().do_GET()


class PrometheusEndpointServer(threading.Thread):
    """A thread class that holds an HTTPServer and makes it serve_forever()."""

    def __init__(self, httpd, *args, **kwargs):
        self.httpd = httpd
        super().__init__(*args, **kwargs)

    def run(self):
        self.httpd.serve_forever()
from prometheus_client import Counter, Gauge

req_recv = Counter("asgi_received_requests", "", ["type", "method"])
asyncio_tasks = Gauge("asyncio_active_tasks", "")
import uvicorn.config
import uvicorn.loops.uvloop
import uvicorn.main


def patch():
    """Patch out uvicorn's asyncio loop setup."""
    uvicorn.config.LOOP_SETUPS["auto"] = "foobar.asgi_server.uvicorn_utils:_fake_setup"
    uvicorn.config.LOOP_SETUPS["uvloop"] = "foobar.asgi_server.uvicorn_utils:_fake_setup"


def _fake_setup():
    """No-op to prevent uvicorn doing anything to the event loop."""
    pass


def uvloop_setup():
    """Run uvicorn's uvloop setup."""
    uvicorn.loops.uvloop.uvloop_setup()


def start():
    uvicorn.main.main()
import logging

from foobar.asgi_server import prom

logger = logging.getLogger(__name__)


class MetricsApp:
    # An ASGI application that exports very basic metrics using prometheus.
    def __init__(self, parent):
        self.parent = parent

    def __call__(self, scope):
        prom.req_recv.labels(scope.get("type", ""), scope.get("method", "")).inc()
        return self.parent(scope)
This is an example from a staging environment. We also monitor the load balancer (traefik) so we can compare the number of open connections externally to asyncio active tasks and file descriptors (because we are looking for indications of the server locking up).
I know my project could use something like this. Can we discuss the stability of the current design approaches?
Do the examples provided work well enough and are modifiable for other's use cases?
I would be happy if Daphne included such a prometheus exporter.
It would be nice if Daphne could provide an API for runtime metrics for things like messages/active connections.
I imagine this could be built from two APIs: first, a basic HTTP API we could hit to get the current number of active connections (in our case WebSocket) for that Daphne instance.
Secondly, a plugin/interface we could hook into to emit runtime events. We'd like to emit statsd events for connections opened/messages sent/closed etc., but I imagine there are other metric systems people would use, so a generic stats interface/plugin system would be better than embedding statsd into Daphne directly.
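The generic plugin idea could look something like the sketch below: a small set of hooks the server calls, with backends (statsd, Prometheus, plain logging) providing implementations. This is purely a hypothetical design, not Daphne's API; every name here is invented.

```python
# Hedged sketch of a generic stats plugin interface. The server would
# call these hooks at the relevant points; backends override them.
class StatsBackend:
    def connection_opened(self): pass
    def connection_closed(self): pass
    def message_sent(self): pass

class InMemoryBackend(StatsBackend):
    """Trivial backend that keeps counters in process memory."""
    def __init__(self):
        self.open = 0
        self.sent = 0
    def connection_opened(self):
        self.open += 1
    def connection_closed(self):
        self.open -= 1
    def message_sent(self):
        self.sent += 1

# A statsd backend would subclass StatsBackend and call something like
# statsd.incr(...) in each hook instead of mutating local counters.
backend = InMemoryBackend()
backend.connection_opened()
backend.message_sent()
backend.connection_closed()
print(backend.open, backend.sent)  # 0 1
```

Keeping the interface this small means any metrics system can plug in without the server growing a dependency on statsd specifically.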