deltachat / chatmail

chatmail service deployment scripts and docs
https://delta.chat/en/2023-12-13-chatmail
MIT License

Setup monitoring #38

Open link2xt opened 8 months ago

link2xt commented 8 months ago

postqueue -j prints the Postfix queue as JSON, one message per line, so piping it into wc -l counts the queued messages. The count can be recorded with RRDtool or a monitoring system based on it, but we can also use https://oss.oetiker.ch/rrdtool/prog/rrdpython.en.html directly.
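The counting step could also be done in Python instead of wc -l. This is a minimal sketch, assuming postqueue -j emits one JSON object per line with a queue_name field (as Postfix 3.1 and later do); the function names count_queue and read_postfix_queue are hypothetical, and the sample data is made up.

```python
import json
import subprocess
from collections import Counter


def count_queue(postqueue_json: str) -> Counter:
    """Count queued messages per queue name from `postqueue -j` output."""
    counts = Counter()
    for line in postqueue_json.splitlines():
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        # "queue_name" is e.g. "active", "deferred" or "hold".
        counts[entry.get("queue_name", "unknown")] += 1
    return counts


def read_postfix_queue() -> Counter:
    """Run postqueue and count its entries (requires Postfix on the host)."""
    result = subprocess.run(
        ["postqueue", "-j"], capture_output=True, text=True, check=True
    )
    return count_queue(result.stdout)


# Made-up sample input, so the sketch runs without a mail server.
sample = (
    '{"queue_name": "deferred", "queue_id": "A1"}\n'
    '{"queue_name": "active", "queue_id": "B2"}\n'
    '{"queue_name": "deferred", "queue_id": "C3"}\n'
)
counts = count_queue(sample)
print(sum(counts.values()), dict(counts))
```

Counting per queue name rather than only the total makes it possible to tell a growing deferred queue apart from normal active traffic.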

missytake commented 7 months ago

un-assigning myself for now as I'm not really working on it.

link2xt commented 7 months ago

Prometheus has a text format for exposing metrics (formerly a Protocol Buffers format, but this is deprecated) which got "standardized" as OpenMetrics.

This text format can be exported via HTTP endpoint and read by Prometheus, then processed however you want.

This is what the iroh project is doing: an iroh node has an HTTP endpoint where the metrics can be read from, and its prometheus-client dependency crate generates this text format.

If generating something that looks like the OpenMetrics text format with Python and without dependencies is easy enough, we can also do this and collect metric readings into a central Prometheus instance.

As long as we are only generating a text file and don't run Prometheus itself or similarly huge software on the chatmail server, it should be fine.
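To gauge how easy "without dependencies" would be, here is a minimal sketch of a renderer for the classic Prometheus text format. The function render_metrics and the metric names are hypothetical, and the queue value is illustrative. It does not handle labels, and strict OpenMetrics would additionally expect a final "# EOF" line.

```python
def render_metrics(metrics):
    """Render metrics as Prometheus text exposition format.

    `metrics` is a list of (name, type, help_text, value) tuples.
    """
    lines = []
    for name, metric_type, help_text, value in metrics:
        # Backslashes and newlines must be escaped in HELP text.
        escaped = help_text.replace("\\", "\\\\").replace("\n", "\\n")
        lines.append(f"# HELP {name} {escaped}")
        lines.append(f"# TYPE {name} {metric_type}")
        lines.append(f"{name} {value}")
    # The format is line-based; the last line must end with a newline too.
    return "\n".join(lines) + "\n"


# Hypothetical metric names, for illustration only.
text = render_metrics(
    [
        ("chatmail_accounts", "gauge", "Number of existing accounts.", 124151),
        ("chatmail_queued_messages", "gauge", "Messages in the Postfix queue.", 3),
    ]
)
print(text, end="")
```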

missytake commented 7 months ago

Here is an interesting guide on how to expose those metrics: https://www.oreilly.com/library/view/prometheus-up/9781098131135/ch04.html, ctrl+f for the headline "Text Exposition Format"

So we'd basically have a cron job which daily collects the numbers of messages and new accounts (anything more?), and writes them to a metrics file which we can expose under https://nine.testrun.org/metrics ?

Example:

# HELP yesterday_messages number of messages sent the day before we collected this metric
# TYPE yesterday_messages counter
# HELP yesterday_new_accounts number of new accounts created the day before we collected this metric
# TYPE yesterday_new_accounts counter
yesterday_messages{timestamp="1701969333549"} 23415
yesterday_new_accounts{timestamp="1701969333549"} 20
yesterday_messages{timestamp="1701969441417"} 34535
yesterday_new_accounts{timestamp="1701969441417"} 17

Am I getting this right?
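Two notes on the format, followed by a sketch of the file-writing step. In the Prometheus text format an optional timestamp goes after the value (in milliseconds since the epoch) rather than into a label, and a per-day figure that can go down again is usually typed as a gauge, since counters are meant to only increase. The sketch below assumes the cron job writes the file into a directory served by the webserver; write_metrics_file and the metric names are hypothetical.

```python
import os
import tempfile


def write_metrics_file(path, text):
    """Atomically replace `path` with `text` so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must be on the same filesystem for os.replace to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".metrics-")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(text)
        os.chmod(tmp_path, 0o644)  # mkstemp creates the file as 0600
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise


# Hypothetical metric names; the values are taken from the example above.
example = (
    "# HELP messages_yesterday Messages sent during the previous day.\n"
    "# TYPE messages_yesterday gauge\n"
    "messages_yesterday 23415\n"
    "# HELP new_accounts_yesterday Accounts created during the previous day.\n"
    "# TYPE new_accounts_yesterday gauge\n"
    "new_accounts_yesterday 20\n"
)

with tempfile.TemporaryDirectory() as demo_dir:
    target = os.path.join(demo_dir, "metrics")
    write_metrics_file(target, example)
    with open(target) as f:
        print(f.read(), end="")
```

Writing to a temporary file and renaming it means a scrape that happens while the cron job runs sees either the old file or the new one, never a half-written one.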

hpk42 commented 7 months ago

On Thu, Dec 07, 2023 at 09:18 -0800, missytake wrote:

> Here is an interesting guide on how to expose those metrics: https://www.oreilly.com/library/view/prometheus-up/9781098131135/ch04.html, ctrl+f for the headline "Text Exposition Format"
>
> So we'd basically have a cron job which daily collects the numbers of messages and new accounts (anything more?), and writes them to a metrics file which we can expose under https://nine.testrun.org/metrics ?

That's a good start. The format does not matter that much (even a CSV would be fine IMO, and whenever we add a new measurement type, we just append a column). I'd go for per-minute measurements though, or at least not just per-day. "Zooming out" can always be done later, but fine-grained data also makes the measurements more useful for real-time monitoring.
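The CSV variant could look roughly like the sketch below, which a per-minute cron job would call once per run. The function append_measurement and the column names are hypothetical. Rewriting the header when a new column is appended later is not handled here.

```python
import csv
import os
import tempfile
import time


def append_measurement(path, values, now=None):
    """Append one timestamped row; write a header first if the file is new."""
    now = int(time.time()) if now is None else int(now)
    # Dicts keep insertion order, so new measurement types go at the end.
    fieldnames = ["timestamp", *values]
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if is_new:
            writer.writeheader()
        writer.writerow({"timestamp": now, **values})


# Demo in a temporary directory; the numbers are illustrative.
with tempfile.TemporaryDirectory() as demo_dir:
    csv_path = os.path.join(demo_dir, "measurements.csv")
    append_measurement(csv_path, {"accounts": 124151, "ci_accounts": 123959})
    append_measurement(csv_path, {"accounts": 124152, "ci_accounts": 123959})
    with open(csv_path, newline="") as f:
        print(f.read(), end="")
```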

FWIW just did a quick random 10-minute measurement script, producing this output:

***@***.*** /home/vmail/mail/nine.testrun.org # python3 /tmp/x.py
measurement probe at timestamp: 1702028115
------------------------------------------
000007 accounts yesterday
115029 accounts last month
123959 ci accounts existing currently
000192 non-ci accounts
124151 accounts overall
------------------------------------------
time it took to measure: 0.79 seconds

The script is here:

import pathlib
import time

# Run from the mail root (here /home/vmail/mail/nine.testrun.org), where each
# account is a directory containing a maildir "cur" subdirectory.
accounts = ci_accounts = yesterday_accounts = lastmonth_accounts = 0
current = time.time()
yesterday_mtime = current - 60 * 60 * 24
lastmonth_mtime = current - 60 * 60 * 24 * 30

for x in pathlib.Path().iterdir():
    cur = x.joinpath("cur")
    try:
        mtime = cur.stat().st_mtime
    except OSError:
        # No readable "cur" directory, so this entry is not counted as an account.
        continue
    accounts += 1
    if x.name.startswith("ci"):
        ci_accounts += 1
    # Bucket the account by how recently its "cur" directory was modified.
    if mtime > yesterday_mtime:
        yesterday_accounts += 1
    if mtime > lastmonth_mtime:
        lastmonth_accounts += 1

print(f"measurement probe at timestamp: {current:.0f}")
print("------------------------------------------")
print(f"{yesterday_accounts:06d} accounts yesterday")
print(f"{lastmonth_accounts:06d} accounts last month")
print(f"{ci_accounts:06d} ci accounts existing currently")
print(f"{accounts - ci_accounts:06d} non-ci accounts")
print(f"{accounts:06d} accounts overall")
print("------------------------------------------")
print(f"time it took to measure: {time.time() - current:.2f} seconds")

link2xt commented 7 months ago

<offtopic> Speaking about monitoring, might be interesting to add metrics to Delta Chat core and expose them for bots in this way. Then collect number of messages sent, received, number of contacts, number of chats, number of rows in msgs table, blobdir size, number of reconnects, number of IDLE timeouts etc. for all bots. How exactly to expose the metrics (via HTTP, by sending a mail somewhere etc.) is up to the bot, the core only needs to have a call providing a single text blob. </offtopic>

Not offtopic, I think it may be cool to send chatmail metrics outside via sendmail to a configured address rather than exposing them via HTTP.
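Sending the metrics by mail could be sketched as follows, assuming a sendmail-compatible binary at /usr/sbin/sendmail (as Postfix provides). The function names and addresses are hypothetical. Only the message construction runs in the demo; send_with_sendmail needs a working mail setup.

```python
import subprocess
from email.message import EmailMessage


def build_metrics_mail(metrics_text, sender, recipient):
    """Wrap a metrics text blob in a plain-text mail."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "chatmail metrics"
    msg.set_content(metrics_text)
    return msg


def send_with_sendmail(msg, sendmail="/usr/sbin/sendmail"):
    """Hand the message to the local MTA (not called in this demo)."""
    # -t: take recipients from the headers; -oi: a lone "." does not end input.
    subprocess.run([sendmail, "-t", "-oi"], input=msg.as_bytes(), check=True)


# Placeholder addresses and a hypothetical metric name.
msg = build_metrics_mail(
    "chatmail_accounts 124151\n", "metrics@example.org", "admin@example.org"
)
print(msg["Subject"])
```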

link2xt commented 6 months ago

We have https://prometheus.testrun.org/ running now; https://prometheus.testrun.org/metrics can be used as a reference example for the format Prometheus expects.

link2xt commented 6 months ago

Instead of being scraped via HTTP, Prometheus also supports pushing via Pushgateway, and its README even has examples of pushing metrics with curl. So we can push whatever metrics we want from cron jobs without having to write files accessible by the webserver.

The Python client library also supports pushing: https://prometheus.github.io/client_python/exporting/pushgateway/
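A push from a cron job does not strictly need the client library. The sketch below uses only the standard library and the Pushgateway HTTP API (PUT to /metrics/job/<job> with the text format as body). The gateway address http://localhost:9091, the job name, and the function names are assumptions; the demo only builds the request and does not send it.

```python
import urllib.request
from urllib.parse import quote


def build_push_request(gateway_url, job, metrics_text):
    """Build a PUT request that replaces all metrics of `job` on the Pushgateway."""
    url = f"{gateway_url.rstrip('/')}/metrics/job/{quote(job, safe='')}"
    return urllib.request.Request(
        url,
        data=metrics_text.encode("utf-8"),  # body must end with a newline
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


def push(request, timeout=10):
    """Send the request; returns the HTTP status (not called in this demo)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Hypothetical gateway address, job name, and metric name.
request = build_push_request(
    "http://localhost:9091", "chatmail_cron", "chatmail_accounts 124151\n"
)
print(request.get_method(), request.full_url)
```

PUT replaces all metrics previously pushed for the same job, while POST would only replace metrics with the same names; which one fits depends on whether several cron jobs share one job name.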

link2xt commented 6 months ago

Dovecot has a statistics module which allows defining metrics: https://doc.dovecot.org/configuration_manual/stats/

The metrics can be read with doveadm stats dump or exported as OpenMetrics (https://doc.dovecot.org/configuration_manual/stats/openmetrics/). The OpenMetrics export is only available via an HTTP endpoint.
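If we wanted to read such an HTTP export ourselves rather than let Prometheus scrape it, a small parser for the text format would be enough. This is a rough sketch, not a complete parser: parse_samples is a hypothetical helper, the sample metric names are invented and not actual Dovecot metric names, and escaped characters inside label values are not interpreted.

```python
def parse_samples(text):
    """Parse exposition-format samples into {series: float}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comments (# HELP, # TYPE, # EOF).
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Series name plus labels ends at the last closing brace.
            end = line.rindex("}") + 1
            series, rest = line[:end], line[end:]
        else:
            series, _, rest = line.partition(" ")
        fields = rest.split()
        samples[series] = float(fields[0])  # fields[1], if present, is a timestamp
    return samples


# Invented sample data in OpenMetrics style.
sample = (
    "# HELP example_logins_total Hypothetical counter.\n"
    "# TYPE example_logins_total counter\n"
    'example_logins_total{result="ok"} 42\n'
    "example_uptime_seconds 12.5\n"
    "# EOF\n"
)
print(parse_samples(sample))
```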