bernhardreiter opened this issue 7 years ago
On IntelMQ-CB-Mailgen installations with large databases, we see that the stats calls are slow. Initial runs of PostgreSQL's EXPLAIN ANALYZE
have shown that indices are not used in many cases.
A typical query first filters on a time frame, which yields a very large result set, and then applies further criteria for which no index is used anymore, because the indexes used so far do not carry information that would allow applying a second index to the result set of the first.
Because the other search criteria also match often, it does not make sense for many queries to apply the other criteria first either: the intermediate result set would also be large, leading to the same problem.
The idea is to change the structure of the database so that indexes can be applied one after the other. At least the advantage of the time range should be exploited.
If apache2 is used as HTTPS server, we can add the time it takes to serve a request to the LogFormat: https://github.com/Intevation/intelmq-fody-backend/commit/e431153520d045fa62519b24e5835e06a2bbdddd
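The exact format string of the linked commit is not reproduced here; as a sketch, Apache's LogFormat can append the request duration at the end of each line. `%{ms}T` (Apache 2.4.13+) logs milliseconds, which matches the awk filter used further below; `%D` would log microseconds. The nickname `fodytiming` is hypothetical:

```apache
# combined-style log line with the request duration in milliseconds appended
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %{ms}T" fodytiming
CustomLog /var/log/apache2/fody-backend-access.log fodytiming
```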
If apache is used to log things, the relevant log files are

```
/var/log/apache2/fody-backend-access.log
/var/log/apache2/error.log
```

The error log contains the SQL queries if the log level is at least INFO. If some request parameters are transferred in the body, this file has to be used to see the details. Here are some commands that can help to count certain requests, here with parameters in the URL:

```shell
zgrep --only-matching "GET /api/events/search?[^ ]*" fody-backend-access.log*
```

or only requests not made by the client python-requests:

```shell
zgrep --invert-match --count python-requests fody-backend-access.log* | sort
```
One problem seems to be that many queries use the time.observation
field as a limiting parameter. As events can come in freshly with time.observation
well in the past, the database has to consider all entries. It would be interesting to know how big the typical gap is in a large database, and whether users could define a limit of x days, beyond which they no longer care whether time.observation
of a freshly incoming event lies in the past.
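A sketch for measuring that gap, assuming the events table has (or is extended with) a column recording the arrival time of a row; `inserted_at` is a hypothetical column name, not part of the stock schema:

```sql
-- Distribution of the gap between arrival time and time.observation.
-- "inserted_at" is a hypothetical column recording when the row arrived.
SELECT percentile_cont(0.99) WITHIN GROUP (
           ORDER BY inserted_at - "time.observation") AS gap_p99,
       max(inserted_at - "time.observation")          AS gap_max
  FROM events;
```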
This is likely to have an impact, as PostgreSQL's CLUSTER
(https://www.postgresql.org/docs/current/sql-cluster.html) can physically reorder a table to match an index. There are also partial indexes.
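A sketch of both ideas; the index names are hypothetical, the column name follows the intelmq eventdb. Note that CLUSTER takes an exclusive lock on the table while it runs, and the physical ordering is not maintained for rows inserted later:

```sql
-- Physically reorder the events table along a time index, so that rows
-- of one time range end up in neighbouring pages.
CREATE INDEX IF NOT EXISTS idx_events_time_observation
    ON events ("time.observation");
CLUSTER events USING idx_events_time_observation;
ANALYZE events;

-- A partial index only covers rows matching its predicate,
-- e.g. only recent events:
CREATE INDEX idx_events_recent
    ON events ("time.observation")
 WHERE "time.observation" > '2018-01-01';
```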
As a real `select count(*) from events;`
can be really slow, we can estimate the table size with:

```sql
SELECT reltuples::bigint AS estimate
  FROM pg_class
 WHERE oid = to_regclass('public.events');
```
Long queries can be understood better by issuing a statement like the following:

```sql
EXPLAIN (ANALYZE, BUFFERS) statement
```
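As a concrete sketch (the field names "time.observation" and "classification.type" follow the intelmq eventdb naming; adjust them to your schema):

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
  FROM events
 WHERE "time.observation" BETWEEN '2018-01-01' AND '2018-02-01'
   AND "classification.type" = 'malware';
```

The output shows whether the planner chose an index scan or a sequential scan, and how many buffers were read from cache versus disk.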
Measurements with EXPLAIN and atop showed that there was room for fine-tuning PostgreSQL 9.5; this helps to adapt to the available memory and the type of storage (e.g. a fast SSD or something else).
One source of optimisation ideas is https://pgtune.leopard.in.ua . Another is the documentation of PostgreSQL, e.g. https://www.postgresql.org/docs/9.5/runtime-config-query.html#GUC-RANDOM-PAGE-COST and https://www.postgresql.org/docs/9.5/runtime-config-resource.html#GUC-EFFECTIVE-IO-CONCURRENCY.
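As an illustration, a few settings such a tuning round might touch; the values below are pgtune-style starting points for a hypothetical dedicated machine with 16 GB RAM and SSD storage, not measured recommendations:

```
# postgresql.conf sketch; adjust to the actual machine
random_page_cost = 1.1          # SSDs make random reads almost as cheap as sequential ones
effective_io_concurrency = 200  # concurrent I/O requests the storage can handle
shared_buffers = 4GB            # ~25% of RAM
effective_cache_size = 12GB     # ~75% of RAM, a hint for the planner
work_mem = 64MB                 # per sort/hash node, raise with care
```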
The following commands, in the spirit of The Unix Programming Environment (https://en.wikipedia.org/wiki/The_Unix_Programming_Environment), may help to look at the fody log files to see how many requests are taking how long:
```shell
zgrep --invert-match python-requests fody-backend-access.log.2.gz | \
  sed 's/\(^.*\) \([0-9]*$\)/\2 \1/' | \
  awk '{ if ($1 > 1000) print $0 }' | \
  sort -n
```
Short explanation: the sed
moves the millisecond value to the front of each row; the awk
only prints lines where that value is more than a second (1000 ms).
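The pipeline can be tried on sample data; the log lines below are hypothetical, with the request duration in milliseconds as the last field:

```shell
# Three made-up access-log lines; the last field is the duration in ms.
printf '%s\n' \
  '10.0.0.1 - - "GET /api/events/search?a HTTP/1.1" 200 512 3500' \
  '10.0.0.2 - - "GET /api/events/stats HTTP/1.1" 200 512 120' \
  '10.0.0.3 - - "GET /api/events/search?b HTTP/1.1" 200 512 1200' |
sed 's/\(^.*\) \([0-9]*$\)/\2 \1/' |  # move the trailing millisecond value to the front
awk '$1 > 1000' |                     # keep only requests slower than one second
sort -n                               # slowest request last
```

The 120 ms request is filtered out; the two slow requests are printed with their durations first, in ascending order.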
When running a query over a full month, it may take several minutes (up to 10) on a test system we are running, which has a couple of million events in that time frame (e.g. 25M events).
One reason is that the SQL query sometimes uses the index and sometimes it does not.