bernhardreiter opened this issue 7 years ago
On IntelMQ-CB-Mailgen installations with large databases, we see that the stats calls are slow. Initial runs of PostgreSQL's EXPLAIN ANALYZE
have shown that indices are not used in many cases.
A typical query first filters on a time frame, which yields a very large result set, and then applies further criteria for which no index is used anymore, because the indexes used so far do not carry information that would allow applying a second index to the result set of the first.
Because the other search criteria also match often, it does not make sense for many queries to apply the other criteria first either: the intermediate result set would also be large, leading to the same problem.
The idea is to change the structure of the database so that indexes can be applied one after the other. At least the advantage of the time range should be exploited.
If apache2 is used as HTTPS server, we can add the time it takes to serve a request to the LogFormat: https://github.com/Intevation/intelmq-fody-backend/commit/e431153520d045fa62519b24e5835e06a2bbdddd
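The exact format string of the linked commit is not reproduced here; as a sketch, Apache's LogFormat can append the request duration at the end of each line. `%{ms}T` (Apache 2.4.13+) logs milliseconds, which matches the awk filter used further below; `%D` would log microseconds. The nickname `fodytiming` is hypothetical:

```apache
# combined-style log line with the request duration in milliseconds appended
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %{ms}T" fodytiming
CustomLog /var/log/apache2/fody-backend-access.log fodytiming
```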
If apache is used to log things, the relevant log files are

```
/var/log/apache2/fody-backend-access.log
/var/log/apache2/error.log
```

The error log contains the SQL queries if the log level is at least INFO. If some request parameters are transferred in the body, this file has to be used to see the details. Here are some commands that can help to count certain requests, here with parameters in the URL:

```shell
zgrep --only-matching "GET /api/events/search?[^ ]*" fody-backend-access.log*
```

or only requests not made by the client python-requests:

```shell
zgrep --invert-match --count python-requests fody-backend-access.log* | sort
```
One problem seems to be that many queries use the time.observation
field as a limiting parameter. As events can come in freshly with time.observation
well in the past, the database has to consider all entries. It would be interesting to know how big the typical gap is in a large database, and whether users could define a limit of x days, beyond which they no longer care whether time.observation
of a freshly incoming event lies in the past.
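A sketch for measuring that gap, assuming the events table has (or is extended with) a column recording the arrival time of a row; `inserted_at` is a hypothetical column name, not part of the stock schema:

```sql
-- Distribution of the gap between arrival time and time.observation.
-- "inserted_at" is a hypothetical column recording when the row arrived.
SELECT percentile_cont(0.99) WITHIN GROUP (
           ORDER BY inserted_at - "time.observation") AS gap_p99,
       max(inserted_at - "time.observation")          AS gap_max
  FROM events;
```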
This is likely to have an impact, as PostgreSQL's CLUSTER
(https://www.postgresql.org/docs/current/sql-cluster.html) can physically reorder a table to match an index. There are also partial indexes.
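A sketch of both ideas; the index names are hypothetical, the column name follows the intelmq eventdb. Note that CLUSTER takes an exclusive lock on the table while it runs, and the physical ordering is not maintained for rows inserted later:

```sql
-- Physically reorder the events table along a time index, so that rows
-- of one time range end up in neighbouring pages.
CREATE INDEX IF NOT EXISTS idx_events_time_observation
    ON events ("time.observation");
CLUSTER events USING idx_events_time_observation;
ANALYZE events;

-- A partial index only covers rows matching its predicate,
-- e.g. only recent events:
CREATE INDEX idx_events_recent
    ON events ("time.observation")
 WHERE "time.observation" > '2018-01-01';
```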
As a real `select count(*) from events;`
can be really slow, we can estimate the table size with:

```sql
SELECT reltuples::bigint AS estimate
  FROM pg_class
 WHERE oid = to_regclass('public.events');
```
Long queries can be understood better by issuing a statement like the following:

```sql
EXPLAIN (ANALYZE, BUFFERS) statement
```
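As a concrete sketch (the field names "time.observation" and "classification.type" follow the intelmq eventdb naming; adjust them to your schema):

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
  FROM events
 WHERE "time.observation" BETWEEN '2018-01-01' AND '2018-02-01'
   AND "classification.type" = 'malware';
```

The output shows whether the planner chose an index scan or a sequential scan, and how many buffers were read from cache versus disk.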
Measurements with EXPLAIN and atop showed that there was room for fine-tuning PostgreSQL 9.5; this helps to adapt to the available memory and the type of storage (e.g. a fast SSD or something else).
One source of optimisation ideas is https://pgtune.leopard.in.ua . Another is the documentation of PostgreSQL, e.g. https://www.postgresql.org/docs/9.5/runtime-config-query.html#GUC-RANDOM-PAGE-COST and https://www.postgresql.org/docs/9.5/runtime-config-resource.html#GUC-EFFECTIVE-IO-CONCURRENCY.
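As an illustration, a few settings such a tuning round might touch; the values below are pgtune-style starting points for a hypothetical dedicated machine with 16 GB RAM and SSD storage, not measured recommendations:

```
# postgresql.conf sketch; adjust to the actual machine
random_page_cost = 1.1          # SSDs make random reads almost as cheap as sequential ones
effective_io_concurrency = 200  # concurrent I/O requests the storage can handle
shared_buffers = 4GB            # ~25% of RAM
effective_cache_size = 12GB     # ~75% of RAM, a hint for the planner
work_mem = 64MB                 # per sort/hash node, raise with care
```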
The following commands, in the spirit of The Unix Programming Environment (https://en.wikipedia.org/wiki/The_Unix_Programming_Environment), may help to look at the fody log files to see how many requests are taking how long:
```shell
zgrep --invert-match python-requests fody-backend-access.log.2.gz | \
  sed 's/\(^.*\) \([0-9]*$\)/\2 \1/' | \
  awk '{ if ($1 > 1000) print $0 }' | \
  sort -n
```
Short explanation: the sed
moves the millisecond value to the front of each row; the awk
only prints lines where that value is more than a second (1000 ms).
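The pipeline can be tried on sample data; the log lines below are hypothetical, with the request duration in milliseconds as the last field:

```shell
# Three made-up access-log lines; the last field is the duration in ms.
printf '%s\n' \
  '10.0.0.1 - - "GET /api/events/search?a HTTP/1.1" 200 512 3500' \
  '10.0.0.2 - - "GET /api/events/stats HTTP/1.1" 200 512 120' \
  '10.0.0.3 - - "GET /api/events/search?b HTTP/1.1" 200 512 1200' |
sed 's/\(^.*\) \([0-9]*$\)/\2 \1/' |  # move the trailing millisecond value to the front
awk '$1 > 1000' |                     # keep only requests slower than one second
sort -n                               # slowest request last
```

The 120 ms request is filtered out; the two slow requests are printed with their durations first, in ascending order.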
When running a query over a full month, it may take several minutes (up to 10) on a test system we are running, which has a couple of million events in that time frame (e.g. 25M events).
One reason is that the SQL query sometimes uses the index and sometimes it does not.