coroot / coroot-pg-agent

A Prometheus exporter for Postgres focusing on query performance statistics
https://coroot.com/docs/metrics/pg-agent
Apache License 2.0
120 stars 15 forks source link

High CPU load caused by coroot-pg-agent #11

Closed alzamonn closed 1 year ago

alzamonn commented 1 year ago

Environmental Info: CentOS Linux release 7.9.2009 (Core) 3.10.0-1160.76.1.el7.x86_64 PostgreSQL 9.6.24 pg_stat_statements 1.4

RHEL7.9 5.4.17-2136.312.3.4.el7uek.x86_64 PostgreSQL 12.11 pg_stat_statements 1.7

Describe the bug: We have few postgres instance where we seeing high CPU load caused by the process coroot-pg-agent. In some situations, the server crashes. After coroot-pg-agent is stopped load on the processor immediately decreases. When we start coroot-pg-agent after about a minute, the load increases.

node with 4 cpu:

top - 14:06:42 up 1 day,  3:49,  2 users,  load average: 6.97, 6.86, 6.82
Tasks: 303 total,   3 running, 300 sleeping,   0 stopped,   0 zombie
%Cpu(s): 98.6 us,  1.2 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.2 si,  0.0 st
KiB Mem : 26586992 total,  1309816 free, 11908980 used, 13368196 buff/cache
KiB Swap:  2097148 total,  2097148 free,        0 used.  7388264 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 1186 pgagent+  20   0 2400048   1.5g   5336 R 366.2  5.8   5696:09 coroot-pg-agent

Service coroot-pg-agent running via systemd unit.

[Service]
User=pgagent-exporter
Group=pgagent-exporter

ExecStart=/usr/bin/coroot-pg-agent 
--listen=:38888 
--scrape-interval=30s 
postgresql://user:pass@localhost:5432/postgres?connect_timeout=5&statement_timeout=30000&sslmode=disable

Please advise about this issue for some deep research. If you need more information please let me know.

Additional context / logs:

USER       PID %CPU %MEM    VSZ     RSS      TTY      STAT START   TIME     COMMAND
pgagent+  1186  341  5.8   2400048 1557152  ?         Rsl  Dec07   5328:23  /usr/bin/coroot-pg-agent --listen=:38888 --scrape-interval=30s postgresql://localhost:5432/postgres?connect_timeout=5&statement_timeout=30000&sslmode=disable
pstree -p -s 1186
systemd(1)───coroot-pg-agent(1186)─┬─{coroot-pg-agent}(1483)
                                   ├─{coroot-pg-agent}(1486)
                                   ├─{coroot-pg-agent}(1487)
                                   ├─{coroot-pg-agent}(1488)
                                   ├─{coroot-pg-agent}(1489)
                                   └─{coroot-pg-agent}(1546)

lsof

coroot-pg  9660        pgagent-exporter  cwd       DIR                8,1       4096         96 /
coroot-pg  9660        pgagent-exporter  rtd       DIR                8,1       4096         96 /
coroot-pg  9660        pgagent-exporter  txt       REG                8,1   10911712   16968830 /usr/bin/coroot-pg-agent
coroot-pg  9660        pgagent-exporter  mem       REG                8,1      61560   25166896 /usr/lib64/libnss_files-2.17.so
coroot-pg  9660        pgagent-exporter  mem       REG                8,1    2156592   25166783 /usr/lib64/libc-2.17.so
coroot-pg  9660        pgagent-exporter  mem       REG                8,1     142144   25166808 /usr/lib64/libpthread-2.17.so
coroot-pg  9660        pgagent-exporter  mem       REG                8,1     163312   25166775 /usr/lib64/ld-2.17.so
coroot-pg  9660        pgagent-exporter    0r      CHR                1,3        0t0       1031 /dev/null
coroot-pg  9660        pgagent-exporter    1w      CHR                1,3        0t0       1031 /dev/null
coroot-pg  9660        pgagent-exporter    2w      CHR                1,3        0t0       1031 /dev/null
coroot-pg  9660        pgagent-exporter    3u     IPv4          185298491        0t0        TCP localhost:37910->localhost:postgres (ESTABLISHED)
coroot-pg  9660        pgagent-exporter    4u  a_inode               0,10          0       6394 [eventpoll]
coroot-pg  9660        pgagent-exporter    5r     FIFO                0,9        0t0  185298488 pipe
coroot-pg  9660        pgagent-exporter    6w     FIFO                0,9        0t0  185298488 pipe
coroot-pg  9660        pgagent-exporter    7u     IPv6          185298499        0t0        TCP *:38888 (LISTEN)
coroot-pg  9660        pgagent-exporter    8u     sock                0,7        0t0  185311614 protocol: TCPv6
coroot-pg  9660        pgagent-exporter    9u     sock                0,7        0t0  185304469 protocol: TCPv6
coroot-pg  9660        pgagent-exporter   10u     sock                0,7        0t0  185305720 protocol: TCPv6
coroot-pg  9660        pgagent-exporter   11u     sock                0,7        0t0  185309402 protocol: TCPv6
coroot-pg  9660        pgagent-exporter   12u     sock                0,7        0t0  185314100 protocol: TCPv6
coroot-pg  9660        pgagent-exporter   13u     sock                0,7        0t0  185320874 protocol: TCPv6
coroot-pg  9660        pgagent-exporter   14u     sock                0,7        0t0  185325269 protocol: TCPv6
coroot-pg  9660        pgagent-exporter   15u     sock                0,7        0t0  185329732 protocol: TCPv6
coroot-pg  9660        pgagent-exporter   16u     sock                0,7        0t0  185334777 protocol: TCPv6
coroot-pg  9660        pgagent-exporter   17u     sock                0,7        0t0  185336509 protocol: TCPv6
coroot-pg  9660        pgagent-exporter   18u     sock                0,7        0t0  185341141 protocol: TCPv6
def commented 1 year ago

Thank you for the report. Can you please share a CPU profile:

curl -o cpu_profile.tgz 'http://<exporter_ip>:<exporter_port>/debug/pprof/profile?seconds=60'
alzamonn commented 1 year ago

report attached. thanks for reply cpu_profile1714.zip

alzamonn commented 1 year ago

also adding more info

/usr/bin/coroot-pg-agent --version
1.2.0

metric request does not work and freezes in this moment

 curl http://localhost:38888/metrics > metrics1.txt
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  **0:04:41** --:--:--     0
def commented 1 year ago

Looks like there are very large queries in pg_stat_statements. Let's check this out: select count(1), max(length(query)) from pg_stat_statements

alzamonn commented 1 year ago

hi! I checked.

first db

postgres=# select count(1), max(length(query)) from pg_stat_statements;
 count |  max
-------+--------
  4773 | 243392
(1 row)

second db

postgres=# select count(1), max(length(query)) from pg_stat_statements;
 count |  max
-------+--------
  4771 | 630379
(1 row)

for instance, other random db in our environment without problem for now

postgres=# select count(1), max(length(query)) from pg_stat_statements;
 count |  max
-------+-------
  2279 | 24206
(1 row)
def commented 1 year ago

Yeah, 630kb is a lot. We definitely should trim such long queries. Will fix it ASAP.

apetruhin commented 1 year ago

@alzamonn, please update to coroot-pg-agent:1.2.1

dxops commented 1 year ago

Hey guys, pg_stat_statements treats different amount of parameters like different queries:

9.6 DELETE FROM table WHERE id1 IN (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?); DELETE FROM table WHERE id1 IN (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?);

Fixed in PG 10 https://github.com/postgres/postgres/commit/83f2061dd037477ec8479ee160367840e203a722 Some attempts to fix https://stackoverflow.com/questions/74220833/how-to-make-pg-stat-statement-merge-queries-with-variable-number-of-parameters-i

With 1000 ids in array paramter those will be pretty long queries.

One more item affects all PG versions still:

10+ https://github.com/postgres/postgres/commit/a6f22e83562d8b78293229587cd3d9430d16d466 INSERT INTO table (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10) values ($1,$2,$3,$4,$5,$6,$7,$8,$9,$10),($11,$12,$13,$14,$15,$16,$17,$18,$19,$20); INSERT INTO table (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10) values ($1,$2,$3,$4,$5,$6,$7,$8,$9,$10);

9.6 INSERT INTO table (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10) values (?,?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?,?); INSERT INTO table (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10) values (?,?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?,?),(?,?,?,?,?,?,?,?,?,?);

Instead of cutting off WHERE statements in the end of the query it's better to reduce amount of parameters to a single one to make queries shorter and distinct.

For 1000+ column tables make sense to replace columns list with * to reduce query size as well, because columnw list will not affect any PG performance stats.

def commented 1 year ago

@dxops, in this particular case the most CPU time was spent on query text normalization. So, I don't see any other way to make CPU consumption predictable than to trim query texts.

Collapsing repetitive query arguments is already implemented in pg-agent.