bucardo / check_postgres

Nagios check_postgres plugin for checking status of PostgreSQL databases
http://bucardo.org/wiki/Check_postgres
Other
561 stars 175 forks source link

Bloat check is extremely slow with one server but fast on another server #169

Closed esabol closed 4 years ago

esabol commented 4 years ago

I have two servers running PostgreSQL 10.11 that are configured the same, contain mostly the same databases and mostly the same tables.

On one server, the bloat check takes 3-4 seconds.

On the other server, the bloat check takes 24-30 minutes.

Any idea why the second server is like 500 times slower? Normal queries seem fine.

Also, this condition just randomly started like 3 days ago. I've restarted the "slow" server and done a "vacuum analyze;" on it. Any help or suggestions would be appreciated. Thanks!

esabol commented 4 years ago

Hmmm, well, this bloat subquery:

SELECT
        ns.nspname, tbl.relname, hdr, ma, bs,
        SUM((1-coalesce(null_frac,0))*coalesce(avg_width, 2048)) AS datawidth,
        MAX(coalesce(null_frac,0)) AS maxfracsum,
        hdr+(
          SELECT 1+count(*)/8
          FROM pg_stats s2
          WHERE null_frac<>0 AND s2.schemaname = ns.nspname AND s2.tablename = tbl.relname
        ) AS nullhdr
      FROM pg_attribute att
      JOIN pg_class tbl ON att.attrelid = tbl.oid
      JOIN pg_namespace ns ON ns.oid = tbl.relnamespace
      LEFT JOIN pg_stats s ON s.schemaname=ns.nspname
      AND s.tablename = tbl.relname
      AND s.inherited=false
      AND s.attname=att.attname,
      (
        SELECT
          (SELECT current_setting('block_size')::numeric) AS bs,
            CASE WHEN SUBSTRING(SPLIT_PART(v, ' ', 2) FROM '#\"[0-9]+.[0-9]+#\"%' for '#')
              IN ('8.0','8.1','8.2') THEN 27 ELSE 23 END AS hdr,
          CASE WHEN v ~ 'mingw32' OR v ~ '64-bit' THEN 8 ELSE 4 END AS ma
        FROM (SELECT version() AS v) AS foo
      ) AS constants
      WHERE att.attnum > 0 AND tbl.relkind='r'
      GROUP BY 1,2,3,4,5;

returns 77,400 rows on the "slow" server and like 2,200 rows on the "fast" server.

select count(*) from pg_stats returns 814512 on the "slow" server and 45193 on the "fast" server. Is there a way to shrink/clear pg_stats? Help, anyone?

esabol commented 4 years ago

Since this doesn't seem to have much to do with check_postgres specifically, I have posted my question over on Stack Exchange in case anyone would prefer to answer there.

https://dba.stackexchange.com/questions/254847/pg-stats-is-huge-and-slowing-postgresql-server-down-what-do-i-do

machack666 commented 4 years ago

I don't think those rowcounts are particularly concerning in and of themselves; can you paste the EXPLAIN (with and without ANALYZE) output for the slow/fast servers?

With 30 minute runtime, I'd be looking for locks instead of something inherently slow about this query, but checking on if there are any differences in the EXPLAIN plans is a good start.

esabol commented 4 years ago

@machack666: Thanks for your reply. Sorry it has taken so long to get back to you. Holidays and such.

Anyway, this continues to be a problem for me. I'm not sure what the EXPLAIN output shows other than the number of rows for various parts of the query being significantly higher on the one server compared to the other. But if you gain any insights from this, I'd love to hear them. Thanks!

Here's the output of EXPLAIN on the "slow" server:

                                                                                                                                            QUERY PLAN                                                                                                                                            
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=1935395.99..4529587255.06 rows=4097995 width=188)
   Group Key: ns.nspname, tbl.relname, CASE WHEN ("substring"(split_part(version(), ' '::text, 2), '#\"[0-9]+.[0-9]+#\"%'::text, '#'::text) = ANY ('{8.0,8.1,8.2}'::text[])) THEN 27 ELSE 23 END, CASE WHEN ((version() ~ 'mingw32'::text) OR (version() ~ '64-bit'::text)) THEN 8 ELSE 4 END, $0
   InitPlan 1 (returns $0)
     ->  Result  (cost=0.00..0.02 rows=1 width=32)
   ->  Merge Left Join  (cost=1935395.97..2063714.73 rows=4097995 width=176)
         Merge Cond: ((ns.nspname = s.schemaname) AND (tbl.relname = s.tablename) AND (att.attname = s.attname))
         ->  Sort  (cost=1427752.62..1437997.60 rows=4097995 width=192)
               Sort Key: ns.nspname, tbl.relname, att.attname
               ->  Hash Join  (cost=15335.36..599474.35 rows=4097995 width=192)
                     Hash Cond: (tbl.relnamespace = ns.oid)
                     ->  Hash Join  (cost=15246.41..588609.50 rows=4097995 width=132)
                           Hash Cond: (att.attrelid = tbl.oid)
                           ->  Seq Scan on pg_attribute att  (cost=0.00..532962.28 rows=15390604 width=68)
                                 Filter: (attnum > 0)
                           ->  Hash  (cost=14050.06..14050.06 rows=95708 width=72)
                                 ->  Seq Scan on pg_class tbl  (cost=0.00..14050.06 rows=95708 width=72)
                                       Filter: (relkind = 'r'::"char")
                     ->  Hash  (cost=58.42..58.42 rows=2442 width=68)
                           ->  Seq Scan on pg_namespace ns  (cost=0.00..58.42 rows=2442 width=68)
         ->  Sort  (cost=507643.36..507707.46 rows=25640 width=200)
               Sort Key: s.schemaname, s.tablename, s.attname
               ->  Subquery Scan on s  (cost=23845.50..505765.73 rows=25640 width=200)
                     ->  Gather  (cost=23845.50..505509.33 rows=25640 width=401)
                           Workers Planned: 2
                           ->  Hash Left Join  (cost=22845.50..501945.33 rows=10683 width=401)
                                 Hash Cond: (c.relnamespace = n.oid)
                                 ->  Hash Join  (cost=22756.56..501828.29 rows=10683 width=140)
                                       Hash Cond: (s_1.starelid = c.oid)
                                       Join Filter: has_column_privilege(c.oid, a.attnum, 'select'::text)
                                       ->  Nested Loop  (cost=0.43..462367.09 rows=416791 width=82)
                                             ->  Parallel Seq Scan on pg_statistic s_1  (cost=0.00..53955.73 rows=236437 width=14)
                                                   Filter: (NOT stainherit)
                                             ->  Index Scan using pg_attribute_relid_attnum_index on pg_attribute a  (cost=0.43..3.02 rows=2 width=70)
                                                   Index Cond: ((attrelid = s_1.starelid) AND (attnum = s_1.staattnum))
                                                   Filter: (NOT attisdropped)
                                       ->  Hash  (cost=14050.06..14050.06 rows=359445 width=72)
                                             ->  Seq Scan on pg_class c  (cost=0.00..14050.06 rows=359445 width=72)
                                                   Filter: ((NOT relrowsecurity) OR (NOT row_security_active(oid)))
                                 ->  Hash  (cost=58.42..58.42 rows=2442 width=68)
                                       ->  Seq Scan on pg_namespace n  (cost=0.00..58.42 rows=2442 width=68)
   SubPlan 2
     ->  Aggregate  (cost=1104.73..1104.74 rows=1 width=8)
           ->  Nested Loop  (cost=1.56..1104.72 rows=1 width=401)
                 Join Filter: (c_1.oid = s_2.starelid)
                 ->  Nested Loop  (cost=1.14..1103.30 rows=1 width=10)
                       ->  Nested Loop  (cost=0.70..16.75 rows=1 width=4)
                             Join Filter: (c_1.relnamespace = n_1.oid)
                             ->  Index Scan using pg_class_relname_nsp_index on pg_class c_1  (cost=0.42..8.44 rows=1 width=8)
                                   Index Cond: (relname = tbl.relname)
                                   Filter: ((NOT relrowsecurity) OR (NOT row_security_active(oid)))
                             ->  Index Scan using pg_namespace_nspname_index on pg_namespace n_1  (cost=0.28..8.30 rows=1 width=4)
                                   Index Cond: (nspname = ns.nspname)
                       ->  Index Scan using pg_attribute_relid_attnum_index on pg_attribute a_1  (cost=0.43..1084.53 rows=201 width=6)
                             Index Cond: (attrelid = c_1.oid)
                             Filter: ((NOT attisdropped) AND has_column_privilege(c_1.oid, attnum, 'select'::text))
                 ->  Index Scan using pg_statistic_relid_att_inh_index on pg_statistic s_2  (cost=0.43..1.07 rows=28 width=6)
                       Index Cond: ((starelid = a_1.attrelid) AND (staattnum = a_1.attnum))
                       Filter: (stanullfrac <> '0'::double precision)
(58 rows)

Here's the output of EXPLAIN on the "fast" server:

                                                                                                                                            QUERY PLAN                                                                                                                                            
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=18339.88..841921.77 rows=13555 width=188)
   Group Key: ns.nspname, tbl.relname, CASE WHEN ("substring"(split_part(version(), ' '::text, 2), '#\"[0-9]+.[0-9]+#\"%'::text, '#'::text) = ANY ('{8.0,8.1,8.2}'::text[])) THEN 27 ELSE 23 END, CASE WHEN ((version() ~ 'mingw32'::text) OR (version() ~ '64-bit'::text)) THEN 8 ELSE 4 END, $0
   InitPlan 1 (returns $0)
     ->  Result  (cost=0.00..0.02 rows=1 width=32)
   ->  Merge Left Join  (cost=18339.86..18764.78 rows=13555 width=176)
         Merge Cond: ((ns.nspname = s.schemaname) AND (tbl.relname = s.tablename) AND (att.attname = s.attname))
         ->  Sort  (cost=5887.15..5921.04 rows=13555 width=192)
               Sort Key: ns.nspname, tbl.relname, att.attname
               ->  Hash Join  (cost=708.49..4956.84 rows=13555 width=192)
                     Hash Cond: (tbl.relnamespace = ns.oid)
                     ->  Hash Join  (cost=670.68..4883.30 rows=13555 width=132)
                           Hash Cond: (att.attrelid = tbl.oid)
                           ->  Seq Scan on pg_attribute att  (cost=0.00..4007.49 rows=78125 width=68)
                                 Filter: (attnum > 0)
                           ->  Hash  (cost=637.01..637.01 rows=2693 width=72)
                                 ->  Seq Scan on pg_class tbl  (cost=0.00..637.01 rows=2693 width=72)
                                       Filter: (relkind = 'r'::"char")
                     ->  Hash  (cost=25.14..25.14 rows=1014 width=68)
                           ->  Seq Scan on pg_namespace ns  (cost=0.00..25.14 rows=1014 width=68)
         ->  Sort  (cost=12452.71..12453.04 rows=131 width=200)
               Sort Key: s.schemaname, s.tablename, s.attname
               ->  Subquery Scan on s  (cost=5455.47..12448.10 rows=131 width=200)
                     ->  Hash Left Join  (cost=5455.47..12446.79 rows=131 width=401)
                           Hash Cond: (c.relnamespace = n.oid)
                           ->  Hash Join  (cost=5417.65..12408.63 rows=131 width=140)
                                 Hash Cond: ((s_1.starelid = c.oid) AND (s_1.staattnum = a.attnum))
                                 ->  Seq Scan on pg_statistic s_1  (cost=0.00..6793.45 rows=19622 width=14)
                                       Filter: (NOT stainherit)
                                 ->  Hash  (cost=4852.66..4852.66 rows=37666 width=142)
                                       ->  Hash Join  (cost=831.03..4852.66 rows=37666 width=142)
                                             Hash Cond: (a.attrelid = c.oid)
                                             Join Filter: has_column_privilege(c.oid, a.attnum, 'select'::text)
                                             ->  Seq Scan on pg_attribute a  (cost=0.00..3724.99 rows=112999 width=70)
                                                   Filter: (NOT attisdropped)
                                             ->  Hash  (cost=637.01..637.01 rows=15521 width=72)
                                                   ->  Seq Scan on pg_class c  (cost=0.00..637.01 rows=15521 width=72)
                                                         Filter: ((NOT relrowsecurity) OR (NOT row_security_active(oid)))
                           ->  Hash  (cost=25.14..25.14 rows=1014 width=68)
                                 ->  Seq Scan on pg_namespace n  (cost=0.00..25.14 rows=1014 width=68)
   SubPlan 2
     ->  Aggregate  (cost=60.64..60.66 rows=1 width=8)
           ->  Nested Loop  (cost=1.39..60.63 rows=1 width=401)
                 Join Filter: (c_1.oid = s_2.starelid)
                 ->  Nested Loop  (cost=1.10..60.06 rows=1 width=10)
                       ->  Nested Loop  (cost=0.69..16.73 rows=1 width=4)
                             Join Filter: (c_1.relnamespace = n_1.oid)
                             ->  Index Scan using pg_class_relname_nsp_index on pg_class c_1  (cost=0.41..8.43 rows=1 width=8)
                                   Index Cond: (relname = tbl.relname)
                                   Filter: ((NOT relrowsecurity) OR (NOT row_security_active(oid)))
                             ->  Index Scan using pg_namespace_nspname_index on pg_namespace n_1  (cost=0.28..8.29 rows=1 width=4)
                                   Index Cond: (nspname = ns.nspname)
                       ->  Index Scan using pg_attribute_relid_attnum_index on pg_attribute a_1  (cost=0.42..43.29 rows=4 width=6)
                             Index Cond: (attrelid = c_1.oid)
                             Filter: ((NOT attisdropped) AND has_column_privilege(c_1.oid, attnum, 'select'::text))
                 ->  Index Scan using pg_statistic_relid_att_inh_index on pg_statistic s_2  (cost=0.29..0.55 rows=1 width=6)
                       Index Cond: ((starelid = a_1.attrelid) AND (staattnum = a_1.attnum))
                       Filter: (stanullfrac <> '0'::double precision)
(57 rows)

Here's EXPLAIN ANALYZE on the "slow" server:

                                                                                                                                            QUERY PLAN                                                                                                                                            
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=1935395.99..4529587255.06 rows=4097995 width=188) (actual time=105387.412..409241.072 rows=96069 loops=1)
   Group Key: ns.nspname, tbl.relname, CASE WHEN ("substring"(split_part(version(), ' '::text, 2), '#\"[0-9]+.[0-9]+#\"%'::text, '#'::text) = ANY ('{8.0,8.1,8.2}'::text[])) THEN 27 ELSE 23 END, CASE WHEN ((version() ~ 'mingw32'::text) OR (version() ~ '64-bit'::text)) THEN 8 ELSE 4 END, $0
   InitPlan 1 (returns $0)
     ->  Result  (cost=0.00..0.02 rows=1 width=32) (actual time=0.019..0.021 rows=1 loops=1)
   ->  Merge Left Join  (cost=1935395.97..2063714.73 rows=4097995 width=176) (actual time=105360.960..264905.005 rows=13304172 loops=1)
         Merge Cond: ((ns.nspname = s.schemaname) AND (tbl.relname = s.tablename) AND (att.attname = s.attname))
         ->  Sort  (cost=1427752.62..1437997.60 rows=4097995 width=192) (actual time=99093.676..111983.008 rows=13304172 loops=1)
               Sort Key: ns.nspname, tbl.relname, att.attname
               Sort Method: external merge  Disk: 2629600kB
               ->  Hash Join  (cost=15335.36..599474.35 rows=4097995 width=192) (actual time=241.218..54844.544 rows=13304172 loops=1)
                     Hash Cond: (tbl.relnamespace = ns.oid)
                     ->  Hash Join  (cost=15246.41..588609.50 rows=4097995 width=132) (actual time=230.577..36508.735 rows=13304172 loops=1)
                           Hash Cond: (att.attrelid = tbl.oid)
                           ->  Seq Scan on pg_attribute att  (cost=0.00..532962.28 rows=15390604 width=68) (actual time=0.007..17032.056 rows=13870158 loops=1)
                                 Filter: (attnum > 0)
                                 Rows Removed by Filter: 1147617
                           ->  Hash  (cost=14050.06..14050.06 rows=95708 width=72) (actual time=230.037..230.038 rows=96069 loops=1)
                                 Buckets: 131072  Batches: 1  Memory Usage: 10782kB
                                 ->  Seq Scan on pg_class tbl  (cost=0.00..14050.06 rows=95708 width=72) (actual time=0.006..145.961 rows=96069 loops=1)
                                       Filter: (relkind = 'r'::"char")
                                       Rows Removed by Filter: 270863
                     ->  Hash  (cost=58.42..58.42 rows=2442 width=68) (actual time=10.609..10.610 rows=2393 loops=1)
                           Buckets: 4096  Batches: 1  Memory Usage: 266kB
                           ->  Seq Scan on pg_namespace ns  (cost=0.00..58.42 rows=2442 width=68) (actual time=0.007..8.580 rows=2393 loops=1)
         ->  Sort  (cost=507643.36..507707.46 rows=25640 width=200) (actual time=6266.990..6296.231 rows=37000 loops=1)
               Sort Key: s.schemaname, s.tablename, s.attname
               Sort Method: quicksort  Memory: 11365kB
               ->  Subquery Scan on s  (cost=23845.50..505765.73 rows=25640 width=200) (actual time=1224.098..6193.233 rows=37000 loops=1)
                     ->  Gather  (cost=23845.50..505509.33 rows=25640 width=401) (actual time=1224.095..6111.890 rows=37000 loops=1)
                           Workers Planned: 2
                           Workers Launched: 2
                           ->  Hash Left Join  (cost=22845.50..501945.33 rows=10683 width=401) (actual time=1102.681..6118.061 rows=12333 loops=3)
                                 Hash Cond: (c.relnamespace = n.oid)
                                 ->  Hash Join  (cost=22756.56..501828.29 rows=10683 width=140) (actual time=1097.227..6082.293 rows=12333 loops=3)
                                       Hash Cond: (s_1.starelid = c.oid)
                                       Join Filter: has_column_privilege(c.oid, a.attnum, 'select'::text)
                                       Rows Removed by Join Filter: 266947
                                       ->  Nested Loop  (cost=0.43..462367.09 rows=416791 width=82) (actual time=9.392..2744.783 rows=279280 loops=3)
                                             ->  Parallel Seq Scan on pg_statistic s_1  (cost=0.00..53955.73 rows=236437 width=14) (actual time=0.024..587.171 rows=279280 loops=3)
                                                   Filter: (NOT stainherit)
                                             ->  Index Scan using pg_attribute_relid_attnum_index on pg_attribute a  (cost=0.43..3.02 rows=2 width=70) (actual time=0.004..0.004 rows=1 loops=837840)
                                                   Index Cond: ((attrelid = s_1.starelid) AND (attnum = s_1.staattnum))
                                                   Filter: (NOT attisdropped)
                                       ->  Hash  (cost=14050.06..14050.06 rows=359445 width=72) (actual time=1071.775..1071.776 rows=366932 loops=3)
                                             Buckets: 262144  Batches: 2  Memory Usage: 20667kB
                                             ->  Seq Scan on pg_class c  (cost=0.00..14050.06 rows=359445 width=72) (actual time=0.008..530.066 rows=366932 loops=3)
                                                   Filter: ((NOT relrowsecurity) OR (NOT row_security_active(oid)))
                                 ->  Hash  (cost=58.42..58.42 rows=2442 width=68) (actual time=5.421..5.422 rows=2393 loops=3)
                                       Buckets: 4096  Batches: 1  Memory Usage: 266kB
                                       ->  Seq Scan on pg_namespace n  (cost=0.00..58.42 rows=2442 width=68) (actual time=0.026..2.686 rows=2393 loops=3)
   SubPlan 2
     ->  Aggregate  (cost=1104.73..1104.74 rows=1 width=8) (actual time=1.269..1.270 rows=1 loops=96069)
           ->  Nested Loop  (cost=1.56..1104.72 rows=1 width=401) (actual time=1.264..1.267 rows=0 loops=96069)
                 Join Filter: (c_1.oid = s_2.starelid)
                 ->  Nested Loop  (cost=1.14..1103.30 rows=1 width=10) (actual time=1.261..1.263 rows=0 loops=96069)
                       ->  Nested Loop  (cost=0.70..16.75 rows=1 width=4) (actual time=0.174..0.338 rows=1 loops=96069)
                             Join Filter: (c_1.relnamespace = n_1.oid)
                             Rows Removed by Join Filter: 57
                             ->  Index Scan using pg_class_relname_nsp_index on pg_class c_1  (cost=0.42..8.44 rows=1 width=8) (actual time=0.006..0.066 rows=58 loops=96069)
                                   Index Cond: (relname = tbl.relname)
                                   Filter: ((NOT relrowsecurity) OR (NOT row_security_active(oid)))
                             ->  Index Scan using pg_namespace_nspname_index on pg_namespace n_1  (cost=0.28..8.30 rows=1 width=4) (actual time=0.002..0.003 rows=1 loops=5531986)
                                   Index Cond: (nspname = ns.nspname)
                       ->  Index Scan using pg_attribute_relid_attnum_index on pg_attribute a_1  (cost=0.43..1084.53 rows=201 width=6) (actual time=0.920..0.921 rows=0 loops=96069)
                             Index Cond: (attrelid = c_1.oid)
                             Filter: ((NOT attisdropped) AND has_column_privilege(c_1.oid, attnum, 'select'::text))
                             Rows Removed by Filter: 144
                 ->  Index Scan using pg_statistic_relid_att_inh_index on pg_statistic s_2  (cost=0.43..1.07 rows=28 width=6) (actual time=0.004..0.005 rows=0 loops=43518)
                       Index Cond: ((starelid = a_1.attrelid) AND (staattnum = a_1.attnum))
                       Filter: (stanullfrac <> '0'::double precision)
                       Rows Removed by Filter: 0
 Planning time: 3.489 ms
 Execution time: 409948.049 ms
(73 rows)

And here's EXPLAIN ANALYZE on the "fast" server:

                                                                                                                                            QUERY PLAN                                                                                                                                            
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=18339.88..841921.77 rows=13555 width=188) (actual time=1279.413..7873.152 rows=2694 loops=1)
   Group Key: ns.nspname, tbl.relname, CASE WHEN ("substring"(split_part(version(), ' '::text, 2), '#\"[0-9]+.[0-9]+#\"%'::text, '#'::text) = ANY ('{8.0,8.1,8.2}'::text[])) THEN 27 ELSE 23 END, CASE WHEN ((version() ~ 'mingw32'::text) OR (version() ~ '64-bit'::text)) THEN 8 ELSE 4 END, $0
   InitPlan 1 (returns $0)
     ->  Result  (cost=0.00..0.02 rows=1 width=32) (actual time=0.028..0.029 rows=1 loops=1)
   ->  Merge Left Join  (cost=18339.86..18764.78 rows=13555 width=176) (actual time=1273.616..1967.242 rows=55107 loops=1)
         Merge Cond: ((ns.nspname = s.schemaname) AND (tbl.relname = s.tablename) AND (att.attname = s.attname))
         ->  Sort  (cost=5887.15..5921.04 rows=13555 width=192) (actual time=336.046..374.181 rows=55107 loops=1)
               Sort Key: ns.nspname, tbl.relname, att.attname
               Sort Method: quicksort  Memory: 16174kB
               ->  Hash Join  (cost=708.49..4956.84 rows=13555 width=192) (actual time=9.790..253.763 rows=55107 loops=1)
                     Hash Cond: (tbl.relnamespace = ns.oid)
                     ->  Hash Join  (cost=670.68..4883.30 rows=13555 width=132) (actual time=7.602..174.154 rows=55107 loops=1)
                           Hash Cond: (att.attrelid = tbl.oid)
                           ->  Seq Scan on pg_attribute att  (cost=0.00..4007.49 rows=78125 width=68) (actual time=0.004..68.395 rows=82849 loops=1)
                                 Filter: (attnum > 0)
                                 Rows Removed by Filter: 37659
                           ->  Hash  (cost=637.01..637.01 rows=2693 width=72) (actual time=7.570..7.571 rows=2694 loops=1)
                                 Buckets: 4096  Batches: 1  Memory Usage: 306kB
                                 ->  Seq Scan on pg_class tbl  (cost=0.00..637.01 rows=2693 width=72) (actual time=0.007..5.374 rows=2694 loops=1)
                                       Filter: (relkind = 'r'::"char")
                                       Rows Removed by Filter: 12831
                     ->  Hash  (cost=25.14..25.14 rows=1014 width=68) (actual time=2.124..2.124 rows=1036 loops=1)
                           Buckets: 2048 (originally 1024)  Batches: 1 (originally 1)  Memory Usage: 118kB
                           ->  Seq Scan on pg_namespace ns  (cost=0.00..25.14 rows=1014 width=68) (actual time=0.013..0.988 rows=1036 loops=1)
         ->  Sort  (cost=12452.71..12453.04 rows=131 width=200) (actual time=937.244..965.483 rows=36997 loops=1)
               Sort Key: s.schemaname, s.tablename, s.attname
               Sort Method: quicksort  Memory: 11364kB
               ->  Subquery Scan on s  (cost=5455.47..12448.10 rows=131 width=200) (actual time=669.874..878.407 rows=36997 loops=1)
                     ->  Hash Left Join  (cost=5455.47..12446.79 rows=131 width=401) (actual time=669.872..830.487 rows=36997 loops=1)
                           Hash Cond: (c.relnamespace = n.oid)
                           ->  Hash Join  (cost=5417.65..12408.63 rows=131 width=140) (actual time=668.274..774.427 rows=36997 loops=1)
                                 Hash Cond: ((s_1.starelid = c.oid) AND (s_1.staattnum = a.attnum))
                                 ->  Seq Scan on pg_statistic s_1  (cost=0.00..6793.45 rows=19622 width=14) (actual time=0.013..41.770 rows=45420 loops=1)
                                       Filter: (NOT stainherit)
                                 ->  Hash  (cost=4852.66..4852.66 rows=37666 width=142) (actual time=667.929..667.929 rows=69779 loops=1)
                                       Buckets: 131072 (originally 65536)  Batches: 1 (originally 1)  Memory Usage: 12881kB
                                       ->  Hash Join  (cost=831.03..4852.66 rows=37666 width=142) (actual time=27.496..596.287 rows=69779 loops=1)
                                             Hash Cond: (a.attrelid = c.oid)
                                             Join Filter: has_column_privilege(c.oid, a.attnum, 'select'::text)
                                             Rows Removed by Join Filter: 50729
                                             ->  Seq Scan on pg_attribute a  (cost=0.00..3724.99 rows=112999 width=70) (actual time=0.005..105.518 rows=120508 loops=1)
                                                   Filter: (NOT attisdropped)
                                             ->  Hash  (cost=637.01..637.01 rows=15521 width=72) (actual time=27.378..27.379 rows=15525 loops=1)
                                                   Buckets: 16384  Batches: 1  Memory Usage: 1705kB
                                                   ->  Seq Scan on pg_class c  (cost=0.00..637.01 rows=15521 width=72) (actual time=0.006..14.859 rows=15525 loops=1)
                                                         Filter: ((NOT relrowsecurity) OR (NOT row_security_active(oid)))
                           ->  Hash  (cost=25.14..25.14 rows=1014 width=68) (actual time=1.584..1.585 rows=1036 loops=1)
                                 Buckets: 2048 (originally 1024)  Batches: 1 (originally 1)  Memory Usage: 118kB
                                 ->  Seq Scan on pg_namespace n  (cost=0.00..25.14 rows=1014 width=68) (actual time=0.010..0.743 rows=1036 loops=1)
   SubPlan 2
     ->  Aggregate  (cost=60.64..60.66 rows=1 width=8) (actual time=2.157..2.157 rows=1 loops=2694)
           ->  Nested Loop  (cost=1.39..60.63 rows=1 width=401) (actual time=2.071..2.150 rows=6 loops=2694)
                 Join Filter: (c_1.oid = s_2.starelid)
                 ->  Nested Loop  (cost=1.10..60.06 rows=1 width=10) (actual time=2.023..2.072 rows=16 loops=2694)
                       ->  Nested Loop  (cost=0.69..16.73 rows=1 width=4) (actual time=1.006..2.008 rows=1 loops=2694)
                             Join Filter: (c_1.relnamespace = n_1.oid)
                             Rows Removed by Join Filter: 364
                             ->  Index Scan using pg_class_relname_nsp_index on pg_class c_1  (cost=0.41..8.43 rows=1 width=8) (actual time=0.005..0.367 rows=365 loops=2694)
                                   Index Cond: (relname = tbl.relname)
                                   Filter: ((NOT relrowsecurity) OR (NOT row_security_active(oid)))
                             ->  Index Scan using pg_namespace_nspname_index on pg_namespace n_1  (cost=0.28..8.29 rows=1 width=4) (actual time=0.002..0.002 rows=1 loops=983801)
                                   Index Cond: (nspname = ns.nspname)
                       ->  Index Scan using pg_attribute_relid_attnum_index on pg_attribute a_1  (cost=0.42..43.29 rows=4 width=6) (actual time=0.018..0.042 rows=16 loops=2694)
                             Index Cond: (attrelid = c_1.oid)
                             Filter: ((NOT attisdropped) AND has_column_privilege(c_1.oid, attnum, 'select'::text))
                             Rows Removed by Filter: 10
                 ->  Index Scan using pg_statistic_relid_att_inh_index on pg_statistic s_2  (cost=0.29..0.55 rows=1 width=6) (actual time=0.002..0.003 rows=0 loops=43518)
                       Index Cond: ((starelid = a_1.attrelid) AND (staattnum = a_1.attnum))
                       Filter: (stanullfrac <> '0'::double precision)
                       Rows Removed by Filter: 0
 Planning time: 7.599 ms
 Execution time: 7877.784 ms
(72 rows)
esabol commented 4 years ago

I just upgraded to check_postgres 2.25, and I'm still seeing this issue.

It's actually gotten worse. The bloat check on the one server is now taking 59 minutes!

machack666 commented 4 years ago

Hmm, can you paste the new plans? I don't recall offhand if the queries changed in 2.25, but would be good to get the latest data.

What is your default_statistics_target on each server?

esabol commented 4 years ago

Hmm, can you paste the new plans? I don't recall offhand if the queries changed in 2.25, but would be good to get the latest data.

I really doubt they've changed and nobody has replied to the plans I've already posted. I feel like you're just giving me busy work.

What is your default_statistics_target on each server?

100 on both servers. Unchanged from the default values.

machack666 commented 4 years ago

And dumb question, these servers are specced similarly and have been tuned for the server in question? Can you get a show all of the various servers (at least the settings that show up in https://pgtune.leopard.in.ua/#/)?

machack666 commented 4 years ago

@esabol looking at the "slow" queries, it appears the big time sink here is in the pg_attribute table, which is quite bloated itself. How many active relations are on this table, and have you had a lot of system column changes on the "slow" server (changing columns I would assume primarily)?

I would recommend trying to vacuum the pg_attribute and pg_class tables and see if this improves the performance.

esabol commented 4 years ago

@esabol looking at the "slow" queries, it appears the big time sink here is in the pg_attribute table, which is quite bloated itself. How many active relations are on this table, and have you had a lot of system column changes on the "slow" server (changing columns I would assume primarily)?

AFAIK, we almost never rename columns. We do rename tables frequently, however. Our typical procedure when updating a table is to ingest the new data into a table with a temporary name and then, in a transaction, drop the old table and rename the new copy of the table to the old table name.

But we also do the same thing on the "fast" server and just as frequently. (The two servers are intended to mirror each other. The "slow" server is queried more frequently though, and I think it has more namespaces.)

I would recommend trying to vacuum the pg_attribute and pg_class tables and see if this improves the performance.

OK, I did a 'vacuum' on both of those. I'll post tomorrow if there's any improvement.

Fwiw, I already did a vacuum full on every database on the slow server a month ago with no significant effect.

Thanks for replying!

esabol commented 4 years ago

I'll post tomorrow if there's any improvement.

Not much. The bloat query went from taking 59 minutes a couple weeks ago to taking 57 minutes today.

I think you're right about pg_attribute being the source of the problem though. select count(*) from pg_attribute returns 15028184, where as on the "fast" server, it returns only 124549. It's 120 times larger on the slow server than on the fast server. vacuum full pg_attribute; does not reduce the number of rows in pg_attribute.

And dumb question, these servers are specced similarly and have been tuned for the server in question? Can you get a show all of the various servers (at least the settings that show up in https://pgtune.leopard.in.ua/#/)?

Sorry, I just noticed this. I can't seem to access https://pgtune.leopard.in.ua/. It's not responding at the moment.

The hardware that the two servers run on are comparable. I ran pgtune on both years ago and got the same results. The postgresql.conf files are currently identical. The autovacuum* settings are unchanged from the defaults. Most settings are unchanged from the defaults, really. pgtune changed the following:

maintenance_work_mem = 320MB
checkpoint_completion_target = 0.7
work_mem = 24MB
wal_buffers = 4MB
shared_buffers = 1280MB
esabol commented 4 years ago

Ah, ha! Mystery solved. Looking in pg_attribute was key. One of our developers had a runaway process which created a hundred thousand tables which were supposed to be temporary but weren't and then "forgot" to clean them up. They were in a non-public schema, so I didn't notice them.

Thanks for your help! Closing issue.