pg_stat_all_tables info not reliable (or direct execute problem).

yazun commented 3 years ago

We rely on updating stats on the coordinators on a global view by comparing pg_stat_all_tables differences between datanode and coordinator. Noticing however very worrisome, erratic behaviour of the pg_stat_all_tables values. Below you can see that values returned are either ok or nulls - they query is executed in second intervals and seem randomy returning proper values or nothing. This basically breaks coordinators update (analyze(coordinator).

Do you have any idea why that could be happening? - Seems like a caching problem of some stale pool connection maybe? How to fix it?

Thanks

Time: 57.723 ms
(dr3_ops_cs36@gaiadb12i:55431) [surveys] > execute direct on (datanode4)$$select * from pg_stat_all_tables where relname ~ 'tbl_partition_1' $$;
  relid  |    schemaname     |     relname     | seq_scan | seq_tup_read | idx_scan | idx_tup_fetch | n_tup_ins | n_tup_upd | n_tup_del | n_tup_hot_upd | n_live_tup | n_dead_tup | n_mod_since_analyze |          last_vacuum          | last_autovacuum |         last_analyze         | last_autoanalyze | vacuum_count | autovacuum_count | analyze_count | autoanalyze_count
---------+-------------------+--------------------------------------+----------+--------------+----------+---------------+-----------+-----------+-----------+---------------+------------+------------+---------------------+-------------------------------+-----------------+------------------------------+------------------+--------------+------------------+---------------+-------------------
 6369512 | dr3_ops_cs36_part | tbl_partition_1 |        0 |            0 |  5813523 |       2928286 |         0 |         0 |         0 |             0 |   11498085 |          0 |                   0 | 2021-01-14 09:47:33.141461+01 | [null]          | 2021-01-14 09:49:35.76404+01 | [null]           |            3 |                0 |             3 |                 0
(1 row)

Time: 46.543 ms
(dr3_ops_cs36@gaiadb12i:55431) [surveys] > execute direct on (datanode4)$$select * from pg_stat_all_tables where relname ~ 'tbl_partition_1' $$;
  relid  |    schemaname     |     relname     | seq_scan | seq_tup_read | idx_scan | idx_tup_fetch | n_tup_ins | n_tup_upd | n_tup_del | n_tup_hot_upd | n_live_tup | n_dead_tup | n_mod_since_analyze |          last_vacuum          | last_autovacuum |         last_analyze         | last_autoanalyze | vacuum_count | autovacuum_count | analyze_count | autoanalyze_count
---------+-------------------+--------------------------------------+----------+--------------+----------+---------------+-----------+-----------+-----------+---------------+------------+------------+---------------------+-------------------------------+-----------------+------------------------------+------------------+--------------+------------------+---------------+-------------------
 6369512 | dr3_ops_cs36_part | tbl_partition_1 |        0 |            0 |  5813523 |       2928286 |         0 |         0 |         0 |             0 |   11498085 |          0 |                   0 | 2021-01-14 09:47:33.141461+01 | [null]          | 2021-01-14 09:49:35.76404+01 | [null]           |            3 |                0 |             3 |                 0
(1 row)

Time: 58.182 ms
(dr3_ops_cs36@gaiadb12i:55431) [surveys] > execute direct on (datanode4)$$select * from pg_stat_all_tables where relname ~ 'tbl_partition_1' $$;
  relid  |    schemaname     |     relname     | seq_scan | seq_tup_read | idx_scan | idx_tup_fetch | n_tup_ins | n_tup_upd | n_tup_del | n_tup_hot_upd | n_live_tup | n_dead_tup | n_mod_since_analyze | last_vacuum | last_autovacuum | last_analyze | last_autoanalyze | vacuum_count | autovacuum_count | analyze_count | autoanalyze_count
---------+-------------------+--------------------------------------+----------+--------------+----------+---------------+-----------+-----------+-----------+---------------+------------+------------+---------------------+-------------+-----------------+--------------+------------------+--------------+------------------+---------------+-------------------
 6369512 | dr3_ops_cs36_part | tbl_partition_1 |        0 |            0 |        0 |             0 |         0 |         0 |         0 |             0 |          0 |          0 |                   0 | [null]      | [null]          | [null]       | [null]           |            0 |                0 |             0 |                 0
(1 row)

Time: 50.242 ms
(dr3_ops_cs36@gaiadb12i:55431) [surveys] > execute direct on (datanode4)$$select * from pg_stat_all_tables where relname ~ 'tbl_partition_1' $$;
  relid  |    schemaname     |     relname     | seq_scan | seq_tup_read | idx_scan | idx_tup_fetch | n_tup_ins | n_tup_upd | n_tup_del | n_tup_hot_upd | n_live_tup | n_dead_tup | n_mod_since_analyze | last_vacuum | last_autovacuum | last_analyze | last_autoanalyze | vacuum_count | autovacuum_count | analyze_count | autoanalyze_count
---------+-------------------+--------------------------------------+----------+--------------+----------+---------------+-----------+-----------+-----------+---------------+------------+------------+---------------------+-------------+-----------------+--------------+------------------+--------------+------------------+---------------+-------------------
 6369512 | dr3_ops_cs36_part | tbl_partition_1 |        0 |            0 |        0 |             0 |         0 |         0 |         0 |             0 |          0 |          0 |                   0 | [null]      | [null]          | [null]       | [null]           |            0 |                0 |             0 |                 0
(1 row)

Time: 48.042 ms
(dr3_ops_cs36@gaiadb12i:55431) [surveys] > execute direct on (datanode4)$$select * from pg_stat_all_tables where relname ~ 'tbl_partition_1' $$;
  relid  |    schemaname     |     relname     | seq_scan | seq_tup_read | idx_scan | idx_tup_fetch | n_tup_ins | n_tup_upd | n_tup_del | n_tup_hot_upd | n_live_tup | n_dead_tup | n_mod_since_analyze |          last_vacuum          | last_autovacuum |         last_analyze         | last_autoanalyze | vacuum_count | autovacuum_count | analyze_count | autoanalyze_count
---------+-------------------+--------------------------------------+----------+--------------+----------+---------------+-----------+-----------+-----------+---------------+------------+------------+---------------------+-------------------------------+-----------------+------------------------------+------------------+--------------+------------------+---------------+-------------------
 6369512 | dr3_ops_cs36_part | tbl_partition_1 |        0 |            0 |  5813523 |       2928286 |         0 |         0 |         0 |             0 |   11498085 |          0 |                   0 | 2021-01-14 09:47:33.141461+01 | [null]          | 2021-01-14 09:49:35.76404+01 | [null]           |            3 |                0 |             3 |                 0

yazun commented 3 years ago

It happened only for a single dn so far, but from any coordinator.

yazun commented 3 years ago

And bouncing DB does not help.

bethding-database commented 3 years ago

@yazun I try to reproduce this promble, but it hasn't happend yet. I want to confirm some questions: 1、 Whether tbl_partition_1 is partition table or not? 2、Is there "corrupted statistics file" or "out of file descriptors" in datanode4's log? 3、Is it shows that stats collector process exitting when the promble occurs in datanode4's log? And if you have a simple way to reproduce the promble, please let me know. Thanks.

beth-database commented 2 years ago

@yazun Have you solved the problem？Or more clues for it? Do you have a way to reproduce this problem？

kali-brandwatch commented 7 months ago

Rescuing a very old thread just because I landed here in search for answers, I had a very similar issue and I got the erratic results resolved by means of issuing an ANALYZE VERBOSE my_table;

Tencent / TBase

pg_stat_all_tables info not reliable (or direct execute problem). #92