CrunchyData / postgres-operator

Production PostgreSQL for Kubernetes, from high availability Postgres clusters to full-scale database-as-a-service.
https://access.crunchydata.com/documentation/postgres-operator/v5/
Apache License 2.0
3.88k stars 585 forks source link

Exporter Logging Errors for nodemx disk metrics and ccp_pg_stat_statements_reset #2990

Closed kyle-rf closed 1 year ago

kyle-rf commented 2 years ago

Please ensure you do the following when reporting a bug:

Note that some logs needed to troubleshoot may be found in the /pgdata/<CLUSTERNAME>/pg_log directory on your Postgres instance.

An incomplete bug report can lead to delays in resolving the issue or the closing of a ticket, so please be as detailed as possible.

If you are looking for general support, please view the support page for where you can ask questions.

Thanks for reporting the issue, we're looking forward to helping you!

Overview

Upon upgrading PGO to 4.7.4 and upgrading a pg cluster. The metrics container for the cluster began logging errors:

"Error running query on database "127.0.0.1:5432": ccp_pg_stat_statements_reset pq: function monitor.pg_stat_statements_reset_info(integer) does not exist"

"Error running query on database "127.0.0.1:5432": ccp_nodemx_data_disk pq: query-specified return tuple and function return type are not compatible"

Environment

Please provide the following details:

Steps to Reproduce

REPRO

Provide steps to get to the error condition: Have existing Prometheus Deployment Deploy PGO 4.6.2 Do NOT Deploy pgmonitor Create a cluster with metrics Ensure metrics are reaching the existing Prometheus instance should not be generating psql_scrape_errors yet Upgrade PGO to 4.7.4 Upgrade the pg cluster

EXPECTED

No scrape errors were expected. All metrics are going into non-pgmonitor Prometheus.

ACTUAL

Metrics are still reaching the non-pgmonitor Prometheus, but the exporter is now failing to retrieve some metrics with the following errors:

"Error running query on database "127.0.0.1:5432": ccp_pg_stat_statements_reset pq: function monitor.pg_stat_statements_reset_info(integer) does not exist"

"Error running query on database "127.0.0.1:5432": ccp_nodemx_data_disk pq: query-specified return tuple and function return type are not compatible"

Logs

See above

jkatz commented 2 years ago

Did you run ALTER EXTENSION pgnodemx UPDATE; as a superuser in each of your applicable databases?

kyle-rf commented 2 years ago

Did you run ALTER EXTENSION pgnodemx UPDATE; as a superuser in each of your applicable databases?

Thanks for the suggestion. I tried that but got the following error: ERROR: extension "pgnodemx" does not exist

kyle-rf commented 2 years ago

I listed the extensions:

\dx+
     Objects in extension "pg_stat_statements"
                Object description                 
---------------------------------------------------
 function pg_stat_statements(boolean)
 function pg_stat_statements_reset(oid,oid,bigint)
 view pg_stat_statements
(3 rows)

    Objects in extension "pgaudit"
          Object description           
---------------------------------------
 event trigger pgaudit_ddl_command_end
 event trigger pgaudit_sql_drop
 function pgaudit_ddl_command_end()
 function pgaudit_sql_drop()
(4 rows)

      Objects in extension "plpgsql"
            Object description             
-------------------------------------------
 function plpgsql_call_handler()
 function plpgsql_inline_handler(internal)
 function plpgsql_validator(oid)
 language plpgsql
andrewlecuyer commented 2 years ago

@kyle-rf I have been unable to reproduce the behavior you are seeing. Specifically, when I follow the steps you provided (i.e. to upgrade from v4.6.2 to v4.7.4), I see the pgnodemx extension both before and after the upgrade.

And since the upgrade process does not touch the database itself, the extension should definitely be there both before and after the upgrade (assuming metrics was properly enabled/deployed prior to the upgrade). So just to clarify - per the output you provided above, are you seeing the pgnodemx before the PGO and PG cluster upgrade to 4.7.4, but not after?

Additionally, I did want to suggest another approach. Specifically, if you remove metrics from a cluster:

$ pgo update cluster mycluster1 --disable-metrics
Adding or removing a metrics collection sidecar can cause downtime.
WARNING: Are you sure? (yes/no): yes
updated pgcluster mycluster1

...and then re-enable metrics, e.g.:

$ pgo update cluster mycluster1 --enable-metrics
Adding or removing a metrics collection sidecar can cause downtime.
WARNING: Are you sure? (yes/no): yes
updated pgcluster mycluster1

This should resolve the error you are seeing by ensuring all of the latest pgMonitor resources are created and/or updated (with the second step specifically ensuring metrics collection [and therefore the pgnodemx extension] is properly configured and enabled for the cluster).

kyle-rf commented 2 years ago

@andrewlecuyer I gave that a try, but exporter is still logging the error msg="Error running query on database \"127.0.0.1:5432\": ccp_nodemx_data_disk pq: query-specified return tuple and function return type are not compatible"

I have not seen the other error since though, so halfway there.

cristichiru commented 2 years ago

Context

  1. Deployed 4.x PGO and a postgres instance
  2. Upgraded PGO to 5.0.x then 5.1
  3. The new PostgresCluster CR is now using the latest exporter: registry.developers.crunchydata.com/crunchydata/crunchy-postgres-exporter:ubi8-5.1.0-0

Problem

time="2022-05-13T12:19:42Z" level=info msg="Error running query on database \"localhost:5432\": ccp_pg_stat_statements_top_total pq: column s.total_exec_time does not exist" source="postgres_exporter.go:1356"
time="2022-05-13T12:19:42Z" level=info msg="Error running query on database \"localhost:5432\": ccp_pg_stat_statements_top_mean pq: column s.mean_exec_time does not exist" source="postgres_exporter.go:1356"
time="2022-05-13T12:19:42Z" level=info msg="Error running query on database \"localhost:5432\": ccp_pg_stat_statements_total pq: column s.total_exec_time does not exist" source="postgres_exporter.go:1356"
time="2022-05-13T12:19:42Z" level=info msg="Error running query on database \"localhost:5432\": ccp_nodemx_disk_activity pq: query-specified return tuple and function return type are not compatible" source="postgres_exporter.go:1356"
time="2022-05-13T12:19:42Z" level=info msg="Error running query on database \"localhost:5432\": ccp_pg_stat_statements_top_max pq: column s.max_exec_time does not exist" source="postgres_exporter.go:1356"
time="2022-05-13T12:19:42Z" level=info msg="Error running query on database \"localhost:5432\": ccp_nodemx_data_disk pq: query-specified return tuple and function return type are not compatible" source="postgres_exporter.go:1356"
time="2022-05-13T12:19:42Z" level=error msg="queryNamespaceMappings returned 6 errors" source="postgres_exporter.go:1474"

Investigation and workaround

Later I found this thread and noticed ALTER EXTENSION pgnodemx UPDATE;

But isn't the operator supposed to do this?

benjaminjb commented 1 year ago

The https://github.com/CrunchyData/postgres-operator/pull/3400 should resolve these issues and is planned to be available in the next release, but not currently planned to be backpatched to v4. As noted, the manual fix should be to update the extensions used by the exporter, namely pgnodemx and pg_stat_statements.

dzabel commented 1 year ago

@benjaminjb Sorry for writing in a closed issue, but we are seeing similar errors in our operator based openshift installation of crunchy. Operator version 5.2.0. - postgresoperator.v5.2.0.

time="2022-12-08T09:23:56Z" level=info msg="Established new database connection to \"localhost:5432\"." source="postgres_exporter.go:878"
time="2022-12-08T09:23:57Z" level=info msg="Semantic Version Changed on \"localhost:5432\": 0.0.0 -> 14.5.0" source="postgres_exporter.go:1405"
time="2022-12-08T09:24:01Z" level=info msg="Error running query on database \"localhost:5432\": ccp_nodemx_disk_activity pq: query-specified return tuple and function return type are not compatible" source="postgres_exporter.go:1356"
time="2022-12-08T09:24:02Z" level=error msg="queryNamespaceMappings returned 1 errors" source="postgres_exporter.go:1474"
time="2022-12-08T09:24:02Z" level=info msg="Starting Server: :9187" source="postgres_exporter.go:1672"
time="2022-12-08T09:24:03Z" level=info msg="Error running query on database \"localhost:5432\": ccp_nodemx_disk_activity pq: query-specified return tuple and function return type are not compatible" source="postgres_exporter.go:1356"
time="2022-12-08T09:24:03Z" level=error msg="queryNamespaceMappings returned 1 errors" source="postgres_exporter.go:1474"

Should this be fixed in our operator version? I have tried to find out in which operator version the fix is included, but unfortunally i was not able to match the repo tags to an operator version. Maybe you can give me a hint how to match the repo tags to an operator version.

The ALTER EXTENSION pgnodemx UPDATE; is able to fix the issue, but as mentioned before, this should be done by the operator.

benjaminjb commented 1 year ago

Hey @dzabel this won't be fixed in 5.2.0, but should be fixed in 5.3.0, which is coming soon -- but I don't know how soon. I'll drop a note here as soon as it is, just so you know.

benjaminjb commented 1 year ago

Hey, just wanted to drop a line and say that 5.3.0 is out for our open source users -- hope that helps!