cybertec-postgresql / pgwatch2

PostgreSQL metrics monitor/dashboard
BSD 3-Clause "New" or "Revised" License

instance_up check doesn't run if pgwatch2 can't get DB version #788

Closed · jasonhill-ds closed 1 month ago

jasonhill-ds commented 4 months ago

I had pgwatch2 monitoring a DB that went down. I was running the instance_up check and alerting when is_up < 1, using the query below:

SELECT * FROM (
  SELECT
    $__timeGroup(time, 2m),
    dbname,
    max(case when data->>'is_up' = '1' then 1 else 0 end) as is_up
  FROM
    instance_up
  WHERE
    $__timeFilter(time)
  GROUP BY 1, 2
) x
ORDER BY 1

During this time, I restarted the pgwatch2 container. This appears to have caused pgwatch2 to lose track of what version the DB was, and therefore to stop running the instance_up check (meaning no more rows in the table):

2024/05/22 23:08:57 WARN main: Could not find PG version info for DB ******, skipping shutdown check of metric worker process for instance_up

Once the time window I was running the check over (now-15m to now) moved past the time of the last rows in the table, the query above returned no rows at all for that database (before then, it still returned rows with is_up = 0, so the alert fired). This results in the alert being resolved. Setting the alert state to alerting on "no data returned" or "all values null" doesn't work around the issue, because rows are still returned for the other "UP" databases; the down DB is simply missing from the result set.
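One Grafana-side workaround I considered is to treat "no rows in the window" as down by driving the query from an explicit list of expected databases, so a missing database still produces a row with is_up = 0. This is only a sketch; the VALUES list of dbnames is a hypothetical stand-in for whatever databases are actually monitored:

SELECT
  now() AS time,
  expected.dbname,
  -- NULL (no rows joined within the window) collapses to 0, i.e. "down"
  coalesce(max(case when i.data->>'is_up' = '1' then 1 else 0 end), 0) AS is_up
FROM (VALUES ('postgres_primary')) AS expected(dbname)  -- hypothetical dbname list
LEFT JOIN instance_up i
  ON i.dbname = expected.dbname
 AND $__timeFilter(i.time)
GROUP BY expected.dbname

The downside is keeping the list in sync with the monitored databases, but it would keep the alert firing even when the collector skips the check entirely.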

Is the DB version cached in pgwatch2 and then lost if the process is restarted?
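If it is only held in memory, a restart while the DB is down would leave pgwatch2 unable to re-fetch it, since the version can only be learned by connecting and asking. For illustration (not necessarily pgwatch2's exact query), something like:

-- Illustrative only: a common way for a collector to read the server version
SELECT current_setting('server_version_num')::int AS version_num;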

jasonhill-ds commented 4 months ago

Docker compose for the pgwatch2 container:

networks:
  monitoring:

volumes:
  prometheus-data: {}
  grafana-data: {}
  pw2_config: {}
  pw2_postgres: {}
  pw2_grafana: {}

services:
  pgwatch2:
    image: cybertec/pgwatch2-postgres:1.11.0
    container_name: pgwatch2
    ports:
      - '3001:3000'
      - '127.0.0.1:8081:8081'
      - '127.0.0.1:5432:5432'
      - '127.0.0.1:8080:8080'
    restart: unless-stopped
    volumes:
      - pw2_postgres:/var/lib/postgresql
      - pw2_grafana:/var/lib/grafana
      - pw2_config:/pgwatch2/persistent-config
    env_file:
      - pgwatcher2/pgwatcher2.env
    networks:
      - monitoring

pashagolub commented 4 months ago

Hello.

Can you please provide steps for how we can reproduce the problem? I'm lost in your explanation, sorry.

jasonhill-ds commented 3 months ago

Sorry it has taken so long for me to come back to this. To reproduce:

  1. Set up a pgwatch2 environment using Docker Compose and the Compose file above (or however you like).
  2. Add a PostgreSQL database to be monitored; make sure the "instance_up" metric is being collected every 60 seconds.
  3. Wait for some metrics to be gathered for the Postgres instance.
  4. Set up an alert for Instance_UP as below (I can't see a way to export the alert as YAML from pgwatch2's Grafana):
     Time range: now-10m to now
     Query to be alerted on (A):
       SELECT * FROM (
         SELECT
           $__timeGroup(time, 2m),
           dbname,
           max(case when data->>'is_up' = '1' then 1 else 0 end) as is_up
         FROM instance_up
         WHERE $__timeFilter(time)
         GROUP BY 1, 2
       ) x
     Reduction (B) operation: Classic condition; when last of A is below 1
     Alert condition: expression that will be alerted on: B
     Evaluate: every 2m, for 4m
  5. Shutdown the Postgres Instance that was set up for monitoring in step 2.
  6. Wait for alert to fire.
  7. Note that the alert sql will return 0 when the Postgres instance is down (as expected).
  8. Restart pgwatch2 container
  9. You will note that the instance_up check no longer runs for the Postgres DB that was stopped. The pgwatch2 logs will report "2024/06/06 23:46:28 WARN main: Could not find PG version info for DB postgres_primary, skipping shutdown check of metric worker process for instance_up".
  10. No rows are added to the pgwatch2 metrics table "instance_up" for this Postgres database (see the check query after this list).
  11. Wait for the alert time range to expire (10m) and you will see there are no data points left to run the alert check on.
  12. The alert will clear.

[screenshots: pgwatch2 - Instance_up]
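To confirm step 10, the following check against the pgwatch2 metrics DB shows the most recent instance_up sample per database (assuming pgwatch2's standard time/dbname/data metric table layout); after the restart, last_sample stops advancing for the stopped DB:

-- Most recent instance_up sample written per monitored database
SELECT dbname, max(time) AS last_sample
FROM instance_up
GROUP BY dbname
ORDER BY last_sample;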

github-actions[bot] commented 1 month ago

📅 This issue has been automatically marked as stale because of a lack of recent activity. It will be closed if no further activity occurs. ♻️ If you think there is new information allowing us to address the issue, please reopen it and provide us with updated details. 🤝 Thank you for your contributions.