Open amaltaro opened 7 months ago
Alan, I think we need to run a periodic job (either a cron on the VM or something similar on k8s) which will simply check CouchDB replication. According to their documentation, it can be done via either:
curl http://user:password@localhost:5984/_active_tasks
curl http://user:password@localhost:5984/_scheduler/jobs
If you run these commands on any WMA node, e.g. vocms0282, and pipe the output through jq,
you'll see plenty of information we can use. For instance, from the /_scheduler/jobs
API we have a pretty decent history:
{
  "database": "_replicator",
  "id": "f741590277be7d4df3e8ccf16f00154b+continuous",
  "pid": "<0.29314.1646>",
  "source": "http://localhost:5984/workqueue_inbox/",
  "target": "https://cmsweb.cern.ch/couchdb/workqueue/",
  "user": null,
  "doc_id": "c95213308e4192e72ab945b93b0016bf",
  "info": {
    "revisions_checked": 197375,
    "missing_revisions_found": 184718,
    "docs_read": 184718,
    "docs_written": 184718,
    "changes_pending": 0,
    "doc_write_failures": 0, ...
  },
  "history": [
    {
      "timestamp": "2024-02-26T11:59:44Z",
      "type": "started"
    },
    {
      "timestamp": "2024-02-26T11:59:44Z",
      "type": "crashed",
      "reason": "{http_request_failed,\"PUT\",\n \"https://cmsweb.cern.ch/couchdb/workqueue/_local/f741590277be7d4df3e8ccf16f00154b\",\n {error,sel_conn_closed}}"
    },
    {
      "timestamp": "2024-02-14T07:41:00Z",
      "type": "started"
    },
    {
      "timestamp": "2024-01-25T12:56:36Z",
      "type": "started"
    },
    {
      "timestamp": "2024-01-25T12:56:36Z",
      "type": "crashed",
      "reason": "{http_request_failed,\"PUT\",\n \"https://cmsweb.cern.ch/couchdb/workqueue/_local/f741590277be7d4df3e8ccf16f00154b\",\n {error,sel_conn_closed}}"
    },
    {
      "timestamp": "2024-01-24T11:41:57Z",
      "type": "started"
    },
    {
      "timestamp": "2024-01-24T11:41:57Z",
      "type": "crashed",
      "reason": "{http_request_failed,\"GET\",\"https://cmsweb.cern.ch/couchdb/workqueue/\",\n {error,{error,req_timedout}}}"
    },
    {
      "timestamp": "2024-01-24T11:11:52Z",
      "type": "started"
    },
    {
      "timestamp": "2024-01-24T11:11:52Z",
      "type": "crashed",
      "reason": "{http_request_failed,\"GET\",\"https://cmsweb.cern.ch/couchdb/workqueue/\",\n {error,{error,req_timedout}}}"
    },
    {
      "timestamp": "2024-01-24T10:52:31Z",
      "type": "started"
    },
    {
      "timestamp": "2024-01-24T10:52:31Z",
      "type": "crashed",
      "reason": "{http_request_failed,\"GET\",\"https://cmsweb.cern.ch/couchdb/workqueue/\",\n {error,{error,req_timedout}}}"
    },
    {
      "timestamp": "2024-01-24T10:28:44Z",
      "type": "started"
    },
    {
      "timestamp": "2024-01-24T10:28:44Z",
      "type": "crashed",
      "reason": "{http_request_failed,\"GET\",\"https://cmsweb.cern.ch/couchdb/workqueue/\",\n {error,{error,req_timedout}}}"
    },
    {
      "timestamp": "2024-01-24T09:49:58Z",
      "type": "started"
    },
    {
      "timestamp": "2024-01-24T09:49:58Z",
      "type": "crashed",
      "reason": "{http_request_failed,\"GET\",\"https://cmsweb.cern.ch/couchdb/workqueue/\",\n {error,{error,req_timedout}}}"
    }
  ],
  "node": "couchdb@127.0.0.1",
  "start_time": "2024-01-22T11:24:05Z"
}
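The check described above could be sketched in Python roughly as follows. The base URL matches the curl commands earlier; the credentials, their source, and the helper names are illustrative assumptions (a real script would parse them out of the WMA config, as discussed below):

```python
import base64
import json
import urllib.request


def fetch_scheduler_jobs(base_url, user, passwd):
    """Fetch replication jobs from CouchDB's _scheduler/jobs endpoint."""
    req = urllib.request.Request(f"{base_url}/_scheduler/jobs")
    # basic auth; urllib does not honor user:pass embedded in the URL
    token = base64.b64encode(f"{user}:{passwd}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["jobs"]


def crashed_jobs(jobs):
    """Return jobs whose most recent history event is a crash.

    CouchDB prepends new events, so history[0] is the latest one
    (visible in the sample above: newest timestamps come first).
    """
    return [j for j in jobs
            if j.get("history") and j["history"][0].get("type") == "crashed"]


# Usage sketch (credentials assumed, not real):
# for job in crashed_jobs(fetch_scheduler_jobs("http://localhost:5984", "user", "password")):
#     print(job["id"], job["history"][0].get("reason", ""))
```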
In fact, if someone grabs this history and pushes it to AlertManager, we will have an appropriate alert on the MM (couchdb) channel. All that is required is curl or Python plus amtool. Of course, this also requires proper CouchDB credentials, but I think we can get them by parsing the WMA config within the script itself. I didn't check how often the history updates, but at least it gives a full picture of what happens for each replication document we have in the local couch.
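The amtool part of that could look roughly like this. It is a sketch only: the label names are invented, and it assumes amtool is on the PATH and AlertManager listens on the default port 9093 (neither is stated in this issue):

```python
import subprocess


def alert_labels(job):
    """Build AlertManager labels for a crashed replication job.

    The label set here is purely illustrative; the real alert
    routing for the MM (couchdb) channel would dictate the names.
    """
    return {
        "alertname": "CouchDBReplicationCrashed",  # assumed alert name
        "source": job.get("source", "unknown"),
        "target": job.get("target", "unknown"),
        "doc_id": job.get("doc_id", "unknown"),
    }


def push_alert(job, am_url="http://localhost:9093"):
    """Fire an alert via 'amtool alert add label=value ...'."""
    labels = alert_labels(job)
    args = ["amtool", "alert", "add"]
    args += [f"{k}={v}" for k, v in labels.items()]
    args += [f"--alertmanager.url={am_url}"]  # URL is an assumption
    subprocess.run(args, check=True)
```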
This check is already done by the AgentStatusWatcher component, as mentioned in the initial description. But it could be that the logic is outdated (from CouchDB 1.6.1), or that there is a better way to get at that information.
The APIs that you mentioned are likely the way forward: _active_tasks or _scheduler/jobs. But this is a problem for another time.
Impact of the bug: WMAgent
Describe the bug: Every now and then an agent goes red in WMStats, reporting an error for the CouchServer component, which says something like:
Most of the time this error goes away within one or two cycles of AgentStatusWatcher (~15min). So we might consider a different logic for detecting and reporting stale database replication in the agent.
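One possible "different logic" is a simple grace period: only flag a replication as broken once it has been seen crashed for several consecutive AgentStatusWatcher cycles, so errors that recover within a cycle or two never surface. A minimal sketch; the class name and threshold value are illustrative, not current AgentStatusWatcher behavior:

```python
class StaleReplicationTracker:
    """Flag a replication doc only after `threshold` consecutive
    polling cycles in which it was observed crashed."""

    def __init__(self, threshold=3):
        self.threshold = threshold  # illustrative default
        self.counts = {}  # doc_id -> consecutive crashed cycles

    def observe(self, doc_id, crashed):
        """Record one cycle's observation; return True if doc_id
        should now be reported as a stale replication."""
        if not crashed:
            # replication recovered: reset its counter
            self.counts.pop(doc_id, None)
            return False
        self.counts[doc_id] = self.counts.get(doc_id, 0) + 1
        return self.counts[doc_id] >= self.threshold
```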
How to reproduce it: Unclear at the moment.
Expected behavior: We should first investigate this further:
and based on that, find a new solution for monitoring CouchDB database replication status in the agent, and for how we report it through AgentStatusWatcher.
Additional context and error message: None