Open amaltaro opened 7 months ago
Alan, I think we need to run a periodic job (either a cron on the VM or something similar on k8s) which will simply check CouchDB replication. According to their documentation, it can be done via either:
curl http://user:password@localhost:5984/_active_tasks
curl http://user:password@localhost:5984/_scheduler/jobs
If you run these commands on any WMA node, e.g. vocms0282, and pipe the output through jq,
you'll see plenty of information we can use. For instance, from the /_scheduler/jobs
API we have a pretty decent history:
{
  "database": "_replicator",
  "id": "f741590277be7d4df3e8ccf16f00154b+continuous",
  "pid": "<0.29314.1646>",
  "source": "http://localhost:5984/workqueue_inbox/",
  "target": "https://cmsweb.cern.ch/couchdb/workqueue/",
  "user": null,
  "doc_id": "c95213308e4192e72ab945b93b0016bf",
  "info": {
    "revisions_checked": 197375,
    "missing_revisions_found": 184718,
    "docs_read": 184718,
    "docs_written": 184718,
    "changes_pending": 0,
    "doc_write_failures": 0, ...
  },
  "history": [
    {
      "timestamp": "2024-02-26T11:59:44Z",
      "type": "started"
    },
    {
      "timestamp": "2024-02-26T11:59:44Z",
      "type": "crashed",
      "reason": "{http_request_failed,\"PUT\",\n \"https://cmsweb.cern.ch/couchdb/workqueue/_local/f741590277be7d4df3e8ccf16f00154b\",\n {error,sel_conn_closed}}"
    },
    {
      "timestamp": "2024-02-14T07:41:00Z",
      "type": "started"
    },
    {
      "timestamp": "2024-01-25T12:56:36Z",
      "type": "started"
    },
    {
      "timestamp": "2024-01-25T12:56:36Z",
      "type": "crashed",
      "reason": "{http_request_failed,\"PUT\",\n \"https://cmsweb.cern.ch/couchdb/workqueue/_local/f741590277be7d4df3e8ccf16f00154b\",\n {error,sel_conn_closed}}"
    },
    {
      "timestamp": "2024-01-24T11:41:57Z",
      "type": "started"
    },
    {
      "timestamp": "2024-01-24T11:41:57Z",
      "type": "crashed",
      "reason": "{http_request_failed,\"GET\",\"https://cmsweb.cern.ch/couchdb/workqueue/\",\n {error,{error,req_timedout}}}"
    },
    {
      "timestamp": "2024-01-24T11:11:52Z",
      "type": "started"
    },
    {
      "timestamp": "2024-01-24T11:11:52Z",
      "type": "crashed",
      "reason": "{http_request_failed,\"GET\",\"https://cmsweb.cern.ch/couchdb/workqueue/\",\n {error,{error,req_timedout}}}"
    },
    {
      "timestamp": "2024-01-24T10:52:31Z",
      "type": "started"
    },
    {
      "timestamp": "2024-01-24T10:52:31Z",
      "type": "crashed",
      "reason": "{http_request_failed,\"GET\",\"https://cmsweb.cern.ch/couchdb/workqueue/\",\n {error,{error,req_timedout}}}"
    },
    {
      "timestamp": "2024-01-24T10:28:44Z",
      "type": "started"
    },
    {
      "timestamp": "2024-01-24T10:28:44Z",
      "type": "crashed",
      "reason": "{http_request_failed,\"GET\",\"https://cmsweb.cern.ch/couchdb/workqueue/\",\n {error,{error,req_timedout}}}"
    },
    {
      "timestamp": "2024-01-24T09:49:58Z",
      "type": "started"
    },
    {
      "timestamp": "2024-01-24T09:49:58Z",
      "type": "crashed",
      "reason": "{http_request_failed,\"GET\",\"https://cmsweb.cern.ch/couchdb/workqueue/\",\n {error,{error,req_timedout}}}"
    }
  ],
  "node": "couchdb@127.0.0.1",
  "start_time": "2024-01-22T11:24:05Z"
}
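The check described above could be sketched in Python roughly as follows. The base URL matches the curl commands earlier; the credentials, their source, and the helper names are illustrative assumptions (a real script would parse them out of the WMA config, as discussed below):

```python
import base64
import json
import urllib.request


def fetch_scheduler_jobs(base_url, user, passwd):
    """Fetch replication jobs from CouchDB's _scheduler/jobs endpoint."""
    req = urllib.request.Request(f"{base_url}/_scheduler/jobs")
    # basic auth; urllib does not honor user:pass embedded in the URL
    token = base64.b64encode(f"{user}:{passwd}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["jobs"]


def crashed_jobs(jobs):
    """Return jobs whose most recent history event is a crash.

    CouchDB prepends new events, so history[0] is the latest one
    (visible in the sample above: newest timestamps come first).
    """
    return [j for j in jobs
            if j.get("history") and j["history"][0].get("type") == "crashed"]


# Usage sketch (credentials assumed, not real):
# for job in crashed_jobs(fetch_scheduler_jobs("http://localhost:5984", "user", "password")):
#     print(job["id"], job["history"][0].get("reason", ""))
```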
In fact, if someone grabs this history and pushes it to AlertManager, we will have an appropriate alert on the MM (couchdb) channel. All that is required is curl or Python plus amtool. Of course, this also requires proper CouchDB credentials, but I think we can get them by parsing the WMA config within the script itself. I didn't check how often the history updates, but at least it gives a full picture of what happens for each replication document we have in the local couch.
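The amtool part of that could look roughly like this. It is a sketch only: the label names are invented, and it assumes amtool is on the PATH and AlertManager listens on the default port 9093 (neither is stated in this issue):

```python
import subprocess


def alert_labels(job):
    """Build AlertManager labels for a crashed replication job.

    The label set here is purely illustrative; the real alert
    routing for the MM (couchdb) channel would dictate the names.
    """
    return {
        "alertname": "CouchDBReplicationCrashed",  # assumed alert name
        "source": job.get("source", "unknown"),
        "target": job.get("target", "unknown"),
        "doc_id": job.get("doc_id", "unknown"),
    }


def push_alert(job, am_url="http://localhost:9093"):
    """Fire an alert via 'amtool alert add label=value ...'."""
    labels = alert_labels(job)
    args = ["amtool", "alert", "add"]
    args += [f"{k}={v}" for k, v in labels.items()]
    args += [f"--alertmanager.url={am_url}"]  # URL is an assumption
    subprocess.run(args, check=True)
```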
This check is already done by the AgentStatusWatcher component, as mentioned in the initial description. But it could be that the logic is outdated (from CouchDB 1.6.1), or that there is a better way to get at that information.
The APIs that you mentioned are likely the way forward: _active_tasks or _scheduler/jobs. But this is a problem for another time.
Impact of the bug: WMAgent
Describe the bug: Every now and then an agent goes red in WMStats, reporting an error for the CouchServer component, which says something like:
Most of the time this error goes away within one or two cycles of AgentStatusWatcher (~15min). So we might consider a different logic for detecting and reporting stale database replication in the agent.
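One possible "different logic" is a simple grace period: only flag a replication as broken once it has been seen crashed for several consecutive AgentStatusWatcher cycles, so errors that recover within a cycle or two never surface. A minimal sketch; the class name and threshold value are illustrative, not current AgentStatusWatcher behavior:

```python
class StaleReplicationTracker:
    """Flag a replication doc only after `threshold` consecutive
    polling cycles in which it was observed crashed."""

    def __init__(self, threshold=3):
        self.threshold = threshold  # illustrative default
        self.counts = {}  # doc_id -> consecutive crashed cycles

    def observe(self, doc_id, crashed):
        """Record one cycle's observation; return True if doc_id
        should now be reported as a stale replication."""
        if not crashed:
            # replication recovered: reset its counter
            self.counts.pop(doc_id, None)
            return False
        self.counts[doc_id] = self.counts.get(doc_id, 0) + 1
        return self.counts[doc_id] >= self.threshold
```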
How to reproduce it: Unclear at the moment.
Expected behavior: We should first investigate this further:
and based on that, find a new solution for monitoring CouchDB database replication status in the agent, and for how we report it through AgentStatusWatcher.
Additional context and error message: None