dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

CouchDB stale replication alerts in WMStats #11918

Open amaltaro opened 7 months ago

amaltaro commented 7 months ago

Impact of the bug: WMAgent

Describe the bug: Every now and then an agent goes red in WMStats, reporting an error for the CouchServer component that looks something like:

    worker thread: undefined
    status: undefined
    last updated: NaN/NaN/NaN (undefined) NaN:NaN:NaN UTC
    pid: undefined
    error message:
    Replication from https://cmsweb.cern.ch/couchdb/workqueue/ to http://localhost:5984/workqueue_inbox/ is stale and it's lastupdate time was at: 1709300428

Most of the time this error goes away within one or two cycles of AgentStatusWatcher (~15 min), so we might consider a different logic for detecting and reporting stale database replication in the agent.
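One possible shape for such a logic, sketched here only as an illustration: report a replication as stale only after it has looked stale for several consecutive polling cycles, so that a condition that clears itself within one or two cycles never raises an alert. The class name `StaleReplicationTracker`, its `cycles_before_alert` parameter, and the default of 3 cycles are hypothetical and are not part of WMCore.

```python
class StaleReplicationTracker:
    """Hypothetical helper: alert only after N consecutive stale cycles."""

    def __init__(self, cycles_before_alert=3):
        # Assumption: 3 cycles is an illustrative threshold, not a tuned value.
        self.cycles_before_alert = cycles_before_alert
        self._stale_counts = {}

    def update(self, replication_id, is_stale):
        """Record one polling cycle; return True if an alert should be raised."""
        if not is_stale:
            # A healthy cycle resets the counter for this replication.
            self._stale_counts.pop(replication_id, None)
            return False
        count = self._stale_counts.get(replication_id, 0) + 1
        self._stale_counts[replication_id] = count
        return count >= self.cycles_before_alert


tracker = StaleReplicationTracker(cycles_before_alert=3)
# Two stale cycles followed by a recovery never trigger an alert.
results = [tracker.update("workqueue_inbox", stale) for stale in (True, True, False)]
```

With a threshold of 3, the error described above (gone after one or two cycles) would not turn the agent red, while a replication that stays stale keeps being reported from the third cycle onward.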

How to reproduce it: Unclear at the moment.

Expected behavior We should first investigate this further:

and based on that, find a new solution for monitoring couchdb database replication status in the agent and how we report it through AgentStatusWatcher.

Additional context and error message: None

vkuznet commented 7 months ago

Alan, I think we need to run a periodic job (either cron on a VM or something similar on k8s) that simply checks the CouchDB replication status. According to their documentation, it can be done either via

amaltaro commented 7 months ago

This check is already done by the AgentStatusWatcher component, as mentioned in the initial description. But it could be that the logic is outdated (dating from CouchDB 1.6.1) or that there is a better way to do it.

The APIs that you mentioned, _active_tasks or _scheduler/jobs, are likely the way forward. But this is a problem for another time.
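For reference, a rough sketch of what a check based on `_active_tasks` could look like. It assumes the endpoint returns a JSON list in which replication entries have `"type": "replication"` and carry `source`, `target` and an `updated_on` Unix timestamp. The function names, the one-hour threshold and the unauthenticated `localhost` URL are illustrative assumptions (recent CouchDB versions require admin credentials for this endpoint), and this is not the current AgentStatusWatcher code.

```python
import json
import time
from urllib.request import urlopen


def fetch_active_tasks(couch_url="http://localhost:5984"):
    """Return the parsed JSON list from the _active_tasks endpoint.

    Assumption: no authentication handling is shown here.
    """
    with urlopen(f"{couch_url}/_active_tasks", timeout=30) as resp:
        return json.load(resp)


def find_stale_replications(tasks, max_age_secs=3600, now=None):
    """Return replication tasks whose updated_on is older than max_age_secs."""
    now = time.time() if now is None else now
    stale = []
    for task in tasks:
        if task.get("type") != "replication":
            continue  # skip indexers, compactions, etc.
        updated_on = task.get("updated_on", 0)
        if now - updated_on > max_age_secs:
            stale.append({
                "source": task.get("source"),
                "target": task.get("target"),
                "updated_on": updated_on,
            })
    return stale


# Offline example with hand-written task entries (no server needed).
sample_tasks = [
    {"type": "replication", "source": "src_a", "target": "tgt_a", "updated_on": 1000},
    {"type": "replication", "source": "src_b", "target": "tgt_b", "updated_on": 9000},
    {"type": "indexer", "updated_on": 0},
]
stale = find_stale_replications(sample_tasks, max_age_secs=3600, now=10000)
```

Against a live server the call would be `find_stale_replications(fetch_active_tasks())`. A replication that is missing from `_active_tasks` altogether would need separate handling, for example by comparing the result against the list of expected replications.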