dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

Centralize logging system for WM #11929

Open vkuznet opened 8 months ago

vkuznet commented 8 months ago

Impact of the new feature A uniform, centralized logging system would bring many benefits to data operations, debugging, and monitoring of the various WM services, components, workflows, etc.

Is your feature request related to a problem? Please describe. At the moment we have de-centralized, non-overlapping logging solutions, like LogDB, file logs within WMAgents, component logs for WM services, etc., which are very tedious to navigate and require different access patterns. From an end-user point of view, e.g. data-ops, it is very cumbersome to navigate them and find specific information about different topics.

Describe the solution you'd like We may adopt the CMS Monitoring system based on CERN brokers (AMQ) and the MONIT backend, using Elastic/OpenSearch and HDFS for storing semi-structured (JSON) documents. Each log entry can be represented as a JSON document and injected into a central service, similar to WMArchive, which can proxy it to the MONIT backend. Here is a detailed plan for such a system:

  1. Logs can be represented as JSON documents, i.e. we may have a uniform logging format, e.g. {"service": SERVICE_NAME, "message": MESSAGE, "code": CODE, "timestamp": UNIX_TIME, "application": APPLICATION}
  2. Request MONIT topics, e.g. WMLogs, within the MONIT AMQ brokers, along with two backend streams: an ElasticSearch index and an HDFS area. The former can use a short retention policy, e.g. 1 month; the latter can hold logs for up to 13 months.
  3. Set up the CMSAMQProxy (a generalization of WMArchive) service on CMSWEB to act as a proxy between clients and the MONIT infrastructure; we already have its deployment for the CMSWEB k8s infrastructure, including a Docker image and k8s manifest
  4. Logs can be injected from distributed on- and off-site locations in a similar fashion to how we inject monitoring information, e.g. via WMArchive. We can either use the WMArchive code base to inject logs (JSON docs) or develop a stand-alone Logger class to use instead of the Python logger. If the latter, it can log to a file-based sink, to the HTTP CMSWEB AMQ proxy, or both.
  5. Gradually patch the various WM systems and services to inject their logs into MONIT
  6. Logs in Elastic/OpenSearch can be viewed in a browser, and they can be queried and sliced using the ElasticSearch query language; e.g. we may provide predefined queries for different use cases, such as searching for workflow evolution or finding a specific error code.
  7. Logs can be used for additional monitoring purposes, e.g. watching workflow progress.
  8. It is possible to have subscription channels where a person or application can subscribe to MONIT (Kafka) topics, similar to what we have with Rucio traces
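To make steps 1 and 4 concrete, here is a minimal sketch of how a stand-alone Logger could plug into the standard Python `logging` machinery: a formatter that emits the uniform JSON document proposed above, plus a handler that ships each document to an HTTP AMQ proxy. The class names (`WMJSONFormatter`, `AMQProxyHandler`) and the proxy URL are hypothetical placeholders, not existing WMCore/CMSAMQProxy APIs.

```python
import json
import logging
import time
import urllib.request


class WMJSONFormatter(logging.Formatter):
    """Render each log record as the proposed uniform JSON document:
    {"service", "message", "code", "timestamp", "application"}."""

    def __init__(self, service, application):
        super().__init__()
        self.service = service
        self.application = application

    def format(self, record):
        doc = {
            "service": self.service,
            "message": record.getMessage(),
            # optional error code, attached via logger.error(..., extra={"code": N})
            "code": getattr(record, "code", 0),
            "timestamp": int(time.time()),
            "application": self.application,
        }
        return json.dumps(doc)


class AMQProxyHandler(logging.Handler):
    """POST each JSON document to an HTTP AMQ proxy endpoint (hypothetical URL)."""

    def __init__(self, url):
        super().__init__()
        self.url = url

    def emit(self, record):
        data = self.format(record).encode("utf-8")
        req = urllib.request.Request(
            self.url, data=data, headers={"Content-Type": "application/json"})
        try:
            urllib.request.urlopen(req, timeout=5)
        except OSError:
            self.handleError(record)
```

A service could then attach both this handler and a plain `FileHandler` (using the same formatter) to one logger, covering the "file-based sink or HTTP proxy or both" option without touching call sites that already use the standard `logging` API.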

Describe alternatives you've considered Do nothing and use the existing systems

Additional context To implement this solution, we may require the following:

The migration can be done gradually without disrupting existing logging solutions; we may even start with an individual service/application and verify the whole workflow. If successful, we can gradually add new services/applications.

Some training may be required for end-users to get used to Elastic/OpenSearch queries. An additional CLI interface can be developed too, via proxy access to MONIT.

todor-ivanov commented 8 months ago

@vkuznet @amaltaro I want to make one really important remark, which we should always keep in mind while working on this.

vkuznet commented 8 months ago

@todor-ivanov , the MONIT infrastructure has "public" (open to the CERN network) and purely "private" (open to a restricted list) channels. The former is where we store our monitoring metrics/records, while the latter is where our Kubernetes log entries go, and it has restricted access. We don't need to hide anything, as the content will be restricted to the CERN network, i.e. it does not have public internet access; it will only be visible to CMS and, if we want, only to specific e-groups or users.