Open vkuznet opened 8 months ago
@vkuznet @amaltaro I want to make one really important remark, which we should always keep in mind while working on this.
@todor-ivanov, the MONIT infrastructure has "public" (open to the CERN network) and purely "private" (open to a restricted list) channels. The former is where we store our monitoring metrics/records, while the latter is where our Kubernetes log entries go, and it has restricted access. We don't need to hide anything, as the content will be restricted to the CERN network, i.e. it has no public internet access. It will be visible only to CMS and, if we want, only to specific e-groups or users.
Impact of the new feature A uniform, centralized logging system can bring many benefits to data operations, debugging, and monitoring of the various WM services, components, workflows, etc.
Is your feature request related to a problem? Please describe. At the moment we have decentralized, non-overlapping logging solutions, such as LogDB, file logs within WMAgents, component logs for WM services, etc., which are tedious to navigate and require different access patterns. From an end-user point of view, e.g. data-ops, it is very cumbersome to find specific information about different topics.
Describe the solution you'd like We may adopt the CMS Monitoring system based on CERN brokers (AMQ) and the MONIT backend, which uses Elastic/OpenSearch and HDFS to store semi-structured (JSON) documents. Each log entry can be represented as a JSON document and injected into a central service, similar to WMArchive, which can proxy it to the MONIT backend. Here is a detailed plan for such a system:
{"service": SERVICE_NAME, "message": MESSAGE, "code": CODE, "timestamp": UNIX_TIME, "application": APPLICATION}
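To make the proposed record format concrete, here is a minimal Python sketch of how a service could build and serialize one such log entry. The helper names (`make_log_entry`, `serialize`) and the example values are hypothetical, and the sketch assumes `timestamp` is an integer number of seconds since the epoch; it does not show how the document would be sent to the central service.

```python
import json
import time


def make_log_entry(service, message, code, application, timestamp=None):
    """Build one log record with the five fields proposed above.

    Hypothetical helper: the field names come from the proposal,
    everything else is an assumption made for illustration.
    """
    return {
        "service": service,
        "message": message,
        "code": code,
        # Assumption: UNIX_TIME means integer seconds since the epoch.
        "timestamp": int(time.time()) if timestamp is None else int(timestamp),
        "application": application,
    }


def serialize(entry):
    """Encode the record as a JSON string with a stable key order."""
    return json.dumps(entry, sort_keys=True)


# Placeholder values, for illustration only.
entry = make_log_entry(
    service="example-service",
    message="request processed",
    code=200,
    application="example-application",
    timestamp=1700000000,
)
payload = serialize(entry)
```

Keeping the record flat, with one value per key, should make it straightforward to filter on any field once the documents are indexed.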
Describe alternatives you've considered Do nothing and use the existing system
Additional context To implement this solution we may require the following:
The migration can be done gradually, without disrupting the existing logging solutions. We may even start with an individual service/application to verify the whole workflow and, if successful, gradually add new services/applications.
Some training may be required for end-users to get used to Elastic/OpenSearch queries. An additional CLI can also be developed via proxy access to MONIT.
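As an illustration of the kind of query end-users would need to learn, here is a hedged Python sketch that builds an Elastic/OpenSearch query DSL body selecting the log entries of one service within a time window. The function name `build_query` is hypothetical, and the sketch assumes the documents use the field names from the proposed record and that `service` is indexed as an exact-match (keyword) field; the actual index name and mapping in MONIT are not specified here.

```python
def build_query(service, start, end, size=100):
    """Return a query DSL body for one service and a time range.

    Hypothetical helper; assumes "service" is a keyword field and
    "timestamp" holds UNIX seconds, as in the proposed record.
    """
    return {
        "size": size,
        # Newest entries first.
        "sort": [{"timestamp": {"order": "desc"}}],
        "query": {
            "bool": {
                "filter": [
                    {"term": {"service": service}},
                    {"range": {"timestamp": {"gte": start, "lte": end}}},
                ]
            }
        },
    }


# Placeholder values, for illustration only.
body = build_query("example-service", start=1700000000, end=1700003600)
```

A CLI wrapper could accept the service name and time range as arguments and submit a body like this through the proxy, so that end-users do not have to write the query DSL by hand.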