dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

Push MSTransferor monitoring data to MonIT #9651

Open amaltaro opened 4 years ago

amaltaro commented 4 years ago

Impact of the new feature WMCore MSTransferor

Is your feature request related to a problem? Please describe. Besides a high-level overview of the service through the REST API, the only other way to see how the service is behaving is through its logs. We need something better in place, so that anyone can get a clearer picture of how the service is behaving.

Describe the solution you'd like We can probably use the metrics defined/discussed in this GH issue, https://github.com/dmwm/WMCore/issues/9528, and feed them with StompAMQ into the MonIT infrastructure.

In addition to those metrics, it might also be useful to push the storage quotas to Grafana. We could:

We should also discuss any other metrics that might make sense to monitor.

Describe alternatives you've considered

Additional context We can use the same ES index already in use by the other central services, like ReqMgr2 and Global Workqueue.

vkuznet commented 4 years ago

Alan, for this and another PR #9651 you'll need to request new MONIT end-points. For that, please open a CMSMONIT Jira ticket and we'll propagate it to the CERN MONIT team. You'll need to provide the following information though:

amaltaro commented 4 years ago

@vkuznet Valentin, I was planning to use the same topic/credentials as we use for ReqMgr2. The load isn't that big (I guess a few thousand documents every 5min or so). Do you see any problem with that approach?

vkuznet commented 4 years ago

The problem with using the same topic is that you'll end up with different doc structures, one representing ReqMgr2 and another MSTransferor/MSMonit. If the schema is the same, that is OK; if not, I suggest creating different topics. This is related to the internal ES schema used within the topic, and therefore to the ability to look up and plot the data.

amaltaro commented 4 years ago

I think having different doc structures is inevitable in this case. The reason is that we will be pushing different types of information, so different keys should be expected in those documents.

For instance, I'd like to push quota and usage for every single RSE/PNN. This will either be:
a) a single doc like: [{'RSEName': 'SiteA', 'Quota': 123, 'Usage': 1332}, {...etc}]
b) or multiple documents like 'RSEName': 'blah', 'Quota': 123

Then we have metrics for the component, and here I don't know if we can put all of them in the same document, or whether we will have to break them down.
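The two layouts for the quota data (options a and b above) differ only in whether the per-RSE entries travel inside one list or as separate documents. A minimal sketch of converting the first into the second, assuming the key names from the example above; the helper name `split_quota_doc` and the `SiteB` entry are hypothetical:

```python
def split_quota_doc(rse_list):
    """Turn one list-valued payload (option a) into one flat doc per RSE (option b)."""
    flat_docs = []
    for entry in rse_list:
        # Copy each entry so the caller's data is not mutated later on
        flat_docs.append(dict(entry))
    return flat_docs


# Option a: a single payload holding every RSE (SiteB is made up for illustration)
single_doc = [
    {'RSEName': 'SiteA', 'Quota': 123, 'Usage': 1332},
    {'RSEName': 'SiteB', 'Quota': 456, 'Usage': 78},
]

# Option b: one document per RSE, each holding only scalar values
flat_docs = split_quota_doc(single_doc)
for doc in flat_docs:
    print(doc)
```

With option b, every document has the same three scalar keys, which is the shape that is easiest to filter by RSE name.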

vkuznet commented 4 years ago

How about creating a schema that is suitable for both (multiple) use cases? For example, in one case some attributes can have "zero" values, i.e. be empty, while in another they will be filled in. This way you have a consistent schema and only supply different information. As I wrote, the MONIT team expects all docs within a single topic to be alike, because they need to create a schema for ES. The reason for having the same schema is simple: queries should apply to all docs, and if some docs do not provide certain fields the queries may fail.

vkuznet commented 4 years ago

Also, keep in mind that injecting simple docs rather than nested structures is preferable, since it makes the data easier to aggregate in ES. You should view your task not from an injection point of view, but rather in terms of how you'll look your data up again (and whether you want to aggregate it).
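The shared-schema suggestion above can be sketched as follows. This is only an illustration under stated assumptions: `UNION_SCHEMA`, `normalize`, and the `DocType` and `ExecutionTime` keys are hypothetical, and only `RSEName`, `Quota` and `Usage` come from the thread.

```python
# Hypothetical union schema: every key any document may carry, with a "zero" default
UNION_SCHEMA = {
    'DocType': '',         # hypothetical discriminator key
    'RSEName': '',
    'Quota': 0,
    'Usage': 0,
    'ExecutionTime': 0.0,  # hypothetical component metric
}


def normalize(doc, schema=UNION_SCHEMA):
    """Return a flat doc with every schema key present, filling gaps with defaults."""
    unknown = set(doc) - set(schema)
    if unknown:
        # Refuse keys the schema does not know about, so all docs stay alike
        raise KeyError("keys not in schema: %s" % sorted(unknown))
    full = dict(schema)
    full.update(doc)
    return full


quota_doc = normalize({'DocType': 'quota', 'RSEName': 'SiteA', 'Quota': 123, 'Usage': 1332})
metric_doc = normalize({'DocType': 'metrics', 'ExecutionTime': 12.45})
print(sorted(quota_doc) == sorted(metric_doc))
```

Both documents end up with the same set of keys, at the cost of carrying default values for the fields that do not apply.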

amaltaro commented 4 years ago

Valentin, I'm not sure I understood your suggestion. Are you saying that we could create a schema that would be the union of all possible keys that this service would send to MonIT, and that the keys not needed in a given document would just be set to a default value (0 or whatever)?

If so, yes, we would have a single schema for the documents posted, but we would also have a bunch of unneeded information being transferred over the network and stored, adding complexity to any future navigation through the documents.

Perhaps it's better to discuss this with something more concrete. Here are 3 documents that I just came up with (and that would be good candidates to be published):

For site/RSE quota (around 100 docs every 10min):

 {'agent_url': 'reqmgr2_ms_transferor',
  'type': 'rse_quota',
  'rse_name': 'BLAH',
  'bytes_quota': 1234,
  'bytes_used': 1234,
  'bytes_remaining': 1234,
  'rse_status': 'available | out-of-quota'},

an overview of the component (1 document every 10min):

 {'agent_url': 'reqmgr2_ms_transferor',
  'type': 'service_summary',
  'service_status': 'OK | Error',
  'microservice_version': '1.3.2.pre5',
  'wmcore_version': '1.3.2.pre5',
  'execution_time': 12.45,
  'success_request_transition': 10,
  'failed_request_transition': 10,
  'total_num_requests': 10,
  'total_num_campaigns': 155},

an overview of the data transfers requested in the cycle (1 document every 10min):

 {'agent_url': 'reqmgr2_ms_transferor',
  'type': 'data_transfers',
  'num_datasets_subscribed': 155,
  'bytes_datasets_subscribed': 155,
  'num_blocks_subscribed': 155,
  'bytes_blocks_subscribed': 155,
},

As you can see, only agent_url and type would be in common. All the rest is specific to the type of information/document that we want to post. Please let us know your thoughts.
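The overlap between the three example documents can be checked mechanically. A small sketch using the documents above, with one value picked wherever the example lists alternatives ('available', 'OK'):

```python
rse_quota = {'agent_url': 'reqmgr2_ms_transferor',
             'type': 'rse_quota',
             'rse_name': 'BLAH',
             'bytes_quota': 1234,
             'bytes_used': 1234,
             'bytes_remaining': 1234,
             'rse_status': 'available'}

service_summary = {'agent_url': 'reqmgr2_ms_transferor',
                   'type': 'service_summary',
                   'service_status': 'OK',
                   'microservice_version': '1.3.2.pre5',
                   'wmcore_version': '1.3.2.pre5',
                   'execution_time': 12.45,
                   'success_request_transition': 10,
                   'failed_request_transition': 10,
                   'total_num_requests': 10,
                   'total_num_campaigns': 155}

data_transfers = {'agent_url': 'reqmgr2_ms_transferor',
                  'type': 'data_transfers',
                  'num_datasets_subscribed': 155,
                  'bytes_datasets_subscribed': 155,
                  'num_blocks_subscribed': 155,
                  'bytes_blocks_subscribed': 155}

docs = [rse_quota, service_summary, data_transfers]

# Keys present in every document vs. keys present in at least one
common_keys = set.intersection(*(set(d) for d in docs))
all_keys = set.union(*(set(d) for d in docs))

print(sorted(common_keys))  # ['agent_url', 'type']
print(len(all_keys))        # 19
```

A union schema would therefore carry 19 keys in every document, of which a data_transfers document would fill in only 6.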

vkuznet commented 4 years ago

Alan, to give you more feedback I need to know more:

If aggregation across docs is not required, then it is better to keep them in separate collections/topics. If you plan to aggregate across different structures, then it is better to consult with the MONIT team and ask their opinion.
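If the documents do end up in separate collections/topics, one possible first step is to group them by the `type` key used in the example documents. The sketch below is an assumption about how that could look; the helper `group_by_type` and the sample values are hypothetical, and how each group maps to a MONIT topic is not specified in this thread, so it is left out.

```python
from collections import defaultdict


def group_by_type(docs):
    """Group flat documents by their 'type' key, keeping the original order."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc['type']].append(doc)
    return dict(groups)


# Made-up sample cycle: two quota docs and one summary doc
cycle_docs = [
    {'agent_url': 'reqmgr2_ms_transferor', 'type': 'rse_quota', 'rse_name': 'SiteA'},
    {'agent_url': 'reqmgr2_ms_transferor', 'type': 'rse_quota', 'rse_name': 'SiteB'},
    {'agent_url': 'reqmgr2_ms_transferor', 'type': 'service_summary', 'execution_time': 12.45},
]

grouped = group_by_type(cycle_docs)
for doc_type, members in grouped.items():
    print(doc_type, len(members))
```

Each group then holds documents with a single structure, which is what a per-topic ES schema expects.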