dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

Remove MonIT wrapper from WMCore #9558

Open vkuznet opened 4 years ago

vkuznet commented 4 years ago

As I mentioned in #9555, the Monit wrapper should not be part of the WMCore codebase, since monitoring tasks are better kept in a separate repository (CMSMonitoring). This will allow better code maintenance across different groups in CMS.

Unfortunately, the DMWM team rushed into developing their own implementation without any consultation or coordination with the CMS Monitoring group, whose responsibility is to provide the necessary middleware to access the MONIT infrastructure. We already struggled (and spent months of effort) back-porting a different implementation of StompAMQ, and now we will need to repeat the same with the Monit wrapper.

There is already a monit.py module in place which provides generic access to Monit data sources, both ES and InfluxDB, via the grafana proxy. We also support different mappings to MONIT datasource IDs.

The DMWM group should provide clear requirements for their access to the Monit infrastructure by opening a relevant Jira ticket.

I suggest removing the Monit wrapper from WMCore and converting its usage to monit.py from CMSMonitoring.

amaltaro commented 4 years ago

Valentin, in case you want to contribute further on this issue, the use case is the following.

The Site Support team has deprecated the old dashboard (http://dashb-ssb.cern.ch/dashboard), from which WMAgent used to fetch 3 different metrics (I'm pretty sure there was a JIRA for this...). The data has now been made available through a monit-grafana API.

Stephan L. has provided a simple bash script which can be used to query that data. Here is the data/query sent in the POST request:

  -d '{"search_type":"query_then_fetch","ignore_unavailable":true,"index":["monit_prod_cmssst_*"]}
{"size":500,"query":{"bool":{"filter":[{"range":{"metadata.timestamp":{"gte":"now-1d","lte":"now","format":"epoch_millis"}}},{"query_string":{"analyze_wildcard":true,"query":"metadata.type: ssbmetric AND metadata.type_prefix:raw AND metadata.path: sts15min"}}]}},"sort":{"metadata.timestamp":{"order":"desc","unmapped_type":"boolean"}},"script_fields":{},"docvalue_fields":["metadata.timestamp"]}

As you can see, these are the usual parameters for a grafana/ES query, including a fixed number of rows and a time range covering the last 24h.
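For reference, the payload above can be assembled programmatically. The following is only a minimal sketch: the function name `build_msearch_body` and its parameters are hypothetical, and it assumes the two-line payload follows the newline-delimited format of an Elasticsearch multi-search request (one header line, one query line, each terminated by a newline). It only builds the body; the endpoint URL and token handling from the bash script are not shown here.

```python
import json


def build_msearch_body(index_pattern, query_string, size=500, since="now-1d"):
    """Build a two-line, newline-delimited body (hypothetical helper).

    Line 1 is the search header, line 2 is the query itself. The field
    names are copied from the POST data shown above.
    """
    header = {
        "search_type": "query_then_fetch",
        "ignore_unavailable": True,
        "index": [index_pattern],
    }
    query = {
        "size": size,  # fixed number of rows
        "query": {
            "bool": {
                "filter": [
                    # time range: from `since` until now
                    {"range": {"metadata.timestamp": {
                        "gte": since, "lte": "now", "format": "epoch_millis"}}},
                    {"query_string": {
                        "analyze_wildcard": True, "query": query_string}},
                ]
            }
        },
        "sort": {"metadata.timestamp": {"order": "desc", "unmapped_type": "boolean"}},
        "script_fields": {},
        "docvalue_fields": ["metadata.timestamp"],
    }
    # Each JSON document sits on its own line, with a trailing newline.
    return json.dumps(header) + "\n" + json.dumps(query) + "\n"


body = build_msearch_body(
    "monit_prod_cmssst_*",
    "metadata.type: ssbmetric AND metadata.type_prefix:raw AND metadata.path: sts15min",
)
print(body)
```

Keeping `size` and `since` as parameters makes the fixed row count and the 24h window explicit, so they can be changed without editing the JSON by hand.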

Can you please check your monit.py client script and say whether it would work for this use case? If not, can you make the necessary changes? Once the CMSMonitoring client is working properly, we can start considering updating WMCore. Thanks.

vkuznet commented 4 years ago

Alan, yes, it works with the current code, e.g.

git clone git@github.com:dmwm/CMSMonitoring.git
cd CMSMonitoring

Then you need to do the following steps

For example

cat > ssb_query.json << EOF
{"size":500,"query":{"bool":{"filter":[{"range":{"metadata.timestamp":{"gte":"now-1d","lte":"now","format":"epoch_millis"}}},{"query_string":{"analyze_wildcard":true,"query":"metadata.type: ssbmetric AND metadata.type_prefix:raw AND metadata.path: sts15min"}}]}},"sort":{"metadata.timestamp":{"order":"desc","unmapped_type":"boolean"}},"script_fields":{},"docvalue_fields":["metadata.timestamp"]}
EOF

# run the script
src/python/CMSMonitoring/monit.py --query=ssb_query.json --dbid=9475 --dbname=monit_prod_cmssst --token=token

# here are the results
{"responses": [{"status": 200, "hits": {"hits": [
{"sort": [1584120150000], "_type": "raw_ssbmetric", "_source": {"data": {"status": "enabled", "prod_status": "enabled", "name": "T1_RU_JINR", "detail": "Life: manual override by rmaciula (Tier-1s are never put into Waiting Room or Morgue state),\n,\nCrab: 3 Hammer Cloud ok state(s) in 3 days", "manual_life": "enabled", "crab_status": "enabled"}, "metadata": {"kafka_timestamp": 1584120434515, "producer": "cmssst", "type_prefix": "raw", "timestamp": 1584120150000, "partition": "9", "topic": "cmssst_raw", "path": "sts15min", "_id": "e92aa2ce-b8a4-0efb-4383-3928a4f0e6d9", "type": "ssbmetric"}}, "_score": null, "_index": "monit_prod_cmssst_raw_ssbmetric-2020-03-13", "fields": {"metadata.timestamp": ["2020-03-13T17:22:30.000Z"]}, "_id": "e92aa2ce-b8a4-0efb-4383-3928a4f0e6d9"},
{"sort": [1584120150000], "_type": "raw_ssbmetric", "_source": {"data": {"status": "enabled", "prod_status": "enabled", "crab_status": "enabled", "name": "T2_FR_IPHC", "detail": "Crab: 3 Hammer Cloud ok state(s) in 3 days"}, "metadata": {"kafka_timestamp": 1584120434515, "producer": "cmssst", "type_prefix": "raw", "timestamp": 1584120150000, "partition": "9", "topic": "cmssst_raw", "path": "sts15min", "_id": "df13a52d-d0f7-2be4-4504-7c38b80f9f7f", "type": "ssbmetric"}}, "_score": null, "_index": "monit_prod_cmssst_raw_ssbmetric-2020-03-13", "fields": {"metadata.timestamp": ["2020-03-13T17:22:30.000Z"]}, "_id": "df13a52d-d0f7-2be4-4504-7c38b80f9f7f"},
{"sort": [1584120150000], "_type": "raw_ssbmetric", "_source": {"data": {"status": "enabled", "prod_status": "enabled", "crab_status": "enabled", "name": "T2_UK_London_Brunel", "detail": "Crab: 3 Hammer Cloud ok state(s) in 3 days"}, "metadata": {"kafka_timestamp": 1584120434515, "producer": "cmssst", "type_prefix": "raw", "timestamp": 1584120150000, "partition": "9", "topic": "cmssst_raw", "path": "sts15min", "_id": "7a37e82e-f219-34cc-d003-3fa28d802ef2", "type": "ssbmetric"}}, "_score": null, "_index": "monit_prod_cmssst_raw_ssbmetric-2020-03-13", "fields": {"metadata.timestamp": ["2020-03-13T17:22:30.000Z"]}....

Therefore, you only need the monit.py script, which contains everything required to fetch the results. You may then parse the results as required for your application.
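As an illustration of the parsing step, here is a minimal sketch that pulls the per-site status out of a response shaped like the output above. The function name `latest_site_status` is hypothetical and is not part of monit.py; the sample data is abbreviated from the results shown. Because the query sorts by `metadata.timestamp` in descending order, the sketch assumes the first hit seen for a site is the most recent one.

```python
def latest_site_status(response):
    """Map site name -> status from a MONIT/ES response dict (hypothetical helper)."""
    statuses = {}
    for resp in response.get("responses", []):
        for hit in resp.get("hits", {}).get("hits", []):
            data = hit.get("_source", {}).get("data", {})
            name = data.get("name")
            # Hits arrive newest first, so keep only the first entry per site.
            if name and name not in statuses:
                statuses[name] = data.get("status")
    return statuses


# Abbreviated sample with the same structure as the output above.
sample = {
    "responses": [
        {
            "status": 200,
            "hits": {
                "hits": [
                    {"_source": {"data": {"name": "T1_RU_JINR", "status": "enabled"}}},
                    {"_source": {"data": {"name": "T2_FR_IPHC", "status": "enabled"}}},
                ]
            },
        }
    ]
}

site_status = latest_site_status(sample)
print(site_status)
```

The same loop can collect `prod_status` or `crab_status` instead, since they sit next to `status` in each hit's `data` block.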

And, speaking of Jira, if you're sure that there was a ticket, please post it; otherwise please ask or create one. We worked closely with Stephan on the SSB dashboard transition. If something is not working in the monit script, please open an appropriate Jira ticket.