dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0
46 stars 107 forks source link

Latest stomp library is excessively verbose with regards to the heartbeat feature #11281

Closed amaltaro closed 2 years ago

amaltaro commented 2 years ago

Impact of the bug A dependency in WMCore, CMSMonitoring and CRAB*

Describe the bug This is not really a bug, neither anything to be fixed in WMCore, but here is a GH issue to track this issue with the stomp library, currently in version 7.0.0 in both cmsdist and CMSMonitoring package.

It makes the logs of AgentStatusWatcher extremely verbose with INFO messages like [1]. Central services are using this version as well, so I think the same log pollution is happening over there(!)

How to reproduce it Use the latest CMSMonitoring python package (0.6.7, I think)

Expected behavior We should update those message log level to DEBUG. It's probably a good idea to provide a PR upstream to the stomp project, such that we can update the stomp version in the near future instead of keep applying patches.

Additional context and error message [1]

2022-09-09 20:47:08,879:139964385847040:INFO:listener:Starting heartbeat loop
2022-09-09 20:47:08,880:139965237520128:INFO:transport:attempt reconnection (True, None, 0)
2022-09-09 20:47:08,880:139965237520128:INFO:transport:Attempting connection to host xxx, port xxxx
2022-09-09 20:47:08,880:139965237520128:INFO:transport:Established connection to host xxx, port xxxx
2022-09-09 20:47:08,880:139964377454336:INFO:transport:Starting receiver loop (<Thread(Thread-14, started daemon 139964377454336)>)
2022-09-09 20:47:08,880:139965237520128:INFO:transport:Created thread <Thread(Thread-14, started daemon 139964377454336)> using func <function default_create_thread at 0x7f4c3faa1d30>
2022-09-09 20:47:08,881:139965237520128:INFO:transport:Sending frame: 'STOMP'
2022-09-09 20:47:08,882:139964377454336:INFO:transport:Received frame: 'CONNECTED', len(body)=0
2022-09-09 20:47:08,882:139964369061632:INFO:listener:Starting heartbeat loop
2022-09-09 20:47:08,882:139965237520128:INFO:transport:Sending frame: 'SEND'
mrceyhun commented 2 years ago

@amaltaro I came across this issue. Probably it's the result that you provide your own logging as logger. In our default logger, we give WARNING level as default[1] and we're not getting these verbose logs. I guess the issue can be solved if you provide your logger object with WARNING level. IMO, It's not good logic if we set given logger instance's logging level in StompAMQ7 side (line 189). However, if you think this is a good idea, we can also set logging level to given logger. I explained this because StompAMQ has loglevel parameter but currently it's applied only to default logger object.

[1] https://github.com/dmwm/CMSMonitoring/blob/master/src/python/CMSMonitoring/StompAMQ7.py#L182-L196

amaltaro commented 2 years ago

@mrceyhun let me try to see if the stomp team accepts a PR changing those log records from INFO to DEBUG, if they don't, then I think the best will be for WMCore to stop passing its own logger object. Problem with the latter approach is that it will use the default logger and from time to time I struggle to get them recorded to the right log file.

mrceyhun commented 2 years ago

@amaltaro sorry if I'm taking your time. I checked the stomp.py code base and they use INFO logs in transport.py, but as you said they should be debug logs. I found a workaround on our side. I found that stomp.py has its own logger with its name. I applied this change in our StompAMQ7 : logging.getLogger("stomp.py").setLevel(logging.WARNING) and it worked in my test.

Full test to reproduce the problem:

import logging
# Set log level to INFO for test
logging.basicConfig(level=logging.INFO

import time
from CMSMonitoring.StompAMQ7 import StompAMQ7

# you can read from file too
creds = {
  "username": "",
  "password": "",
  "producer": "cms-training",
  "topic": "/topic/cms.training",
  "host": "cms-test-mb.cern.ch",
  "port": 61323,
  "host_and_ports": "cms-test-mb.cern.ch:61323",
  "cert": "/afs/cern.ch/user/c/cmsmonit/public/.globus/robot-training-cert.pem",
  "ckey": "/afs/cern.ch/user/c/cmsmonit/public/.globus/robot-training-key.pem"
}

# you can functionize this
username = creds.get('username', '')
password = creds.get('password', '')
producer = creds.get('producer')
doc_type = creds.get('type', None)
topic = creds.get('topic')
host = creds.get('host')
port = int(creds.get('port'))
cert = creds.get('cert', None)
ckey = creds.get('ckey', None)

# create stompAMQ instance and give your logging object
stompSvc = StompAMQ7(username=username, password=password, producer=producer, topic=topic, key=ckey, cert=cert,
                     validation_schema=None, host_and_ports=[(host, port)], logger=logging, send_timeout=400)
messages = []
# your documents
docs = [{"value1": "test1", "value2": "test2"}, {"value3": "test3", "value4": "test4"}]
for doc in docs:
    notif, _, _ = stompSvc.make_notification(payload=doc, doc_type=doc_type, ts=int(time.time()))
    messages.append(notif)
stompSvc.send(messages)

This will produce lots of verbose hearbeat(actually stomp.py/stomp/transport.py) logs. When we set logging.getLogger("stomp.py").setLevel(logging.WARNING) in our StompAMQ7 class which will change default logging level of stomp.py, then we will not see these verbose logs.

This is just a workaround, as it is suggested by you, best way is to change upstream repo logging. We can apply this change and I don't see any blocker for that.

amaltaro commented 2 years ago

@mrceyhun thanks for proposing this workaround, Ceyhun. I haven't tested it yet, but will do later in the day and let you know.

amaltaro commented 2 years ago

@mrceyhun indeed your proposed fix does the tricky. I have made a temporary PR here: https://github.com/dmwm/CMSMonitoring/pull/164

such that we can keep track of things while an upstream PR is provided and discussed with the stomp.py developers.

amaltaro commented 2 years ago

@mrceyhun I created this GH issue for stomp.py (including a PR) https://github.com/jasonrbriggs/stomp.py/issues/400

but after checking their code in dev or master branch, there is much more logging records now. I doubt the developers will want to move all of those to debug, so I'd suggest to actually move forward with the CMSMonitoring changes and make a new release available in PiPy. What do you think?

mrceyhun commented 2 years ago

Hi @amaltaro It's just a different point of view to logs. They think they are INFO logs but for you(and also I agree with you), they should be debug log. I have one suggestion, let me put in your CMSMonitoring PR.

amaltaro commented 2 years ago

For the record, it's somehow surprising to me, but it's not affecting central services:

/cephfs/product/dmwm-logs $ grep -r CONNECTED *20220913*
$
mrceyhun commented 2 years ago

For the record, it's somehow surprising to me, but it's not affecting central services:

I don't know these services but may be their logging level is WARNING?

amaltaro commented 2 years ago

A new WMAgent release 2.1.2.patch1 with this CMSMonitoring fix should be available in the next hour. Thanks @mrceyhun for helping out with this investigation and fix.

amaltaro commented 2 years ago

For the record, changes upstream have been accepted: https://github.com/jasonrbriggs/stomp.py/pull/401

and eventually (stomp.py >8.x.y) we can remove the workaround made in CMSMonitoring.

mrceyhun commented 2 years ago

@amaltaro Great to hear that!

Last time v8.0.0 was not working well, there was some problems that i could not remember now. I will check the latest version and update our StompAMQ module accordingly. If it is not urgent, I will request a couple of days from you to implement it :)

amaltaro commented 2 years ago

@mrceyhun I actually think we should wait for a couple of months before trying to upgrade stomp.py version. Just to make sure those changes get integrated into whatever is provided over PyPi, and also to let this/these latest versions to mature a bit more.

mrceyhun commented 2 years ago

Okay @amaltaro , makes sense