Open chrisburr opened 3 hours ago
can we patch result["closeTransport"]
with result.get("closeTransport", None)
?
_processProposal
isn't guaranteed to set closeTransport
at all if the socket fails to send initially
but
Another case. If I use py-spy --nonblocking
it doesn't revive the process but if I don't include that flag it spits out this error:
2024-10-23 07:24:59 UTC DataManagement/FTS3Agent/treatOperation/9186766 INFO: FTS3Operation 9186766: Submitted job for 1 transfers
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent/submit/9186784/CERN_MC-DST-EOS_IN2P3-ARCHIVE INFO: Got GUID e90bc34c-910f-11ef-a5b7-fa163e0a22e3
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Issue getting socket: <DIRAC.Core.DISET.private.Transports.M2SSLTransport.SSLTransport object at 0x7f26805295d0> : ('dips', 'lbvobox306.cern.ch', 9197, 'DataManagement/FileCatalog') : timed
out:TimeoutError('timed out')
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Non-responding URL temporarily banned dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent INFO: Retry connection : 1 to dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Issue getting socket: <DIRAC.Core.DISET.private.Transports.M2SSLTransport.SSLTransport object at 0x7f26c0268790> : ('dips', 'lbvobox306.cern.ch', 9197, 'DataManagement/FileCatalog') : timed
out:TimeoutError('timed out')
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Non-responding URL temporarily banned dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent INFO: Retry connection : 1 to dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Issue getting socket: <DIRAC.Core.DISET.private.Transports.M2SSLTransport.SSLTransport object at 0x7f27405f1c10> : ('dips', 'lbvobox306.cern.ch', 9197, 'DataManagement/FileCatalog') : timed
out:TimeoutError('timed out')
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Issue getting socket: <DIRAC.Core.DISET.private.Transports.M2SSLTransport.SSLTransport object at 0x7f26e00de9d0> : ('dips', 'lbvobox306.cern.ch', 9197, 'DataManagement/FileCatalog') : timed
out:TimeoutError('timed out')
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Non-responding URL temporarily banned dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent INFO: Retry connection : 1 to dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Issue getting socket: <DIRAC.Core.DISET.private.Transports.M2SSLTransport.SSLTransport object at 0x7f274053f550> : ('dips', 'lbvobox306.cern.ch', 9197, 'DataManagement/FileCatalog') : timed
out:TimeoutError('timed out')
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Non-responding URL temporarily banned dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent INFO: Retry connection : 1 to dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Issue getting socket: <DIRAC.Core.DISET.private.Transports.M2SSLTransport.SSLTransport object at 0x7f26c00242d0> : ('dips', 'lbvobox306.cern.ch', 9197, 'DataManagement/FileCatalog') : timed
out:TimeoutError('timed out')
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Non-responding URL temporarily banned dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent INFO: Retry connection : 1 to dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Issue getting socket: <DIRAC.Core.DISET.private.Transports.M2SSLTransport.SSLTransport object at 0x7f2740515750> : ('dips', 'lbvobox306.cern.ch', 9197, 'DataManagement/FileCatalog') : timed
out:TimeoutError('timed out')
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Non-responding URL temporarily banned dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent INFO: Retry connection : 1 to dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Non-responding URL temporarily banned dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent INFO: Retry connection : 1 to dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Issue getting socket: <DIRAC.Core.DISET.private.Transports.M2SSLTransport.SSLTransport object at 0x7f2754211750> : ('dips', 'lbvobox306.cern.ch', 9197, 'DataManagement/FileCatalog') : timed
out:TimeoutError('timed out')
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Non-responding URL temporarily banned dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Issue getting socket: <DIRAC.Core.DISET.private.Transports.M2SSLTransport.SSLTransport object at 0x7f26806edb10> : ('dips', 'lbvobox306.cern.ch', 9197, 'DataManagement/FileCatalog') : timed
out:TimeoutError('timed out')
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Non-responding URL temporarily banned dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent INFO: Retry connection : 1 to dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent/submit/9186782/GRIDKA_MC-DST_CNAF-ARCHIVE INFO: Got GUID e903e244-910f-11ef-bcef-fa163e0a22e3
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent/treatOperation/9186784 INFO: FTS3Operation 9186784: Submitted job for 2 transfers
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent/treatOperation/9186782 INFO: FTS3Operation 9186782: Submitted job for 1 transfers
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Issue getting socket: <DIRAC.Core.DISET.private.Transports.M2SSLTransport.SSLTransport object at 0x7f2700160350> : ('dips', 'lbvobox306.cern.ch', 9197, 'DataManagement/FileCatalog') : timed
out:TimeoutError('timed out')
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Non-responding URL temporarily banned dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent INFO: Retry connection : 1 to dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent INFO: Retry connection : 1 to dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
It looks like FTS3Agent is still stuck, with 48 threads waiting to acquire this lock:
And the reason they're waiting is because a 49th thread is stuck in this state:
Thread 774821 (idle): "Thread-81 (worker)"
send (stomp/transport.py:639)
Arguments:
self: <Transport at 0x7f275dca8150>
encoded_frame: <bytes at 0x7f2740790cb0>
transmit (stomp/transport.py:282)
Arguments:
self: <Transport at 0x7f275dca8150>
frame: <Frame at 0x7f275cd3ed10>
Locals:
listeners: [("protocol-listener", <StompConnection11 at 0x7f275dcaa510>)]
_: "protocol-listener"
listener: <StompConnection11 at 0x7f275dcaa510>
lines: [<bytes at 0x7f27405e3bd0>, <bytes at 0x55adc06cd5b0>, <bytes at 0x7f26e009f630>, <bytes at 0x7f26e0784b20>, <bytes at 0x55adc06cd5b0>, <bytes at 0x7f27406e23a0>, <bytes at 0x7f275cb43720>]
packed_frame: <bytes at 0x7f2740790cb0>
send_frame (stomp/protocol.py:248)
Arguments:
self: <StompConnection11 at 0x7f275dcaa510>
cmd: "SEND"
headers: {"destination": "/queue/lhcb.dirac.logging", "content-length": 298}
body: "{\"levelname\": \"INFO\", \"message\": \"Got GUID ca170fea-912e-11ef-af36-fa163e5a69c7\", \"asctime\": \"2024-10-23 11:06:01,727334\", \"componentname\": \"DataManagement/FTS3Agent\", \"varmessage\": \"\", \"customname\": \"\", \"local_name\": \"submit/9186818/IN2P3_MC-DST_NIPNE-07_MC-DST\", \"hostname\": \"lbvobox303.cern.ch\"}"
Locals:
frame: <Frame at 0x7f275cd3ed10>
send (stomp/protocol.py:396)
Arguments:
self: <StompConnection11 at 0x7f275dcaa510>
destination: "/queue/lhcb.dirac.logging"
body: "{\"levelname\": \"INFO\", \"message\": \"Got GUID ca170fea-912e-11ef-af36-fa163e5a69c7\", \"asctime\": \"2024-10-23 11:06:01,727334\", \"componentname\": \"DataManagement/FTS3Agent\", \"varmessage\": \"\", \"customname\": \"\", \"local_name\": \"submit/9186818/IN2P3_MC-DST_NIPNE-07_MC-DST\", \"hostname\": \"lbvobox303.cern.ch\"}"
content_type: None
headers: {"destination": "/queue/lhcb.dirac.logging", "content-length": 298}
Locals:
keyword_headers: {}
put (DIRAC/Resources/MessageQueue/StompMQConnector.py:166)
Arguments:
self: <StompMQConnector at 0x7f275dc5c410>
message: {"levelname": "INFO", "message": "Got GUID ca170fea-912e-11ef-af36-fa163e5a69c7", "asctime": "2024-10-23 11:06:01,727334", "componentname": "DataManagement/FTS3Agent", "varmessage": "", "customname": "", "local_name": "submit/9186818/IN2P3_MC-DST_NIPNE-07_MC-DST", "hostname": "lbvobox303.cern.ch"}
parameters: {"destination": "/queue/lhcb.dirac.logging"}
Locals:
log: <Logging at 0x7f275c275090>
destination: "/queue/lhcb.dirac.logging"
put (DIRAC/Resources/MessageQueue/MQProducer.py:20)
Arguments:
self: <MQProducer at 0x7f275dc64b50>
msg: {"levelname": "INFO", "message": "Got GUID ca170fea-912e-11ef-af36-fa163e5a69c7", "asctime": "2024-10-23 11:06:01,727334", "componentname": "DataManagement/FTS3Agent", "varmessage": "", "customname": "", "local_name": "submit/9186818/IN2P3_MC-DST_NIPNE-07_MC-DST", "hostname": "lbvobox303.cern.ch"}
Locals:
result: {"OK": True, "Value": <StompMQConnector at 0x7f275dc5c410>}
connector: <StompMQConnector at 0x7f275dc5c410>
emit (DIRAC/FrameworkSystem/private/standardLogging/Handler/MessageQueueHandler.py:49)
Arguments:
self: <MessageQueueHandler at 0x7f275e0dc8d0>
record: <LogRecord at 0x7f26e00ae890>
Locals:
strRecord: "{\"levelname\": \"INFO\", \"message\": \"Got GUID ca170fea-912e-11ef-af36-fa163e5a69c7\", \"asctime\": \"2024-10-23 11:06:01,727334\", \"componentname\": \"DataManagement/FTS3Agent\", \"varmessage\": \"\", \"customname\": \"\", \"local_name\": \"submit/9186818/IN2P3_MC-DST_NIPNE-07_MC-DST\", \"hostname\": \"lbvobox303.cern.ch\"}"
handle (logging/__init__.py:978)
Arguments:
self: <MessageQueueHandler at 0x7f275e0dc8d0>
record: <LogRecord at 0x7f26e00ae890>
Locals:
rv: True
callHandlers (logging/__init__.py:1706)
Arguments:
self: <Logger at 0x7f275edd3e90>
record: <LogRecord at 0x7f26e00ae890>
Locals:
c: <Logger at 0x7f275edd3e90>
found: 2
hdlr: <MessageQueueHandler at 0x7f275e0dc8d0>
handle (logging/__init__.py:1644)
Arguments:
self: <Logger at 0x7f275edd3e90>
record: <LogRecord at 0x7f26e00ae890>
_log (logging/__init__.py:1634)
Arguments:
self: <Logger at 0x7f275edd3e90>
level: 20
msg: "%s"
args: ("Got GUID ca170fea-912e-11ef-af36-fa163e5a69c7")
exc_info: False
extra: {"componentname": "DataManagement/FTS3Agent", "varmessage": "", "spacer": "", "customname": "", "headerIsShown": True, "timeStampIsShown": True, "contextIsShown": True, "threadIDIsShown": False, "color": True, "local_name": "submit/9186818/IN2P3_MC-DST_NIPNE-07_MC-DST"}
stack_info: False
stacklevel: 1
Locals:
sinfo: None
fn: "(unknown file)"
lno: 0
func: "(unknown function)"
record: <LogRecord at 0x7f26e00ae890>
log (logging/__init__.py:1559)
Arguments:
self: <Logger at 0x7f275edd3e90>
level: 20
msg: "%s"
Locals:
args: ("Got GUID ca170fea-912e-11ef-af36-fa163e5a69c7")
kwargs: {"exc_info": False, "extra": {"componentname": "DataManagement/FTS3Agent", "varmessage": "", "spacer": "", "customname": "", "headerIsShown": True, "timeStampIsShown": True, "contextIsShown": True, "threadIDIsShown": False, "color": True, "local_name": "submit/9186818/IN2P3_MC-DST_NIPNE-07_MC-DST"}}
_createLogRecord (DIRAC/FrameworkSystem/private/standardLogging/Logging.py:421)
Arguments:
....
I think the problem is that we need to set a timeout in StompMQConnector.setupConnection
.
I think it can be set here as timeout
key
the alternative is "heartbeats" https://jasonrbriggs.github.io/stomp.py/api.html#dealing-with-disconnects
In LHCb we're seeing many services stuck. If I attach py-spy the spit out a bunch of errors and then start working again: