DIRACGrid / DIRAC

DIRAC Grid
http://diracgrid.org
GNU General Public License v3.0
114 stars 175 forks source link

Services stuck #7851

Open chrisburr opened 3 hours ago

chrisburr commented 3 hours ago

In LHCb we're seeing many services stuck. If I attach py-spy the spit out a bunch of errors and then start working again:

2024-10-23 07:35:36 UTC WorkloadManagement/JobMonitoring NOTICE: Executing action ([2001:1458:d03:3::100:319]:37772)[lhcb_user:menghao] RPC/getJobsStates(<masked>)
2024-10-23 07:35:36 UTC WorkloadManagement/JobMonitoring NOTICE: Returning response ([2001:1458:d03:3::100:319]:37772)[lhcb_user:menghao] (0.04 secs) OK
2024-10-23 07:35:40 UTC WorkloadManagement/JobMonitoring NOTICE: Executing action ([2001:1458:d00:2f::100:1fb]:35594)[hosts:lbvobox301.cern.ch] RPC/getJobsStatus(<masked>)
2024-10-23 07:35:40 UTC WorkloadManagement/JobMonitoring NOTICE: Returning response ([2001:1458:d00:2f::100:1fb]:35594)[hosts:lbvobox301.cern.ch] (0.01 secs) OK
2024-10-23 11:00:33 UTC WorkloadManagement/JobMonitoring WARN: Sleeping as service requested throttling WorkloadManagement/JobMonitoring
2024-10-23 11:00:33 UTC WorkloadManagement/JobMonitoring ERROR: Error processing proposal Error in _write: (110, 'Connection timed out') SSLError(110, 'Connection timed out')
2024-10-23 11:00:33 UTC WorkloadManagement/JobMonitoring ERROR: Exception in handleConnection thread
Traceback (most recent call last):
  File "/opt/dirac/versions/v11.0.49-1729166531/Linux-x86_64/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/dirac/versions/v11.0.49-1729166531/Linux-x86_64/lib/python3.11/site-packages/DIRAC/Core/DISET/private/Service.py", line 351, in _processInThread
    if result["closeTransport"] or not result["OK"]:
       ~~~~~~^^^^^^^^^^^^^^^^^^
KeyError: 'closeTransport'
2024-10-23 11:00:33 UTC WorkloadManagement/JobMonitoring ERROR: Exception in handleConnection thread
Traceback (most recent call last):
  File "/opt/dirac/versions/v11.0.49-1729166531/Linux-x86_64/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/dirac/versions/v11.0.49-1729166531/Linux-x86_64/lib/python3.11/site-packages/DIRAC/Core/DISET/private/Service.py", line 351, in _processInThread
    if result["closeTransport"] or not result["OK"]:
       ~~~~~~^^^^^^^^^^^^^^^^^^
KeyError: 'closeTransport'
2024-10-23 11:00:33 UTC WorkloadManagement/JobMonitoring ERROR: Exception in handleConnection thread
Traceback (most recent call last):
  File "/opt/dirac/versions/v11.0.49-1729166531/Linux-x86_64/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/dirac/versions/v11.0.49-1729166531/Linux-x86_64/lib/python3.11/site-packages/DIRAC/Core/DISET/private/Service.py", line 351, in _processInThread
    if result["closeTransport"] or not result["OK"]:
       ~~~~~~^^^^^^^^^^^^^^^^^^
KeyError: 'closeTransport'
2024-10-23 11:00:33 UTC WorkloadManagement/JobMonitoring ERROR: Exception in handleConnection thread
Traceback (most recent call last):
  File "/opt/dirac/versions/v11.0.49-1729166531/Linux-x86_64/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/dirac/versions/v11.0.49-1729166531/Linux-x86_64/lib/python3.11/site-packages/DIRAC/Core/DISET/private/Service.py", line 351, in _processInThread
    if result["closeTransport"] or not result["OK"]:
       ~~~~~~^^^^^^^^^^^^^^^^^^
KeyError: 'closeTransport'
... more of the same
2024-10-23 11:00:33 UTC WorkloadManagement/JobMonitoring WARN: Error while accepting a connection:  Error in acceptConnection: [Errno 107] Transport endpoint is not connected OSError(107, 'Transport endpoint is not connected')
2024-10-23 11:00:33 UTC WorkloadManagement/JobMonitoring WARN: Error while accepting a connection:  Error in acceptConnection: [Errno 107] Transport endpoint is not connected OSError(107, 'Transport endpoint is not connected')
2024-10-23 11:00:33 UTC WorkloadManagement/JobMonitoring WARN: Error while accepting a connection:  Error in acceptConnection: [Errno 107] Transport endpoint is not connected OSError(107, 'Transport endpoint is not connected')
2024-10-23 11:00:33 UTC WorkloadManagement/JobMonitoring WARN: Error while accepting a connection:  Error in acceptConnection: [Errno 107] Transport endpoint is not connected OSError(107, 'Transport endpoint is not connected')
2024-10-23 11:00:33 UTC WorkloadManagement/JobMonitoring NOTICE: Executing action ([2001:1458:d00:68::100:468]:55134)[hosts:lbvobox900.cern.ch] RPC/getJobsStatus(<masked>)
2024-10-23 11:00:33 UTC WorkloadManagement/JobMonitoring NOTICE: Returning response ([2001:1458:d00:68::100:468]:55134)[hosts:lbvobox900.cern.ch] (0.00 secs) OK
2024-10-23 11:00:33 UTC WorkloadManagement/JobMonitoring NOTICE: Executing action ([2001:1458:d00:68::100:468]:55144)[hosts:lbvobox900.cern.ch] RPC/getJobsStatus(<masked>)
2024-10-23 11:00:33 UTC WorkloadManagement/JobMonitoring NOTICE: Returning response ([2001:1458:d00:68::100:468]:55144)[hosts:lbvobox900.cern.ch] (0.01 secs) OK
2024-10-23 11:00:33 UTC WorkloadManagement/JobMonitoring NOTICE: Executing action ([2001:1458:d00:2f::100:1fb]:51200)[hosts:lbvobox301.cern.ch] RPC/getJobParameters(<masked>)
2024-10-23 11:00:33 UTC WorkloadManagement/JobMonitoring NOTICE: Returning response ([2001:1458:d00:2f::100:1fb]:51200)[hosts:lbvobox301.cern.ch] (0.03 secs) OK
ryuwd commented 3 hours ago

can we patch result["closeTransport"] with result.get("closeTransport", None)?

_processProposal isn't guaranteed to set closeTransport at all if the socket fails to send initially

https://github.com/DIRACGrid/DIRAC/blob/integration/src/DIRAC/Core/DISET/private/Service.py#L349-L354

but

https://github.com/DIRACGrid/DIRAC/blob/integration/src/DIRAC/Core/DISET/private/Service.py#L510-L515

chrisburr commented 3 hours ago

Another case. If I use py-spy --nonblocking it doesn't revive the process but if I don't include that flag it spits out this error:

2024-10-23 07:24:59 UTC DataManagement/FTS3Agent/treatOperation/9186766 INFO: FTS3Operation 9186766: Submitted job for 1 transfers
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent/submit/9186784/CERN_MC-DST-EOS_IN2P3-ARCHIVE INFO: Got GUID e90bc34c-910f-11ef-a5b7-fa163e0a22e3
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Issue getting socket: <DIRAC.Core.DISET.private.Transports.M2SSLTransport.SSLTransport object at 0x7f26805295d0> : ('dips', 'lbvobox306.cern.ch', 9197, 'DataManagement/FileCatalog') : timed
out:TimeoutError('timed out')
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Non-responding URL temporarily banned dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent INFO: Retry connection : 1 to dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Issue getting socket: <DIRAC.Core.DISET.private.Transports.M2SSLTransport.SSLTransport object at 0x7f26c0268790> : ('dips', 'lbvobox306.cern.ch', 9197, 'DataManagement/FileCatalog') : timed
out:TimeoutError('timed out')
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Non-responding URL temporarily banned dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent INFO: Retry connection : 1 to dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Issue getting socket: <DIRAC.Core.DISET.private.Transports.M2SSLTransport.SSLTransport object at 0x7f27405f1c10> : ('dips', 'lbvobox306.cern.ch', 9197, 'DataManagement/FileCatalog') : timed
out:TimeoutError('timed out')
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Issue getting socket: <DIRAC.Core.DISET.private.Transports.M2SSLTransport.SSLTransport object at 0x7f26e00de9d0> : ('dips', 'lbvobox306.cern.ch', 9197, 'DataManagement/FileCatalog') : timed
out:TimeoutError('timed out')
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Non-responding URL temporarily banned dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent INFO: Retry connection : 1 to dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Issue getting socket: <DIRAC.Core.DISET.private.Transports.M2SSLTransport.SSLTransport object at 0x7f274053f550> : ('dips', 'lbvobox306.cern.ch', 9197, 'DataManagement/FileCatalog') : timed
out:TimeoutError('timed out')
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Non-responding URL temporarily banned dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent INFO: Retry connection : 1 to dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Issue getting socket: <DIRAC.Core.DISET.private.Transports.M2SSLTransport.SSLTransport object at 0x7f26c00242d0> : ('dips', 'lbvobox306.cern.ch', 9197, 'DataManagement/FileCatalog') : timed
out:TimeoutError('timed out')
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Non-responding URL temporarily banned dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent INFO: Retry connection : 1 to dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Issue getting socket: <DIRAC.Core.DISET.private.Transports.M2SSLTransport.SSLTransport object at 0x7f2740515750> : ('dips', 'lbvobox306.cern.ch', 9197, 'DataManagement/FileCatalog') : timed
out:TimeoutError('timed out')
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Non-responding URL temporarily banned dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent INFO: Retry connection : 1 to dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Non-responding URL temporarily banned dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent INFO: Retry connection : 1 to dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Issue getting socket: <DIRAC.Core.DISET.private.Transports.M2SSLTransport.SSLTransport object at 0x7f2754211750> : ('dips', 'lbvobox306.cern.ch', 9197, 'DataManagement/FileCatalog') : timed
out:TimeoutError('timed out')
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Non-responding URL temporarily banned dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Issue getting socket: <DIRAC.Core.DISET.private.Transports.M2SSLTransport.SSLTransport object at 0x7f26806edb10> : ('dips', 'lbvobox306.cern.ch', 9197, 'DataManagement/FileCatalog') : timed
out:TimeoutError('timed out')
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Non-responding URL temporarily banned dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent INFO: Retry connection : 1 to dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent/submit/9186782/GRIDKA_MC-DST_CNAF-ARCHIVE INFO: Got GUID e903e244-910f-11ef-bcef-fa163e0a22e3
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent/treatOperation/9186784 INFO: FTS3Operation 9186784: Submitted job for 2 transfers
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent/treatOperation/9186782 INFO: FTS3Operation 9186782: Submitted job for 1 transfers
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Issue getting socket: <DIRAC.Core.DISET.private.Transports.M2SSLTransport.SSLTransport object at 0x7f2700160350> : ('dips', 'lbvobox306.cern.ch', 9197, 'DataManagement/FileCatalog') : timed
out:TimeoutError('timed out')
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Non-responding URL temporarily banned dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent INFO: Retry connection : 1 to dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent INFO: Retry connection : 1 to dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
chrisburr commented 2 hours ago

It looks like FTS3Agent is still stuck, with 48 threads waiting to acquire this lock:

https://github.com/DIRACGrid/DIRAC/blob/rel-v8r0/src/DIRAC/FrameworkSystem/private/standardLogging/Logging.py#L398

chrisburr commented 2 hours ago

And the reason they're waiting is because a 49th thread is stuck in this state:

Thread 774821 (idle): "Thread-81 (worker)"
    send (stomp/transport.py:639)
        Arguments:
            self: <Transport at 0x7f275dca8150>
            encoded_frame: <bytes at 0x7f2740790cb0>
    transmit (stomp/transport.py:282)
        Arguments:
            self: <Transport at 0x7f275dca8150>
            frame: <Frame at 0x7f275cd3ed10>
        Locals:
            listeners: [("protocol-listener", <StompConnection11 at 0x7f275dcaa510>)]
            _: "protocol-listener"
            listener: <StompConnection11 at 0x7f275dcaa510>
            lines: [<bytes at 0x7f27405e3bd0>, <bytes at 0x55adc06cd5b0>, <bytes at 0x7f26e009f630>, <bytes at 0x7f26e0784b20>, <bytes at 0x55adc06cd5b0>, <bytes at 0x7f27406e23a0>, <bytes at 0x7f275cb43720>]
            packed_frame: <bytes at 0x7f2740790cb0>
    send_frame (stomp/protocol.py:248)
        Arguments:
            self: <StompConnection11 at 0x7f275dcaa510>
            cmd: "SEND"
            headers: {"destination": "/queue/lhcb.dirac.logging", "content-length": 298}
            body: "{\"levelname\": \"INFO\", \"message\": \"Got GUID ca170fea-912e-11ef-af36-fa163e5a69c7\", \"asctime\": \"2024-10-23 11:06:01,727334\", \"componentname\": \"DataManagement/FTS3Agent\", \"varmessage\": \"\", \"customname\": \"\", \"local_name\": \"submit/9186818/IN2P3_MC-DST_NIPNE-07_MC-DST\", \"hostname\": \"lbvobox303.cern.ch\"}"
        Locals:
            frame: <Frame at 0x7f275cd3ed10>
    send (stomp/protocol.py:396)
        Arguments:
            self: <StompConnection11 at 0x7f275dcaa510>
            destination: "/queue/lhcb.dirac.logging"
            body: "{\"levelname\": \"INFO\", \"message\": \"Got GUID ca170fea-912e-11ef-af36-fa163e5a69c7\", \"asctime\": \"2024-10-23 11:06:01,727334\", \"componentname\": \"DataManagement/FTS3Agent\", \"varmessage\": \"\", \"customname\": \"\", \"local_name\": \"submit/9186818/IN2P3_MC-DST_NIPNE-07_MC-DST\", \"hostname\": \"lbvobox303.cern.ch\"}"
            content_type: None
            headers: {"destination": "/queue/lhcb.dirac.logging", "content-length": 298}
        Locals:
            keyword_headers: {}
    put (DIRAC/Resources/MessageQueue/StompMQConnector.py:166)
        Arguments:
            self: <StompMQConnector at 0x7f275dc5c410>
            message: {"levelname": "INFO", "message": "Got GUID ca170fea-912e-11ef-af36-fa163e5a69c7", "asctime": "2024-10-23 11:06:01,727334", "componentname": "DataManagement/FTS3Agent", "varmessage": "", "customname": "", "local_name": "submit/9186818/IN2P3_MC-DST_NIPNE-07_MC-DST", "hostname": "lbvobox303.cern.ch"}
            parameters: {"destination": "/queue/lhcb.dirac.logging"}
        Locals:
            log: <Logging at 0x7f275c275090>
            destination: "/queue/lhcb.dirac.logging"
    put (DIRAC/Resources/MessageQueue/MQProducer.py:20)
        Arguments:
            self: <MQProducer at 0x7f275dc64b50>
            msg: {"levelname": "INFO", "message": "Got GUID ca170fea-912e-11ef-af36-fa163e5a69c7", "asctime": "2024-10-23 11:06:01,727334", "componentname": "DataManagement/FTS3Agent", "varmessage": "", "customname": "", "local_name": "submit/9186818/IN2P3_MC-DST_NIPNE-07_MC-DST", "hostname": "lbvobox303.cern.ch"}
        Locals:
            result: {"OK": True, "Value": <StompMQConnector at 0x7f275dc5c410>}
            connector: <StompMQConnector at 0x7f275dc5c410>
    emit (DIRAC/FrameworkSystem/private/standardLogging/Handler/MessageQueueHandler.py:49)
        Arguments:
            self: <MessageQueueHandler at 0x7f275e0dc8d0>
            record: <LogRecord at 0x7f26e00ae890>
        Locals:
            strRecord: "{\"levelname\": \"INFO\", \"message\": \"Got GUID ca170fea-912e-11ef-af36-fa163e5a69c7\", \"asctime\": \"2024-10-23 11:06:01,727334\", \"componentname\": \"DataManagement/FTS3Agent\", \"varmessage\": \"\", \"customname\": \"\", \"local_name\": \"submit/9186818/IN2P3_MC-DST_NIPNE-07_MC-DST\", \"hostname\": \"lbvobox303.cern.ch\"}"
    handle (logging/__init__.py:978)
        Arguments:
            self: <MessageQueueHandler at 0x7f275e0dc8d0>
            record: <LogRecord at 0x7f26e00ae890>
        Locals:
            rv: True
    callHandlers (logging/__init__.py:1706)
        Arguments:
            self: <Logger at 0x7f275edd3e90>
            record: <LogRecord at 0x7f26e00ae890>
        Locals:
            c: <Logger at 0x7f275edd3e90>
            found: 2
            hdlr: <MessageQueueHandler at 0x7f275e0dc8d0>
    handle (logging/__init__.py:1644)
        Arguments:
            self: <Logger at 0x7f275edd3e90>
            record: <LogRecord at 0x7f26e00ae890>
    _log (logging/__init__.py:1634)
        Arguments:
            self: <Logger at 0x7f275edd3e90>
            level: 20
            msg: "%s"
            args: ("Got GUID ca170fea-912e-11ef-af36-fa163e5a69c7")
            exc_info: False
            extra: {"componentname": "DataManagement/FTS3Agent", "varmessage": "", "spacer": "", "customname": "", "headerIsShown": True, "timeStampIsShown": True, "contextIsShown": True, "threadIDIsShown": False, "color": True, "local_name": "submit/9186818/IN2P3_MC-DST_NIPNE-07_MC-DST"}
            stack_info: False
            stacklevel: 1
        Locals:
            sinfo: None
            fn: "(unknown file)"
            lno: 0
            func: "(unknown function)"
            record: <LogRecord at 0x7f26e00ae890>
    log (logging/__init__.py:1559)
        Arguments:
            self: <Logger at 0x7f275edd3e90>
            level: 20
            msg: "%s"
        Locals:
            args: ("Got GUID ca170fea-912e-11ef-af36-fa163e5a69c7")
            kwargs: {"exc_info": False, "extra": {"componentname": "DataManagement/FTS3Agent", "varmessage": "", "spacer": "", "customname": "", "headerIsShown": True, "timeStampIsShown": True, "contextIsShown": True, "threadIDIsShown": False, "color": True, "local_name": "submit/9186818/IN2P3_MC-DST_NIPNE-07_MC-DST"}}
    _createLogRecord (DIRAC/FrameworkSystem/private/standardLogging/Logging.py:421)
        Arguments:
....
chrisburr commented 2 hours ago

I think the problem is that we need to set a timeout in StompMQConnector.setupConnection.

ryuwd commented 2 hours ago

https://github.com/DIRACGrid/DIRAC/blob/integration/src/DIRAC/Resources/MessageQueue/StompMQConnector.py#L80-L87

I think it can be set here as timeout key

ryuwd commented 2 hours ago

the alternative is "heartbeats" https://jasonrbriggs.github.io/stomp.py/api.html#dealing-with-disconnects