OasisLMF / OasisPlatform

Loss modelling platform.
BSD 3-Clause "New" or "Revised" License

Worker Controller crashing under heavy load #869

Closed (sambles closed this issue 7 months ago)

sambles commented 1 year ago

Issue Description

Generate losses running as:

Node Sizes

Example data / logs

2023-08-04 08:22:35,549 INFO: Analysis for model xxx-xxx-xxx is running                                                     

2023-08-04 08:22:36,125 INFO: Total desired number of workers: 35
2023-08-04 08:22:36,126 ERROR: Connection to oasis-websocket:8001 was closed
Traceback (most recent call last):
  File "/usr/lib/python3.10/site-packages/websockets/legacy/protocol.py", line 956, in transfer_data
    await asyncio.shield(self._put_message_waiter)
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/app/worker-controller/oasis_websocket.py", line 87, in watch
    async for msg in next_msg(socket):
  File "/app/worker-controller/oasis_websocket.py", line 53, in next_msg
    while msg := await socket.recv():
  File "/usr/lib/python3.10/site-packages/websockets/legacy/protocol.py", line 553, in recv
    await self.ensure_open()
  File "/usr/lib/python3.10/site-packages/websockets/legacy/protocol.py", line 921, in ensure_open
    raise self.connection_closed_exc()
websockets.exceptions.ConnectionClosedError: sent 1011 (unexpected error) keepalive ping timeout; no close frame received
Stream closed EOF for oasis-enterprise-preprod/oasis-worker-controller-xxxx (main)
sambles commented 1 year ago

Issue reproduced. It also happens when multiple parallel analyses are running. Also look at:

Worker pods (and nodes) don't always scale up or down when an analysis starts or stops. There may be a logical explanation but for now it feels unpredictable.

sambles commented 1 year ago

After the split, the connection error is still happening:

2023-09-20 11:56:21,166 ERROR: Connection to oasis-server:8000 was closed
Traceback (most recent call last):
  File "/usr/lib/python3.10/site-packages/websockets/legacy/protocol.py", line 956, in transfer_data
    await asyncio.shield(self._put_message_waiter)
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/app/worker-controller/oasis_websocket.py", line 88, in watch
    async for msg in next_msg(socket):
  File "/app/worker-controller/oasis_websocket.py", line 54, in next_msg
    while msg := await socket.recv():
  File "/usr/lib/python3.10/site-packages/websockets/legacy/protocol.py", line 553, in recv
    await self.ensure_open()
  File "/usr/lib/python3.10/site-packages/websockets/legacy/protocol.py", line 930, in ensure_open
    raise self.connection_closed_exc()
websockets.exceptions.ConnectionClosedError: sent 1011 (unexpected error) keepalive ping timeout; no close frame received
2023-09-20 11:56:21,168 ERROR: Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x7f0d9af7ebc0>

Server Side

DEBUG 2023-09-20 12:00:25,790 ws_protocol 6 140420406428096 Sent WebSocket packet to client for ['10.240.0.29', 60158]
DEBUG 2023-09-20 12:00:25,791 ws_protocol 6 140420406428096 WebSocket closed for ['10.240.0.29', 60158]
DEBUG 2023-09-20 12:00:25,794 connection 6 140420406428096 Cancelling waiter (<Future cancelled>, [None, None])
DEBUG 2023-09-20 12:00:25,794 util 6 140420406428096 Waiter future is already done <Future cancelled>
sambles commented 1 year ago

https://websockets.readthedocs.io/en/stable/faq/common.html#what-does-connectionclosederror-sent-1011-unexpected-error-keepalive-ping-timeout-no-close-frame-received-mean

it means that the WebSocket connection suffered from excessive latency and was closed after reaching the timeout of websockets’ keepalive mechanism.

You can catch and handle [ConnectionClosed](https://websockets.readthedocs.io/en/stable/reference/exceptions.html#websockets.exceptions.ConnectionClosed) to prevent it from being logged.

There are two main reasons why latency may increase:

    Poor network connectivity.

    More traffic than the recipient can handle.

See the discussion of [timeouts](https://websockets.readthedocs.io/en/stable/topics/timeouts.html) for details.

If websockets’ default timeout of 20 seconds is too short for your use case, you can adjust it with the ping_timeout argument.
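The FAQ's advice (catch ConnectionClosed, tune the keepalive) amounts to wrapping the watch loop in a reconnect-with-backoff pattern. A minimal stdlib sketch, not the actual worker-controller code: `watch_with_reconnect`, `connect_fn`, and `handle_msg` are hypothetical names, and in the real controller `connect_fn` would be `websockets.connect(url, ping_timeout=...)` with the except clause also naming `websockets.exceptions.ConnectionClosed`.

```python
import asyncio
import random

async def watch_with_reconnect(connect_fn, handle_msg, max_backoff=60):
    # Treat a dropped socket as routine: retry with capped exponential
    # backoff instead of letting the controller crash.
    backoff = 1
    while True:
        try:
            async with connect_fn() as socket:
                backoff = 1  # healthy connection: reset the backoff
                async for msg in socket:
                    handle_msg(msg)
        except ConnectionError:
            # With the websockets library this clause would also catch
            # websockets.exceptions.ConnectionClosed.
            await asyncio.sleep(backoff + random.random())
            backoff = min(backoff * 2, max_backoff)
```

The key design point is that a 1011 close becomes a delay rather than a fatal error, and the jittered backoff avoids hammering the server while it is overloaded.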
sambles commented 1 year ago

Code 1011 - WS latency check failed
Code 1009 - WS max message size of 1 MiB exceeded

sambles commented 1 year ago
2023-09-20 12:41:24,491 ERROR: Connection to oasis-websocket:8001 was closed
Traceback (most recent call last):
  File "/usr/lib/python3.10/site-packages/websockets/legacy/protocol.py", line 945, in transfer_data
    message = await self.read_message()
  File "/usr/lib/python3.10/site-packages/websockets/legacy/protocol.py", line 1015, in read_message
    frame = await self.read_data_frame(max_size=self.max_size)
  File "/usr/lib/python3.10/site-packages/websockets/legacy/protocol.py", line 1090, in read_data_frame
    frame = await self.read_frame(max_size)
  File "/usr/lib/python3.10/site-packages/websockets/legacy/protocol.py", line 1145, in read_frame
    frame = await Frame.read(
  File "/usr/lib/python3.10/site-packages/websockets/legacy/framing.py", line 70, in read
    data = await reader(2)
  File "/usr/lib/python3.10/asyncio/streams.py", line 697, in readexactly
    raise self._exception
  File "/usr/lib/python3.10/asyncio/selector_events.py", line 862, in _read_ready__data_received
    data = self._sock.recv(self.max_size)
ConnectionResetError: [Errno 104] Connection reset by peer

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/app/worker-controller/oasis_websocket.py", line 97, in watch
    async for msg in next_msg(socket):
  File "/app/worker-controller/oasis_websocket.py", line 63, in next_msg
    while msg := await socket.recv():
  File "/usr/lib/python3.10/site-packages/websockets/legacy/protocol.py", line 553, in recv
    await self.ensure_open()
  File "/usr/lib/python3.10/site-packages/websockets/legacy/protocol.py", line 921, in ensure_open
    raise self.connection_closed_exc()
websockets.exceptions.ConnectionClosedError: no close frame received or sent
sambles commented 1 year ago

This check: https://github.com/OasisLMF/OasisPlatform/blob/a9ca30ef290f1ed009461b3c3ad1001d19e98f94/kubernetes/worker-controller/src/autoscaler_rules.py#L37-L43

This doesn't take other running analyses into account. If 20 worker nodes are executing and a new analysis is added that requests 1 node, the replica set is overwritten with 1, killing 19 workers mid-execution.

2023-09-20 14:15:49,778 INFO: Deployment SUPPLIER-MODEL-1: updated replicas: 10
2023-09-20 14:15:50,700 INFO: Scale worker-SUPPLIER-MODEL-1 to 1 replicas
2023-09-20 14:15:50,763 INFO: Deployment SUPPLIER-MODEL-1: updated replicas: 1
2023-09-20 14:15:51,198 INFO: Scale worker-SUPPLIER-MODEL-1 to 10 replicas
2023-09-20 14:15:51,262 INFO: Deployment SUPPLIER-MODEL-1: updated replicas: 10
2023-09-20 14:15:53,172 INFO: Scale worker-SUPPLIER-MODEL-1 to 1 replicas
2023-09-20 14:15:53,235 INFO: Deployment SUPPLIER-MODEL-1: updated replicas: 1                                                                                                                                 
2023-09-20 14:15:55,961 INFO: Scale worker-SUPPLIER-MODEL-1 to 10 replicas
2023-09-20 14:15:56,013 INFO: Deployment SUPPLIER-MODEL-1: updated replicas: 10
2023-09-20 14:15:56,900 INFO: Scale worker-SUPPLIER-MODEL-1 to 1 replicas
2023-09-20 14:15:56,956 INFO: Deployment SUPPLIER-MODEL-1: updated replicas: 1
2023-09-20 14:15:57,392 INFO: Scale worker-SUPPLIER-MODEL-1 to 10 replicas
2023-09-20 14:15:57,453 INFO: Deployment SUPPLIER-MODEL-1: updated replicas: 10
2023-09-20 14:15:57,956 INFO: Scale worker-SUPPLIER-MODEL-1 to 1 replicas
2023-09-20 14:15:58,020 INFO: Deployment SUPPLIER-MODEL-1: updated replicas: 1
2023-09-20 14:15:58,508 INFO: Scale worker-SUPPLIER-MODEL-1 to 10 replicas
2023-09-20 14:15:58,558 INFO: Deployment SUPPLIER-MODEL-1: updated replicas: 10
2023-09-20 14:15:59,006 INFO: Scale worker-SUPPLIER-MODEL-1 to 1 replicas
2023-09-20 14:15:59,140 INFO: Deployment SUPPLIER-MODEL-1: updated replicas: 1
2023-09-20 14:16:01,031 INFO: Scale worker-SUPPLIER-MODEL-1 to 10 replicas
2023-09-20 14:16:01,086 INFO: Deployment SUPPLIER-MODEL-1: updated replicas: 10
2023-09-20 14:16:01,578 INFO: Scale worker-SUPPLIER-MODEL-1 to 1 replicas
2023-09-20 14:16:01,636 INFO: Deployment SUPPLIER-MODEL-1: updated replicas: 1
2023-09-20 14:16:02,930 INFO: Scale worker-SUPPLIER-MODEL-1 to 10 replicas
2023-09-20 14:16:02,998 INFO: Deployment SUPPLIER-MODEL-1: updated replicas: 10
2023-09-20 14:16:03,447 INFO: Scale worker-SUPPLIER-MODEL-1 to 1 replicas
2023-09-20 14:16:03,515 INFO: Deployment SUPPLIER-MODEL-1: updated replicas: 1
2023-09-20 14:16:04,821 INFO: Scale worker-SUPPLIER-MODEL-1 to 10 replicas
2023-09-20 14:16:04,874 INFO: Deployment SUPPLIER-MODEL-1: updated replicas: 10
2023-09-20 14:16:05,326 INFO: Scale worker-SUPPLIER-MODEL-1 to 1 replicas
2023-09-20 14:16:05,387 INFO: Deployment SUPPLIER-MODEL-1: updated replicas: 1
2023-09-20 14:16:08,909 INFO: Scale worker-SUPPLIER-MODEL-1 to 10 replicas
2023-09-20 14:16:08,968 INFO: Deployment SUPPLIER-MODEL-1: updated replicas: 10
2023-09-20 14:16:10,353 INFO: Scale worker-SUPPLIER-MODEL-1 to 1 replicas
2023-09-20 14:16:10,541 INFO: Deployment SUPPLIER-MODEL-1: updated replicas: 1
2023-09-20 14:16:11,016 INFO: Scale worker-SUPPLIER-MODEL-1 to 10 replicas
2023-09-20 14:16:11,082 INFO: Deployment SUPPLIER-MODEL-1: updated replicas: 10
2023-09-20 14:16:11,522 INFO: Scale worker-SUPPLIER-MODEL-1 to 1 replicas
2023-09-20 14:16:11,576 INFO: Deployment SUPPLIER-MODEL-1: updated replicas: 1
2023-09-20 14:16:12,026 INFO: Scale worker-SUPPLIER-MODEL-1 to 10 replicas
2023-09-20 14:16:12,090 INFO: Deployment SUPPLIER-MODEL-1: updated replicas: 10
2023-09-20 14:16:12,545 INFO: Scale worker-SUPPLIER-MODEL-1 to 1 replicas
2023-09-20 14:16:12,610 INFO: Deployment SUPPLIER-MODEL-1: updated replicas: 1
2023-09-20 14:16:13,136 INFO: Scale worker-SUPPLIER-MODEL-1 to 10 replicas
2023-09-20 14:16:13,200 INFO: Deployment SUPPLIER-MODEL-1: updated replicas: 10
2023-09-20 14:16:16,742 INFO: Scale worker-SUPPLIER-MODEL-1 to 1 replicas
2023-09-20 14:16:16,801 INFO: Deployment SUPPLIER-MODEL-1: updated replicas: 1
2023-09-20 14:16:17,503 INFO: Scale worker-SUPPLIER-MODEL-1 to 10 replicas
2023-09-20 14:16:17,571 INFO: Deployment SUPPLIER-MODEL-1: updated replicas: 10
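The flapping above suggests the fix is to aggregate demand rather than overwrite it. A hypothetical sketch, not the actual autoscaler_rules.py logic: the field names `requested_workers` and the status strings are illustrative, and the deployment's replica count is derived from every in-flight analysis on the model instead of just the most recent scaling request.

```python
def desired_replicas(analyses, max_workers=100):
    """Combined worker demand for one model deployment.

    Summing over every in-flight analysis (capped at max_workers) means a
    new 1-node analysis raises the replica count rather than overwriting
    it and killing workers mid-execution.
    """
    total = sum(
        a["requested_workers"]
        for a in analyses
        if a["status"] in ("RUN_QUEUED", "RUN_STARTED")
    )
    return min(total, max_workers)
```

Whether the right aggregate is a sum or a max depends on the scaling strategy; either way, a finished analysis only lowers the count by its own contribution.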
sambles commented 1 year ago

The websocket's dispatched data is wrong, see "pending_count": 1, "queued_count": 0 and "running_count": 1

https://github.com/OasisLMF/OasisPlatform/blob/a9ca30ef290f1ed009461b3c3ad1001d19e98f94/src/server/oasisapi/queues/utils.py#L86-L121

These should be the totals of ALL running tasks across analyses; for some reason they read as either 1 or 0.

--> Split this into new issue, looks like a deeper problem than just worker-controller
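As a sanity check, the queue-level counts can be recomputed from the per-analysis `status_count` blocks carried in the same payload (structure as in the socket message below). `queue_task_totals` is a hypothetical helper for illustration, not code from queues/utils.py:

```python
def queue_task_totals(queue_entry):
    # Sum task states across ALL analyses on the queue; the queue-level
    # pending/queued/running counts should match these totals instead of
    # reading as 0 or 1.
    totals = {"PENDING": 0, "QUEUED": 0, "STARTED": 0}
    for item in queue_entry["analyses"]:
        counts = item["analysis"]["status_count"]
        for state in totals:
            totals[state] += counts.get(state, 0)
    return totals
```

Running this over the message below would give PENDING 3, QUEUED 0, STARTED 12 for the single analysis, which does not match the dispatched "pending_count": 1 / "running_count": 1.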

2023-09-22 14:44:17,322 DEBUG: Socket message: {
    "time": "2023-09-22T13:44:17.318004Z",
    "type": "queue_status.updated",
    "status": "SUCCESS",
    "content": [
        {
            "queue": {
                "name": "OasisLMF-PiWind-1",
                "pending_count": 1,
                "worker_count": 13,
                "queued_count": 0,
                "running_count": 1,
                "models": [
                    {
                        "id": 1,
                        "supplier_id": "OasisLMF",
                        "model_id": "PiWind",
                        "version_id": "1",
                        "created": "2023-09-22T10:32:19.111903Z",
                        "modified": "2023-09-22T11:43:58.855209Z",
                        "data_files": [],
                        "resource_file": "/v1/models/1/resource_file/",
                        "settings": "/v1/models/1/settings/",
                        "versions": "/v1/models/1/versions/",
                        "scaling_configuration": "/v1/models/1/scaling_configuration/",
                        "chunking_configuration": "/v1/models/1/chunking_configuration/",
                        "groups": []
                    }
                ]
            },
            "analyses": [
                {
                    "analysis": {
                        "name": "Analysis_22092023-144351",
                        "id": 18,
                        "portfolio": 18,
                        "model": 1,
                        "status": "RUN_STARTED",
                        "priority": 4,
                        "analysis_chunks": 12,
                        "lookup_chunks": 1,
                        "sub_task_count": 17,
                        "queue_names": [
                            "OasisLMF-PiWind-1",
                            "celery"
                        ],
                        "status_count": {
                            "TOTAL_IN_QUEUE": 15,
                            "TOTAL": 17,
                            "PENDING": 3,
                            "QUEUED": 0,
                            "STARTED": 12,
                            "COMPLETED": 2,
                            "CANCELLED": 0,
                            "ERROR": 0
                        }