Closed: sambles closed this issue 7 months ago
Duplicate issue; this happens when multiple parallel analyses are running. Also see:
Worker pods (and nodes) don't always scale up or down when an analysis starts or stops. There may be a logical explanation, but for now it feels unpredictable.
After split, connection error still happening:
2023-09-20 11:56:21,166 ERROR: Connection to oasis-server:8000 was closed
Traceback (most recent call last):
File "/usr/lib/python3.10/site-packages/websockets/legacy/protocol.py", line 956, in transfer_data
await asyncio.shield(self._put_message_waiter)
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/app/worker-controller/oasis_websocket.py", line 88, in watch
async for msg in next_msg(socket):
File "/app/worker-controller/oasis_websocket.py", line 54, in next_msg
while msg := await socket.recv():
File "/usr/lib/python3.10/site-packages/websockets/legacy/protocol.py", line 553, in recv
await self.ensure_open()
File "/usr/lib/python3.10/site-packages/websockets/legacy/protocol.py", line 930, in ensure_open
raise self.connection_closed_exc()
websockets.exceptions.ConnectionClosedError: sent 1011 (unexpected error) keepalive ping timeout; no close frame received
2023-09-20 11:56:21,168 ERROR: Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x7f0d9af7ebc0>
Server Side
DEBUG 2023-09-20 12:00:25,790 ws_protocol 6 140420406428096 Sent WebSocket packet to client for ['10.240.0.29', 60158]
DEBUG 2023-09-20 12:00:25,791 ws_protocol 6 140420406428096 WebSocket closed for ['10.240.0.29', 60158]
DEBUG 2023-09-20 12:00:25,794 connection 6 140420406428096 Cancelling waiter (<Future cancelled>, [None, None])
DEBUG 2023-09-20 12:00:25,794 util 6 140420406428096 Waiter future is already done <Future cancelled>
This means that the WebSocket connection suffered from excessive latency and was closed after reaching the timeout of websockets’ keepalive mechanism.
You can catch and handle [ConnectionClosed](https://websockets.readthedocs.io/en/stable/reference/exceptions.html#websockets.exceptions.ConnectionClosed) to prevent it from being logged.
There are two main reasons why latency may increase:
Poor network connectivity.
More traffic than the recipient can handle.
See the discussion of [timeouts](https://websockets.readthedocs.io/en/stable/topics/timeouts.html) for details.
If websockets’ default timeout of 20 seconds is too short for your use case, you can adjust it with the ping_timeout argument.
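One hedged sketch of how the worker-controller's watch loop could absorb these drops: catch the closed-connection error and reconnect with exponential backoff (in real use, `connect` would wrap `websockets.connect(uri, ping_timeout=60)` and the read loop, and `closed_exc` would be `websockets.exceptions.ConnectionClosed`; the stand-in exception and function names below are illustrative, not from the codebase).

```python
import asyncio


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    # Exponential backoff between reconnect attempts, capped at `cap` seconds.
    return min(cap, base * (2 ** attempt))


async def supervise(connect, closed_exc, max_attempts: int = 5):
    """Re-run `connect` when it raises `closed_exc`, backing off between tries.

    In the worker-controller this would wrap the websocket read loop so a
    keepalive timeout or peer reset triggers a reconnect instead of a crash.
    """
    for attempt in range(max_attempts):
        try:
            return await connect()
        except closed_exc:
            # `base` is tiny here only so the demo runs fast.
            await asyncio.sleep(backoff_delay(attempt, base=0.001))
    raise RuntimeError("gave up reconnecting")


# Demo with a stand-in for a flaky socket: fails twice, then succeeds.
class FakeClosed(Exception):
    pass


calls = {"n": 0}


async def flaky_connect():
    calls["n"] += 1
    if calls["n"] < 3:
        raise FakeClosed()
    return "connected"


result = asyncio.run(supervise(flaky_connect, FakeClosed))
```

Raising `ping_timeout` only widens the window; without a reconnect loop the controller still dies the first time latency exceeds whatever value is chosen.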
2023-09-20 12:41:24,491 ERROR: Connection to oasis-websocket:8001 was closed
Traceback (most recent call last):
File "/usr/lib/python3.10/site-packages/websockets/legacy/protocol.py", line 945, in transfer_data
message = await self.read_message()
File "/usr/lib/python3.10/site-packages/websockets/legacy/protocol.py", line 1015, in read_message
frame = await self.read_data_frame(max_size=self.max_size)
File "/usr/lib/python3.10/site-packages/websockets/legacy/protocol.py", line 1090, in read_data_frame
frame = await self.read_frame(max_size)
File "/usr/lib/python3.10/site-packages/websockets/legacy/protocol.py", line 1145, in read_frame
frame = await Frame.read(
File "/usr/lib/python3.10/site-packages/websockets/legacy/framing.py", line 70, in read
data = await reader(2)
File "/usr/lib/python3.10/asyncio/streams.py", line 697, in readexactly
raise self._exception
File "/usr/lib/python3.10/asyncio/selector_events.py", line 862, in _read_ready__data_received
data = self._sock.recv(self.max_size)
ConnectionResetError: [Errno 104] Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/app/worker-controller/oasis_websocket.py", line 97, in watch
async for msg in next_msg(socket):
File "/app/worker-controller/oasis_websocket.py", line 63, in next_msg
while msg := await socket.recv():
File "/usr/lib/python3.10/site-packages/websockets/legacy/protocol.py", line 553, in recv
await self.ensure_open()
File "/usr/lib/python3.10/site-packages/websockets/legacy/protocol.py", line 921, in ensure_open
raise self.connection_closed_exc()
websockets.exceptions.ConnectionClosedError: no close frame received or sent
The controller doesn't take other running analyses into account: if 20 worker nodes are executing and a new analysis is added that requests 1 node, the replica set is overwritten with 1,
killing 19 workers mid-execution.
2023-09-20 14:15:49,778 INFO: Deployment SUPPLIER-MODEL-1: updated replicas: 10
2023-09-20 14:15:50,700 INFO: Scale worker-SUPPLIER-MODEL-1 to 1 replicas
2023-09-20 14:15:50,763 INFO: Deployment SUPPLIER-MODEL-1: updated replicas: 1
2023-09-20 14:15:51,198 INFO: Scale worker-SUPPLIER-MODEL-1 to 10 replicas
2023-09-20 14:15:51,262 INFO: Deployment SUPPLIER-MODEL-1: updated replicas: 10
2023-09-20 14:15:53,172 INFO: Scale worker-SUPPLIER-MODEL-1 to 1 replicas
2023-09-20 14:15:53,235 INFO: Deployment SUPPLIER-MODEL-1: updated replicas: 1
2023-09-20 14:15:55,961 INFO: Scale worker-SUPPLIER-MODEL-1 to 10 replicas
2023-09-20 14:15:56,013 INFO: Deployment SUPPLIER-MODEL-1: updated replicas: 10
2023-09-20 14:15:56,900 INFO: Scale worker-SUPPLIER-MODEL-1 to 1 replicas
2023-09-20 14:15:56,956 INFO: Deployment SUPPLIER-MODEL-1: updated replicas: 1
2023-09-20 14:15:57,392 INFO: Scale worker-SUPPLIER-MODEL-1 to 10 replicas
2023-09-20 14:15:57,453 INFO: Deployment SUPPLIER-MODEL-1: updated replicas: 10
2023-09-20 14:15:57,956 INFO: Scale worker-SUPPLIER-MODEL-1 to 1 replicas
2023-09-20 14:15:58,020 INFO: Deployment SUPPLIER-MODEL-1: updated replicas: 1
2023-09-20 14:15:58,508 INFO: Scale worker-SUPPLIER-MODEL-1 to 10 replicas
2023-09-20 14:15:58,558 INFO: Deployment SUPPLIER-MODEL-1: updated replicas: 10
2023-09-20 14:15:59,006 INFO: Scale worker-SUPPLIER-MODEL-1 to 1 replicas
2023-09-20 14:15:59,140 INFO: Deployment SUPPLIER-MODEL-1: updated replicas: 1
2023-09-20 14:16:01,031 INFO: Scale worker-SUPPLIER-MODEL-1 to 10 replicas
2023-09-20 14:16:01,086 INFO: Deployment SUPPLIER-MODEL-1: updated replicas: 10
2023-09-20 14:16:01,578 INFO: Scale worker-SUPPLIER-MODEL-1 to 1 replicas
2023-09-20 14:16:01,636 INFO: Deployment SUPPLIER-MODEL-1: updated replicas: 1
2023-09-20 14:16:02,930 INFO: Scale worker-SUPPLIER-MODEL-1 to 10 replicas
2023-09-20 14:16:02,998 INFO: Deployment SUPPLIER-MODEL-1: updated replicas: 10
2023-09-20 14:16:03,447 INFO: Scale worker-SUPPLIER-MODEL-1 to 1 replicas
2023-09-20 14:16:03,515 INFO: Deployment SUPPLIER-MODEL-1: updated replicas: 1
2023-09-20 14:16:04,821 INFO: Scale worker-SUPPLIER-MODEL-1 to 10 replicas
2023-09-20 14:16:04,874 INFO: Deployment SUPPLIER-MODEL-1: updated replicas: 10
2023-09-20 14:16:05,326 INFO: Scale worker-SUPPLIER-MODEL-1 to 1 replicas
2023-09-20 14:16:05,387 INFO: Deployment SUPPLIER-MODEL-1: updated replicas: 1
2023-09-20 14:16:08,909 INFO: Scale worker-SUPPLIER-MODEL-1 to 10 replicas
2023-09-20 14:16:08,968 INFO: Deployment SUPPLIER-MODEL-1: updated replicas: 10
2023-09-20 14:16:10,353 INFO: Scale worker-SUPPLIER-MODEL-1 to 1 replicas
2023-09-20 14:16:10,541 INFO: Deployment SUPPLIER-MODEL-1: updated replicas: 1
2023-09-20 14:16:11,016 INFO: Scale worker-SUPPLIER-MODEL-1 to 10 replicas
2023-09-20 14:16:11,082 INFO: Deployment SUPPLIER-MODEL-1: updated replicas: 10
2023-09-20 14:16:11,522 INFO: Scale worker-SUPPLIER-MODEL-1 to 1 replicas
2023-09-20 14:16:11,576 INFO: Deployment SUPPLIER-MODEL-1: updated replicas: 1
2023-09-20 14:16:12,026 INFO: Scale worker-SUPPLIER-MODEL-1 to 10 replicas
2023-09-20 14:16:12,090 INFO: Deployment SUPPLIER-MODEL-1: updated replicas: 10
2023-09-20 14:16:12,545 INFO: Scale worker-SUPPLIER-MODEL-1 to 1 replicas
2023-09-20 14:16:12,610 INFO: Deployment SUPPLIER-MODEL-1: updated replicas: 1
2023-09-20 14:16:13,136 INFO: Scale worker-SUPPLIER-MODEL-1 to 10 replicas
2023-09-20 14:16:13,200 INFO: Deployment SUPPLIER-MODEL-1: updated replicas: 10
2023-09-20 14:16:16,742 INFO: Scale worker-SUPPLIER-MODEL-1 to 1 replicas
2023-09-20 14:16:16,801 INFO: Deployment SUPPLIER-MODEL-1: updated replicas: 1
2023-09-20 14:16:17,503 INFO: Scale worker-SUPPLIER-MODEL-1 to 10 replicas
2023-09-20 14:16:17,571 INFO: Deployment SUPPLIER-MODEL-1: updated replicas: 10
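A hedged sketch of the bookkeeping that would stop this flapping (class and method names are hypothetical, not from the worker-controller): track each analysis's replica request separately and scale the deployment to the combined total, so a new 1-node analysis cannot clobber 20 running workers. Summing is one policy; capping at a configured maximum, or taking the max, would also be reasonable.

```python
from typing import Dict


class ReplicaAggregator:
    """Track per-analysis worker requests; the deployment target is the
    sum over all running analyses, never just the newest request."""

    def __init__(self) -> None:
        self._requests: Dict[int, int] = {}

    def analysis_started(self, analysis_id: int, replicas: int) -> int:
        self._requests[analysis_id] = replicas
        return self.target()

    def analysis_finished(self, analysis_id: int) -> int:
        self._requests.pop(analysis_id, None)
        return self.target()

    def target(self) -> int:
        return sum(self._requests.values())


agg = ReplicaAggregator()
agg.analysis_started(7, 20)          # long-running analysis on 20 workers
target = agg.analysis_started(8, 1)  # new analysis must not shrink the pool
```

With this in place the two competing scale calls in the log above would resolve to one stable target instead of alternating between 10 and 1.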
The WebSocket's dispatched data is wrong; see "pending_count": 1, "queued_count": 0 and "running_count": 1 in the payload below. These should be totals of ALL running tasks across analyses, but for some reason they read as either 1 or 0.
--> Split this into a new issue; it looks like a deeper problem than just the worker-controller.
2023-09-22 14:44:17,322 DEBUG: Socket message: {
"time": "2023-09-22T13:44:17.318004Z",
"type": "queue_status.updated",
"status": "SUCCESS",
"content": [
{
"queue": {
"name": "OasisLMF-PiWind-1",
"pending_count": 1,
"worker_count": 13,
"queued_count": 0,
"running_count": 1,
"models": [
{
"id": 1,
"supplier_id": "OasisLMF",
"model_id": "PiWind",
"version_id": "1",
"created": "2023-09-22T10:32:19.111903Z",
"modified": "2023-09-22T11:43:58.855209Z",
"data_files": [],
"resource_file": "/v1/models/1/resource_file/",
"settings": "/v1/models/1/settings/",
"versions": "/v1/models/1/versions/",
"scaling_configuration": "/v1/models/1/scaling_configuration/",
"chunking_configuration": "/v1/models/1/chunking_configuration/",
"groups": []
}
]
},
"analyses": [
{
"analysis": {
"name": "Analysis_22092023-144351",
"id": 18,
"portfolio": 18,
"model": 1,
"status": "RUN_STARTED",
"priority": 4,
"analysis_chunks": 12,
"lookup_chunks": 1,
"sub_task_count": 17,
"queue_names": [
"OasisLMF-PiWind-1",
"celery"
],
"status_count": {
"TOTAL_IN_QUEUE": 15,
"TOTAL": 17,
"PENDING": 3,
"QUEUED": 0,
"STARTED": 12,
"COMPLETED": 2,
"CANCELLED": 0,
"ERROR": 0
}
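For illustration, a hedged sketch of what "totals across analyses" would mean for the queue-level fields, summing each analysis's status_count (field names are taken from the payload above; how the server actually builds these counts is assumed, not taken from its code):

```python
def queue_counts(analyses):
    """Derive queue-level counts by summing every analysis's status_count."""
    totals = {"pending_count": 0, "queued_count": 0, "running_count": 0}
    for entry in analyses:
        counts = entry["analysis"]["status_count"]
        totals["pending_count"] += counts.get("PENDING", 0)
        totals["queued_count"] += counts.get("QUEUED", 0)
        totals["running_count"] += counts.get("STARTED", 0)
    return totals


# Using the single analysis from the payload above (PENDING 3, QUEUED 0,
# STARTED 12), the queue should report 3 pending and 12 running, not 1/0.
sample = [{"analysis": {"status_count": {"PENDING": 3, "QUEUED": 0, "STARTED": 12}}}]
totals = queue_counts(sample)
```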
Issue Description
Generate losses running as:
Node Sizes
Standard_D8s_v3
Standard_E8ads_v5
(limited to 4 chunks per worker)
Example data / logs