ITISFoundation / osparc-simcore

🐼 osparc-simcore simulation framework
https://osparc.io
MIT License

Garbage collector not working on aws-prod #3975

Closed mrnicegyu11 closed 1 year ago

mrnicegyu11 commented 1 year ago

Is there an existing issue for this?

Current Behavior

For some days, the garbage collector has only been logging errors: there are no more "regular" logs, and garbage collection does not seem to happen. The errors are of this kind:

WARNING: [2023-03-15 12:26:49,577/MainProcess] [servicelib.utils:logged_gather(122)]  -  Error in 1-th concurrent task <coroutine object _remove_single_orphaned_service at 0x7fa1887d8ec0>: 'f2d26379-e6fc-50dd-956a-3f4f67d2542c'
WARNING: [2023-03-15 12:26:49,577/MainProcess] [servicelib.utils:logged_gather(122)]  -  Error in 2-th concurrent task <coroutine object _remove_single_orphaned_service at 0x7fa1886863c0>: '843fbe7b-2e50-56b3-9ad9-752de771bf21'
WARNING: [2023-03-15 12:26:49,577/MainProcess] [servicelib.utils:logged_gather(122)]  -  Error in 3-th concurrent task <coroutine object _remove_single_orphaned_service at 0x7fa188e67ac0>: '41d7bcb2-af42-5104-b662-5c66e747bbf4'
WARNING: [2023-03-15 12:26:49,577/MainProcess] [servicelib.utils:logged_gather(122)]  -  Error in 4-th concurrent task <coroutine object _remove_single_orphaned_service at 0x7fa1887de0c0>: '67c34fc6-fa9f-5eaf-bc0d-8012117707cc'
WARNING: [2023-03-15 12:26:49,577/MainProcess] [servicelib.utils:logged_gather(122)]  -  Error in 5-th concurrent task <coroutine object _remove_single_orphaned_service at 0x7fa1887de8c0>: 'b57f4e59-13d0-476d-9954-9855adf657b7'
WARNING: [2023-03-15 12:26:49,577/MainProcess] [servicelib.utils:logged_gather(122)]  -  Error in 6-th concurrent task <coroutine object _remove_single_orphaned_service at 0x7fa188b57c40>: 'fd123ae9-3242-5eb1-bf02-c04b942f2992'
WARNING: [2023-03-15 12:26:49,577/MainProcess] [servicelib.utils:logged_gather(122)]  -  Error in 7-th concurrent task <coroutine object _remove_single_orphaned_service at 0x7fa188b57e40>: '7e135c19-c89d-5081-bb90-d07ee9d3dc26'
WARNING: [2023-03-15 12:26:49,577/MainProcess] [servicelib.utils:logged_gather(122)]  -  Error in 14-th concurrent task <coroutine object _remove_single_orphaned_service at 0x7fa1885ce4c0>: '0c417ffb-8d03-4b68-9ead-dbef12a4af86'
WARNING: [2023-03-15 12:27:21,783/MainProcess] [servicelib.utils:logged_gather(122)]  -  Error in 1-th concurrent task <coroutine object _remove_single_orphaned_service at 0x7fa1887d8ec0>: 'f2d26379-e6fc-50dd-956a-3f4f67d2542c'
WARNING: [2023-03-15 12:27:21,783/MainProcess] [servicelib.utils:logged_gather(122)]  -  Error in 2-th concurrent task <coroutine object _remove_single_orphaned_service at 0x7fa188b57e40>: '843fbe7b-2e50-56b3-9ad9-752de771bf21'
WARNING: [2023-03-15 12:27:21,783/MainProcess] [servicelib.utils:logged_gather(122)]  -  Error in 3-th concurrent task <coroutine object _remove_single_orphaned_service at 0x7fa18879eb40>: '41d7bcb2-af42-5104-b662-5c66e747bbf4'
WARNING: [2023-03-15 12:27:21,783/MainProcess] [servicelib.utils:logged_gather(122)]  -  Error in 4-th concurrent task <coroutine object _remove_single_orphaned_service at 0x7fa1887de8c0>: '67c34fc6-fa9f-5eaf-bc0d-8012117707cc'
WARNING: [2023-03-15 12:27:21,783/MainProcess] [servicelib.utils:logged_gather(122)]  -  Error in 5-th concurrent task <coroutine object _remove_single_orphaned_service at 0x7fa1887de0c0>: 'b57f4e59-13d0-476d-9954-9855adf657b7'
WARNING: [2023-03-15 12:27:21,783/MainProcess] [servicelib.utils:logged_gather(122)]  -  Error in 6-th concurrent task <coroutine object _remove_single_orphaned_service at 0x7fa1886073c0>: 'fd123ae9-3242-5eb1-bf02-c04b942f2992'
WARNING: [2023-03-15 12:27:21,783/MainProcess] [servicelib.utils:logged_gather(122)]  -  Error in 7-th concurrent task <coroutine object _remove_single_orphaned_service at 0x7fa1886372c0>: '7e135c19-c89d-5081-bb90-d07ee9d3dc26'
WARNING: [2023-03-15 12:27:21,783/MainProcess] [servicelib.utils:logged_gather(122)]  -  Error in 14-th concurrent task <coroutine object _remove_single_orphaned_service at 0x7fa18872e940>: '0c417ffb-8d03-4b68-9ead-dbef12a4af86'
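
Each warning above ends with nothing but a quoted UUID. That is how Python renders a KeyError when only str(exception) is logged, so one possible reading is that a dictionary lookup by node ID fails inside each task. The issue does not confirm the exception type, so this is only a guess. The sketch below reproduces the message shape under that assumption; remove_service and gather_and_log are hypothetical stand-ins, not the servicelib implementation.

```python
import asyncio
import logging

logger = logging.getLogger("gc_sketch")


async def remove_service(node_id: str, registry: dict) -> str:
    # Hypothetical stand-in for a per-service cleanup task:
    # popping an unknown node_id raises KeyError(node_id).
    return registry.pop(node_id)


async def gather_and_log(*tasks) -> list:
    # Hypothetical helper (NOT servicelib's logged_gather): run all tasks,
    # keep going on failure, and log each error using only str(exception).
    results = await asyncio.gather(*tasks, return_exceptions=True)
    messages = []
    for index, result in enumerate(results):
        if isinstance(result, BaseException):
            message = f"Error in {index}-th concurrent task: {result}"
            logger.warning(message)
            messages.append(message)
    return messages


def demo() -> list:
    # One known node and one node ID taken from the log lines above.
    registry = {"known-node": "running"}
    return asyncio.run(
        gather_and_log(
            remove_service("known-node", registry),
            remove_service("f2d26379-e6fc-50dd-956a-3f4f67d2542c", registry),
        )
    )
```

If this reading is right, logging repr(exception) or the full traceback instead of str(exception) would show the exception type and the failing lookup directly.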

The graylog queries that can be used to check if this happens are:

Further evidence that garbage collection is not working: in Prometheus one can see an s4l-core-lite service that has been running for many days. To observe this, use the PromQL query:

container_memory_usage_bytes{image=~"^.*[.osparc.io].*/simcore/services/dynamic/s4l-core-lite.*$",name=~"dy-sidecar-b57f4e59-13d0-476d-9954-9855adf657b7.*"}

A comparison with the Redis keys, which correspond to open browser tabs or sessions, shows that for some days there has been no session key for the user who owns the project containing this s4l service, so the garbage collector should have kicked in:

redis_key_value{key=~"^user_id=2:.*$"}
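
The cross-check described here (a service is still running, but its owner has no session key) can be sketched as a small set comparison. This is only an illustration of the reasoning, not the garbage collector's actual logic: the function names, the input shapes, and the example key suffix are assumptions, and only the "user_id=<n>:" key prefix is taken from the query above.

```python
import re

# Assumes session keys start with "user_id=<number>:", as in the Redis query.
SESSION_KEY = re.compile(r"^user_id=(\d+):")


def users_with_sessions(redis_keys):
    # Collect the user IDs that still have at least one session key.
    users = set()
    for key in redis_keys:
        match = SESSION_KEY.match(key)
        if match:
            users.add(int(match.group(1)))
    return users


def orphan_candidates(running_services, redis_keys):
    # running_services maps node ID -> owning user ID (hypothetical shape).
    # A service is a candidate for collection if its owner has no session.
    active_users = users_with_sessions(redis_keys)
    return sorted(
        node_id
        for node_id, owner in running_services.items()
        if owner not in active_users
    )


# Illustrative data: user 2 owns the long-running service and has no key.
running = {
    "b57f4e59-13d0-476d-9954-9855adf657b7": 2,
    "hypothetical-node": 3,
}
keys = ["user_id=3:example-suffix"]  # made-up key for user 3
candidates = orphan_candidates(running, keys)
```

With this made-up data, only the service owned by user 2 is reported, which matches the situation described in the issue.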

Expected Behavior

Garbage collection works

Steps To Reproduce

The GC does not work on aws-prod

Anything else?

This affects production and may cause it to not run smoothly if services accumulate. I would treat this as high urgency.

mrnicegyu11 commented 1 year ago

CC @mguidon