ITISFoundation / osparc-issues

🐼 issue-only repo for the osparc project

Freshly created iSeg nodes fail (on osparc.io) #718

Closed: elisabettai closed this issue 2 years ago

elisabettai commented 2 years ago

Reporting an issue for @newton1985: it happened for 2 iSeg nodes on osparc.io on 15-09, between 10:06 and 10:27 UTC. Both are newly created iSeg nodes.

For the first node, this is what I got from Graylog (searching for "909e0a1d-bd83-4d26-8633-a51e671c18bc" within the time range above):

director-v2:

ERROR:simcore_service_director_v2.modules.dynamic_sidecar.api_client._base:Unexpected error with 'e.request=<Request('GET', 'http://dy-sidecar_909e0a1d-bd83-4d26-8633-a51e671c18bc:8000/health')>': e=ConnectError('[Errno -2] Name or service not known'), (attempt [1/4])

Then it retries a few times and fails with the following (I am omitting the full message, which can be retrieved with OEC:140448794075776):

Timed out while searching for an assigned NodeID for service_id=itlluyq1i6haitidkmsfpgc3h.
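This timeout suggests that the swarm task for the sidecar service never got assigned to a node. A minimal, hypothetical sketch of such a wait loop is below, assuming tasks are dicts shaped like Docker Engine API task objects (with an optional `NodeID` key and a `Status.State` field). `wait_for_assigned_node` and `fetch_tasks` are illustrative names, not director-v2's actual code. A task that cannot be scheduled, for instance because no node has enough free resources, never gets a `NodeID`, so the loop ends in a timeout.

```python
import time
from typing import Any, Callable, Dict, List


def wait_for_assigned_node(
    fetch_tasks: Callable[[], List[Dict[str, Any]]],
    timeout_s: float,
    poll_interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll a service's tasks until one of them is assigned to a node.

    Hypothetical helper: `fetch_tasks` stands in for whatever lists the
    swarm tasks of the service (e.g. a call to the Docker API).
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        for task in fetch_tasks():
            # A task that is still pending (not scheduled) has no NodeID yet.
            node_id = task.get("NodeID")
            if node_id:
                return node_id
        sleep(poll_interval_s)
    raise TimeoutError("Timed out while searching for an assigned NodeID")


# Example with a fake task source: two polls without a node, then an assignment.
responses = iter([[], [{"Status": {"State": "pending"}}], [{"NodeID": "node-1"}]])
print(wait_for_assigned_node(lambda: next(responses), timeout_s=5, sleep=lambda _: None))
# prints node-1
```

Injecting `sleep` (and the task source) keeps the loop testable without a real swarm.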

Then:

WARNING: [2022-09-15 10:08:17,282/MainProcess] [simcore_service_webserver.director_v2_core_base:log_it(51)]  -  Retrying None in 0.3491334177798098 seconds as it raised DirectorServiceError: Unexpected error: director-v2 returned 404, reason "{'data': {'status': 404, 'message': 'The service with uuid 909e0a1d-bd83-4d26-8633-a51e671c18bc was not found'}}" after calling URL('http://production_director-v2:8000/v2/dynamic_services/909e0a1d-bd83-4d26-8633-a51e671c18bc').
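On Linux (glibc), `[Errno -2] Name or service not known` is the resolver error for a hostname that cannot be resolved: at that moment director-v2 could not find the sidecar under the name `dy-sidecar_<node_id>` used in the `/health` URL. The sketch below uses only the Python standard library to check whether such a name resolves and to retry a call a fixed number of times, mirroring the `attempt [1/4]` counter in the logs. It is an illustration under these assumptions, not the actual director-v2 client (which, judging from the `ConnectError` and `e.request` in the log, appears to use `httpx`); `sidecar_hostname`, `hostname_resolves` and `call_with_retries` are hypothetical names.

```python
import socket
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def sidecar_hostname(node_id: str) -> str:
    # Naming pattern taken from the logs: http://dy-sidecar_<node_id>:8000/...
    return f"dy-sidecar_{node_id}"


def hostname_resolves(hostname: str, port: int = 8000) -> bool:
    # getaddrinfo raises socket.gaierror when the name cannot be resolved;
    # on Linux/glibc an unknown name gives "[Errno -2] Name or service not known".
    try:
        socket.getaddrinfo(hostname, port)
    except socket.gaierror:
        return False
    return True


def call_with_retries(
    call: Callable[[], T], attempts: int = 4, delay_s: float = 1.0
) -> T:
    # Retry connection-level failures (OSError covers socket.gaierror and
    # ConnectionError), printing an attempt counter like the logs above.
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except OSError as err:
            print(f"Unexpected error: {err!r}, (attempt [{attempt}/{attempts}])")
            if attempt == attempts:
                raise  # give up: propagate the last connection error
            time.sleep(delay_s)
    raise RuntimeError("unreachable")
```

For example, `hostname_resolves(sidecar_hostname("909e0a1d-bd83-4d26-8633-a51e671c18bc"))` run from a container on the same network would show whether the sidecar name is resolvable at all, independently of HTTP.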

The other iSeg node (89e852f7-6501-4a7d-ae26-89f45ea5e52c) fails in a similar manner.

FYI @sanderegg. Maybe this should already have been fixed by PR #3218.

newton1985 commented 2 years ago

@elisabettai @sanderegg, just for additional context: I tried instantiating at least 20 new iSeg nodes on 15.9.22 between 10:00 CET and 16:30 CET, so the problem is not isolated to the two cases referred to above.

elisabettai commented 2 years ago

Also got a support request for a sim4life-dy service on 16-09 around 16:00 (Zurich time). The user reported that she could create the sim4life-dy node and the application started loading, but then she got a white screen (see the ticket on Fogbugz).

This is what I found in Graylog (study id: a6d3d03a-14bd-11ed-a399-02420a0b0527, node id: a659701d-75c8-4219-95de-3edd9d0b1d88):

ERROR:simcore_service_director_v2.modules.dynamic_sidecar.api_client._base:Unexpected error with 'e.request=<Request('GET', 'http://dy-sidecar_a659701d-75c8-4219-95de-3edd9d0b1d88:8000/health')>': e=ConnectError('[Errno -2] Name or service not known'), (attempt [1/4])

ERROR:simcore_service_director_v2.modules.dynamic_sidecar.api_client._base:Unexpected error with 'e.request=<Request('GET', 'http://dy-sidecar_a659701d-75c8-4219-95de-3edd9d0b1d88:8000/v1/containers?only_status=true')>': e=ConnectError('[Errno -2] Name or service not known'), (attempt [1/4])

ERROR:simcore_service_director_v2.modules.dynamic_sidecar.api_client._base:Unexpected error with 'e.request=<Request('GET', 'http://dy-sidecar_a659701d-75c8-4219-95de-3edd9d0b1d88:8000/v1/containers?only_status=true')>': e=ConnectError('All connection attempts failed'), (attempt [2/4])

Then it retries and gets a 404:

INFO: [2022-09-16 14:47:57,980/MainProcess] [uvicorn.access:send(438)]  -  172.13.5.248:60372 - "GET /v1/containers/name?filters=%7B%22network%22:%20%22dy-sidecar_a659701d-75c8-4219-95de-3edd9d0b1d88%22%7D HTTP/1.1" 404
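The `filters` query parameter in this request is URL-encoded JSON. Decoding it with the Python standard library shows that the lookup asked for containers attached to the `dy-sidecar_<node_id>` network; the 404 then suggests that no container matching that filter was found at that moment (an interpretation of the log line, not a confirmed diagnosis).

```python
import json
from urllib.parse import parse_qs, urlsplit

# Request target copied from the uvicorn access log above.
request_target = (
    "/v1/containers/name?filters=%7B%22network%22:%20%22"
    "dy-sidecar_a659701d-75c8-4219-95de-3edd9d0b1d88%22%7D"
)

# parse_qs percent-decodes the value; json.loads turns it into a dict.
query = parse_qs(urlsplit(request_target).query)
filters = json.loads(query["filters"][0])
print(filters)
# {'network': 'dy-sidecar_a659701d-75c8-4219-95de-3edd9d0b1d88'}
```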

Then it tries to upload the workspace?

INFO: [2022-09-16 14:49:04,435/MainProcess] [simcore_sdk.node_data.data_manager:_push_file(41)]  -  uploading workspace.zip to S3 to a6d3d03a-14bd-11ed-a399-02420a0b0527/a659701d-75c8-4219-95de-3edd9d0b1d88/workspace.zip

It also looks to me like the containers are running (at least the dy-sidecar and the proxy), but "manager3" can't find the node.

elisabettai commented 2 years ago

According to @GitHK, there should be a fix in master and staging that still needs to be tested.

elisabettai commented 2 years ago

Still getting "ConnectError('[Errno -2] Name or service not known')" on master (while testing the opening of Bladder Control; this happens for sim4life-dy nodes). Some more logs are here.

Opening a new study and adding a new sim4life-dy node works, although a similar error appears (see here).

GitHK commented 2 years ago

iSeg was failing due to a lack of resources on production. sim4life-dy should now also start properly on production; the issue was hotfixed.

elisabettai commented 2 years ago

I've just tested on osparc.io and it works (added new iSeg and sim4life-dy nodes).

Hotfixed with: https://github.com/ITISFoundation/osparc-simcore/releases/tag/v1.34.6 (checked with PC)