ITISFoundation / osparc-issues

🐼 issue-only repo for the osparc project

TVB service fails on osparc.io #835

Closed elisabettai closed 1 year ago

elisabettai commented 1 year ago

Long Story Short: The TVB service fails to start on osparc.io.

Steps to reproduce: In a new study, add a new tvb node.

Additional context: This is a legacy dynamic service.

In Portainer, tvb-app (the backend part of the service) stays pending forever, with the error message: no suitable node (insufficient resources on 8 nodes; scheduling constraints not satisfied on 3 nodes)

tvb-web fails (11 instances exist) with the error: host not found in upstream "tvb-app_6914e85f-ce9c-4ef3-be7b-b897f1627057:8060" in /etc/nginx/conf.d/default.conf:2

This is what the user sees in the logger: Service failed: task: non-zero exit (1)

GitHK commented 1 year ago

@mguidon could this be related to the default resource changes?

elisabettai commented 1 year ago

Also iseg fails super-quick when trying to add a new node.

elisabettai commented 1 year ago

With iseg, the web dev tools showed a 503. There was also this pydantic error in the director: none is not an allowed value (type=type_error.none.not_allowed)

pcrespov commented 1 year ago

> @mguidon could this be related to the default resource changes?

No. After some debugging with @sanderegg, we noticed that two services associated with @elisabettai were failing the validation above because all of their containers had failed. Such a service still exists but has no container, so service_port == None, and therefore the RunningDynamicServiceDetails validation failed.
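The failure mode can be illustrated with a minimal stdlib sketch. It assumes a required, non-None port field and uses a hypothetical stand-in class (ServiceDetailsSketch) and a hypothetical helper (port_from_containers); it is not the actual RunningDynamicServiceDetails model or the director's code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServiceDetailsSketch:
    """Hypothetical stand-in for RunningDynamicServiceDetails (not the real model)."""

    service_uuid: str
    service_port: int  # assumed required: None is rejected, mimicking the pydantic error

    def __post_init__(self) -> None:
        # Mimic the "none is not an allowed value" validation failure.
        if self.service_port is None:
            raise ValueError("service_port: none is not an allowed value")


def port_from_containers(containers: list) -> Optional[int]:
    """Hypothetical helper: first container's port, or None if there is no container."""
    return containers[0]["port"] if containers else None


# A service with a running container validates fine.
ok = ServiceDetailsSketch("tvb-1", port_from_containers([{"port": 8060}]))

# A service that exists but has no container: the port is None and validation fails.
try:
    ServiceDetailsSketch("tvb-2", port_from_containers([]))
    failed = False
    message = ""
except ValueError as err:
    failed = True
    message = str(err)
```

In this sketch a single container-less service is enough to raise, which matches the observation that the two broken services made the whole listing fail.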

This issue needs a follow-up: @GitHK, we need to determine what is expected from get_running_services, since a service can still exist without any running container.

SEE https://github.com/ITISFoundation/osparc-simcore/blob/308fa08e60693836e1ce225e53b535c95bd95384/services/director-v2/src/simcore_service_director_v2/modules/director_v0.py#L145-L164
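One possible answer, sketched only as an assumption and not as the decided behavior of get_running_services: separate the services that have a port from those that do not, so that a container-less service is reported instead of breaking the whole listing. The function name split_by_containers and the dict keys are hypothetical.

```python
from typing import Any, Dict, List, Tuple


def split_by_containers(
    services: List[Dict[str, Any]],
) -> Tuple[List[Dict[str, Any]], List[str]]:
    """Hypothetical helper: split services into usable ones and container-less ones.

    A service whose "service_port" is missing or None is assumed to have no
    running container; only its uuid is collected so it can be reported or
    cleaned up later.
    """
    running: List[Dict[str, Any]] = []
    without_containers: List[str] = []
    for service in services:
        if service.get("service_port") is None:
            without_containers.append(service["service_uuid"])
        else:
            running.append(service)
    return running, without_containers


# Example data (made up for illustration).
services = [
    {"service_uuid": "tvb-1", "service_port": 8060},
    {"service_uuid": "tvb-2", "service_port": None},
    {"service_uuid": "iseg-1"},
]
running, without_containers = split_by_containers(services)
```

Whether such services should be skipped, reported with a failed state, or removed automatically is exactly the open question for the follow-up.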

@GitHK please let me know when you have a moment to check this issue.

sanderegg commented 1 year ago

This is what happened to @elisabettai :

pcrespov commented 1 year ago

Workaround: @sanderegg deleted legacy services that had no containers running and everything worked again.

elisabettai commented 1 year ago

Closing since, from the user's perspective, the workaround solved it. Let me know @pcrespov if I should create a follow-up case in osparc-simcore (or if you prefer doing that).