eu-nebulous / optimiser-controller

Mozilla Public License 2.0

Optimiser controller failed to get cluster status #15

Open robert-sanfeliu opened 1 month ago

robert-sanfeliu commented 1 month ago

During the deployment of a dummy app, the optimiser controller failed to retrieve the status of a cluster and consequently aborted the deployment.

Here are the logs from the controller:

{"@timestamp":"2024-07-24T13:02:25.128360749Z","@version":"1","message":"exn-middleware-sal request failed with error code '' and message '', caller 'getCluster'","logger_name":"eu.nebulouscloud.optimiser.controller.ExnConnector","thread_name":"Thread-11","level":"ERROR","level_value":40000,"appId":"1455022407rest-processor-app1721825702821","clusterName":"14550-18"}
{"@timestamp":"2024-07-24T13:02:25.128454587Z","@version":"1","message":"getCluster returned invalid result (null or structure without 'status' field) too many times, giving up","logger_name":"eu.nebulouscloud.optimiser.controller.NebulousAppDeployer","thread_name":"Thread-11","level":"WARN","level_value":30000,"appId":"1455022407rest-processor-app1721825702821","clusterName":"14550-18"}
{"@timestamp":"2024-07-24T13:02:25.12847285Z","@version":"1","message":"Error while waiting for deployCluster to finish, trying to delete cluster {\"name\":\"14550-18\",\"master-node\":\"m14550-18-master\",\"nodes\":[{\"nodeName\":\"m14550-18-master\",\"nodeCandidateId\":\"8a74849d90dadc8d0190df8f574b15c8\",\"cloudId\":\"c9a625c7-f705-4128-948f-6b5765509029\"},{\"nodeName\":\"n14550-18-dummy-app-controller-1-1\",\"nodeCandidateId\":\"8a74849d90dadc8d0190df8f574b15c8\",\"cloudId\":\"c9a625c7-f705-4128-948f-6b5765509029\"},{\"nodeName\":\"n14550-18-dummy-app-worker-1-1\",\"nodeCandidateId\":\"8a74849d90dadc8d0190df8f574b15c8\",\"cloudId\":\"c9a625c7-f705-4128-948f-6b5765509029\"}],\"env-var\":{\"APPLICATION_ID\":\"1455022407rest-processor-app1721825702821\",\"BROKER_ADDRESS\":\"158.37.63.86\",\"ACTIVEMQ_HOST\":\"158.37.63.86\",\"BROKER_PORT\":\"32754\",\"ACTIVEMQ_PORT\":\"32754\",\"ONM_IP\":\"158.39.201.249\",\"ONM_URL\":\"http://158.37.63.36:8082/\"}} and aborting deployment","logger_name":"eu.nebulouscloud.optimiser.controller.NebulousAppDeployer","thread_name":"Thread-11","level":"ERROR","level_value":40000,"appId":"1455022407rest-processor-app1721825702821","clusterName":"14550-18"}

I checked the logs of the EXN middleware component but could not find any request from the optimiser controller. I retried the deployment of the same app and, this time, the EXN middleware received the requests from the controller and the deployment succeeded.

Could this be a bug in the optimiser controller, in the EXN middleware, or in the EXN middleware Java library?
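
To help narrow this down, here is a minimal sketch of a bounded polling loop that records why each `getCluster` attempt failed, separating "no response at all" from "response without a `status` field" and from an exception. This is not the controller's actual code: the class `ClusterStatusPoller`, the method `pollStatus`, the `Supplier`-based stand-in for the middleware call, and the retry parameters are all hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical helper, not part of the optimiser controller. */
public class ClusterStatusPoller {

    /** Result of polling: the status if one was obtained, plus one reason per failed attempt. */
    public static final class PollResult {
        public final String status;          // null if no valid status was ever received
        public final List<String> failures;  // human-readable reason for each failed attempt

        PollResult(String status, List<String> failures) {
            this.status = status;
            this.failures = failures;
        }

        public boolean succeeded() {
            return status != null;
        }
    }

    /**
     * Calls getCluster up to maxAttempts times, waiting delayMillis between attempts.
     * getCluster is an assumed stand-in for the real middleware request.
     */
    public static PollResult pollStatus(Supplier<Map<String, Object>> getCluster,
                                        int maxAttempts, long delayMillis) {
        List<String> failures = new ArrayList<>();
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                Map<String, Object> response = getCluster.get();
                if (response == null) {
                    // Nothing came back: the request may never have reached the middleware.
                    failures.add("attempt " + attempt + ": no response (null)");
                } else if (response.get("status") instanceof String) {
                    return new PollResult((String) response.get("status"), failures);
                } else {
                    // Something came back, but not in the expected shape.
                    failures.add("attempt " + attempt + ": response without 'status' field, keys="
                            + response.keySet());
                }
            } catch (RuntimeException e) {
                failures.add("attempt " + attempt + ": exception " + e);
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                    failures.add("interrupted while waiting before attempt " + (attempt + 1));
                    break;
                }
            }
        }
        return new PollResult(null, failures);
    }
}
```

If every entry in `failures` reads "no response (null)", that would be consistent with the request never reaching the middleware, which would match the missing entries in the EXN middleware logs. Entries reporting a response without a `status` field would point at the middleware's reply instead.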

rudi commented 1 month ago

Was that during a parallel deployment, or was only one deployment ongoing?

robert-sanfeliu commented 1 month ago

I can't remember. I'll keep an eye on the issue and, if I see it happen again, I'll let you know.