CentaurusInfra / mizar

Mizar – Experimental, High Scale and High Performance Cloud Network https://mizar.readthedocs.io
GNU General Public License v2.0
112 stars, 50 forks

Subnet create - Retry host endpoint creation #624

Closed. phudtran closed this 2 years ago

phudtran commented 2 years ago

This PR adds a retry to host endpoint creation when the subnet comes up. It fixes an issue where the operator tries to create a host endpoint before the daemon is up.

Sindica commented 2 years ago

Hi Phu,

I ran the same test again. TP1 works fine, but TP2 is missing the host eps. I got the same error in the operator log:

2022-02-17T20:05:15.278456383Z stderr F [2022-02-17 20:05:15,276] luigi-interface      [ERROR   ] [pid 7] Worker Worker(salt=857259027, workers=1, host=ip-172-30-0-156, username=root, pid=7) failed    NetCreate(param=<mizar.common.wf_param.HandlerParam object at 0x7f81d477b250>)
2022-02-17T20:05:15.278485009Z stderr F Traceback (most recent call last):
2022-02-17T20:05:15.278493113Z stderr F   File "/usr/local/lib/python3.9/site-packages/luigi/worker.py", line 199, in run
2022-02-17T20:05:15.278499431Z stderr F     new_deps = self._run_get_new_deps()
2022-02-17T20:05:15.27850451Z stderr F   File "/usr/local/lib/python3.9/site-packages/luigi/worker.py", line 141, in _run_get_new_deps
2022-02-17T20:05:15.278510031Z stderr F     task_gen = self.task.run()
2022-02-17T20:05:15.278515828Z stderr F   File "/usr/local/lib/python3.9/site-packages/mizar/dp/mizar/workflows/nets/create.py", line 76, in run
2022-02-17T20:05:15.278521153Z stderr F     droplet.interfaces = endpoints_opr.init_host_endpoint_interfaces(
2022-02-17T20:05:15.278527297Z stderr F   File "/usr/local/lib/python3.9/site-packages/mizar/dp/mizar/operators/endpoints/endpoints_operator.py", line 463, in init_host_endpoint_interfaces
2022-02-17T20:05:15.278533697Z stderr F     return InterfaceServiceClient(droplet.main_ip).InitializeInterfaces(interfaces)
2022-02-17T20:05:15.278539027Z stderr F   File "/usr/local/lib/python3.9/site-packages/mizar/daemon/interface_service.py", line 319, in InitializeInterfaces
2022-02-17T20:05:15.278575591Z stderr F     resp = self.stub.InitializeInterfaces(interfaces_list)
2022-02-17T20:05:15.278583231Z stderr F   File "/usr/local/lib/python3.9/site-packages/grpc/_channel.py", line 946, in __call__
2022-02-17T20:05:15.278588463Z stderr F     return _end_unary_response_blocking(state, call, False, None)
2022-02-17T20:05:15.278593588Z stderr F   File "/usr/local/lib/python3.9/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
2022-02-17T20:05:15.278598731Z stderr F     raise _InactiveRpcError(state)
2022-02-17T20:05:15.278603842Z stderr F grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
2022-02-17T20:05:15.278610581Z stderr F         status = StatusCode.UNAVAILABLE
2022-02-17T20:05:15.278615824Z stderr F         details = "failed to connect to all addresses"
2022-02-17T20:05:15.278621718Z stderr F         debug_error_string = "{"created":"@1645128315.275418059","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3134,"referenced_errors":[{"created":"@1645128315.275416392","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"

phudtran commented 2 years ago

Do you see this line after that error: "Daemon not yet ready for droplet some_ip_here"? The operator should keep retrying until the daemon is up, at which point it creates the host endpoint. If the host endpoint never comes up, there may be another issue.
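
For readers following along, here is a minimal sketch of the retry behavior described above. It is not the code from this PR. `DaemonUnavailable`, `retry_until_ready`, and `make_flaky_init` are hypothetical names, and the exception is only a stand-in for the gRPC `StatusCode.UNAVAILABLE` error in the traceback. The attempt count and delay are assumptions as well.

```python
import time


class DaemonUnavailable(Exception):
    """Hypothetical stand-in for a gRPC error with StatusCode.UNAVAILABLE."""


def retry_until_ready(call, droplet_ip, attempts=5, delay=1.0, sleep=time.sleep):
    """Run call() until it succeeds or the attempts are used up.

    Re-raises the last DaemonUnavailable if the daemon never comes up.
    """
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except DaemonUnavailable:
            # Same message the operator is expected to log while it waits.
            print(f"Daemon not yet ready for droplet {droplet_ip}")
            if attempt == attempts:
                raise  # out of attempts: likely a different problem
            sleep(delay)


def make_flaky_init(failures):
    """Build a fake init call that fails `failures` times, then succeeds."""
    state = {"calls": 0}

    def init():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise DaemonUnavailable("failed to connect to all addresses")
        return "interfaces-initialized"

    return init, state
```

With `make_flaky_init(2)`, the first two calls fail and the third succeeds, so `retry_until_ready` logs the "not yet ready" line twice before returning. If the call keeps failing past the last attempt, the error propagates, which matches the "there may be another issue" case.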