CentaurusInfra / mizar

Mizar – Experimental, High Scale and High Performance Cloud Network https://mizar.readthedocs.io
GNU General Public License v2.0

Cannot create container due to File exists error #629

Open Sindica opened 2 years ago

Sindica commented 2 years ago

What happened: Deploying two netpods from the system tenant left one or both pods stuck in creating. Later attempts to create the pods succeeded. The Mizar operator logged the following error:

[2022-02-22 19:56:29,493] luigi-interface      [ERROR   ] [pid 7] Worker Worker(salt=020248522, workers=1, host=ip-172-30-0-14, username=root, pid=7) failed    k8sPodCreate(param=<mizar.common.wf_param.HandlerParam object at 0x7f9b980524c0>)
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/mizar/daemon/interface_service.py", line 320, in InitializeInterfaces
    resp = self.stub.InitializeInterfaces(interfaces_list)
  File "/usr/local/lib/python3.9/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/usr/local/lib/python3.9/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNKNOWN
        details = "Exception calling application: (17, 'File exists')"
        debug_error_string = "{"created":"@1645559789.491610255","description":"Error received from peer ipv4:172.30.0.60:50051","file":"src/core/lib/surface/call.cc","file_line":903,"grpc_message":"Exception calling application: (17, 'File exists')","grpc_status":2}"
>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/luigi/worker.py", line 199, in run
    new_deps = self._run_get_new_deps()
  File "/usr/local/lib/python3.9/site-packages/luigi/worker.py", line 141, in _run_get_new_deps
    task_gen = self.task.run()
  File "/usr/local/lib/python3.9/site-packages/mizar/dp/mizar/workflows/builtins/pods/create.py", line 177, in run
    interfaces = endpoint_opr.init_simple_endpoint_interfaces(
  File "/usr/local/lib/python3.9/site-packages/mizar/dp/mizar/operators/endpoints/endpoints_operator.py", line 445, in init_simple_endpoint_interfaces
    return InterfaceServiceClient(worker_ip).InitializeInterfaces(interfaces, task)
  File "/usr/local/lib/python3.9/site-packages/mizar/daemon/interface_service.py", line 327, in InitializeInterfaces
    task.raise_permanent_error(
  File "/usr/local/lib/python3.9/site-packages/mizar/common/workflow.py", line 51, in raise_permanent_error
    raise Exception(self.error_message)
Exception: Unknown gRPC error Exception calling application: (17, 'File exists')
[2022-02-22 19:56:29,494] luigi-interface      [INFO    ] Worker Worker(salt=580018591, workers=1, host=ip-172-30-0-14, username=root, pid=7) was stopped. Shutting down Keep-Alive thread
[2022-02-22 19:56:29,495] luigi-interface      [INFO    ]
===== Luigi Execution Summary =====

Scheduled 1 tasks of which:
* 1 failed:
    - 1 k8sPodCreate(param=<mizar.common.wf_param.HandlerParam object at 0x7f9b90625130>)

What you expected to happen: The pods are created successfully.

How to reproduce it (as minimally and precisely as possible): In an Arktos scale-out 2x2 environment, deploy a pod.

Environment:

Sindica commented 2 years ago
  1. Created two netpods from TP2; both succeeded.
  2. Tried to create another two netpods with the same names (as TP2); both got stuck.
  3. Created another two netpods with different names; both succeeded.
  4. Retried creating the netpods from step 2; one succeeded, the other got stuck.
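
For reference, the `(17, 'File exists')` in the trace matches errno 17 (`EEXIST`), which is what the OS returns when something is created under a name that is already taken. That would be consistent with the same-name failures in the steps above. The sketch below only illustrates that error class with the Python standard library. A temporary directory stands in for whatever named resource the daemon creates, which this issue does not identify, and `create_named` and `create_named_idempotent` are hypothetical helpers, not Mizar functions.

```python
import errno
import os
import tempfile


def create_named(path):
    """Create a named resource. A directory is a stand-in here (assumption)."""
    os.mkdir(path)


def create_named_idempotent(path):
    """Same creation, but treat 'already exists' as reuse instead of failure."""
    try:
        os.mkdir(path)
        return "created"
    except OSError as exc:
        # Only swallow EEXIST; any other OS error is still a real failure.
        if exc.errno != errno.EEXIST:
            raise
        return "reused"


base = tempfile.mkdtemp()
name = os.path.join(base, "netpod1")  # hypothetical resource name

create_named(name)  # first creation succeeds
try:
    create_named(name)  # second creation under the same name fails
    second_attempt = None
except OSError as exc:
    second_attempt = (exc.errno, os.strerror(exc.errno))

print(second_attempt)
print(create_named_idempotent(name))
```

Whether reusing the existing resource or cleaning up the stale one before retrying is the right fix depends on what is actually left behind on the host, which would need to be checked there.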