FederatedAI / KubeFATE

Manage federated learning workload using cloud native technologies.
Apache License 2.0

KubeFATE on Kubernetes (1 master, 1 worker node): Python error "No such file or directory: '1'" #796

Open pamystri opened 2 years ago

pamystri commented 2 years ago

What deployment mode are you using?

  1. Kubernetes.

What KubeFATE and FATE version you are using?

v1.9.0

What OS are you using for docker-compose or Kubernetes? Please also state the OS version.

OS: Ubuntu Version 18.04

OS: Windows 10 Browser Firefox Version 106.03

To Reproduce

I am getting the error below from the python-0 pod:

[ERROR][2022-11-03 17:22:08,549][command_client_2,pid:17,tid:140321505650432][client.py:96.sync_send] - Error calling to nodemanager-1.nodemanager:37019, command_uri: CommandURI(_uri=v1/egg-pair/runTask), req:ErCommandRequest(id=20221103.172208.543608, uri=v1/egg-pair/runTask, args=[[b'\np202211031721585025250_reader_0_0_guest_9999-py-job-20221103.172208.541952_cleanup-task-nodemanager-1.nodemanager\x12\x07destroy\x1aV\x08\xff\xff\xff\xff\xff\xff\xff\xff\xff\x01\x12>\x08\xff\xff\xff\xff\xff\xff\xff\xff\xff\x01\x12\x01\x1a+202211031721585025250_reader_0_0_guest_9999"\x01(\xff\xff\xff\xff\xff\xff\xff\xff\xff\x01*\x9e\x01\nQ202211031721585025250_reader_0_0_guest_9999-py-job-20221103.172208.541952_cleanup\x12\x07d'], len=1], kwargs=[***, len=0])
Traceback (most recent call last):
  File "/data/projects/fate/eggroll/python/eggroll/core/client.py", line 84, in sync_send
    response = _command_stub.call(request.to_proto())
  File "/opt/python3/lib/python3.8/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/opt/python3/lib/python3.8/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNKNOWN
details = "Exception calling application:

==== detail start, at 20221103.172208.548 ====
Traceback (most recent call last):
  File "/data/projects/fate/eggroll/python/eggroll/core/utils.py", line 187, in wrapper
    return func(*args, **kw)
  File "/data/projects/fate/eggroll/python/eggroll/roll_pair/egg_pair.py", line 245, in run_task
    shutil.rmtree(path)
  File "/opt/python3/lib/python3.8/shutil.py", line 718, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/opt/python3/lib/python3.8/shutil.py", line 655, in _rmtree_safe_fd
    _rmtree_safe_fd(dirfd, fullname, onerror)
  File "/opt/python3/lib/python3.8/shutil.py", line 645, in _rmtree_safe_fd
    onerror(os.lstat, fullname, sys.exc_info())
  File "/opt/python3/lib/python3.8/shutil.py", line 642, in _rmtree_safe_fd
    orig_st = entry.stat(follow_symlinks=False)
FileNotFoundError: [Errno 2] No such file or directory: '1'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/projects/fate/eggroll/python/eggroll/core/utils.py", line 187, in wrapper
    return func(*args, **kw)
  File "/data/projects/fate/eggroll/python/eggroll/core/command/command_service.py", line 30, in call
    call_result = CommandRouter.get_instance() \
  File "/data/projects/fate/eggroll/python/eggroll/core/command/command_router.py", line 94, in dispatch
    raise e
  File "/data/projects/fate/eggroll/python/eggroll/core/command/command_router.py", line 91, in dispatch
    call_result = _method(_instance, deserialized_args)
  File "/data/projects/fate/eggroll/python/eggroll/core/utils.py", line 194, in wrapper
    raise RuntimeError(msg)
RuntimeError:

==== detail start, at 20221103.172208.546 ====
Traceback (most recent call last):
  File "/data/projects/fate/eggroll/python/eggroll/core/utils.py", line 187, in wrapper
    return func(*args, **kw)
  File "/data/projects/fate/eggroll/python/eggroll/roll_pair/egg_pair.py", line 245, in run_task
    shutil.rmtree(path)
  File "/opt/python3/lib/python3.8/shutil.py", line 718, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/opt/python3/lib/python3.8/shutil.py", line 655, in _rmtree_safe_fd
    _rmtree_safe_fd(dirfd, fullname, onerror)
  File "/opt/python3/lib/python3.8/shutil.py", line 645, in _rmtree_safe_fd
    onerror(os.lstat, fullname, sys.exc_info())
  File "/opt/python3/lib/python3.8/shutil.py", line 642, in _rmtree_safe_fd
    orig_st = entry.stat(follow_symlinks=False)
FileNotFoundError: [Errno 2] No such file or directory: '1'

==== detail end ====

==== detail end ====

" debug_error_string = "{"created":"@1667496128.548778243","description":"Error received from peer ipv4:10.42.182.16:37019", "file":"src/core/lib/surface/call.cc","file_line":952,"grpc_message":"Exception calling application: \n\n==== detail start, at 20221103.172208.548 ====\nTraceback (most recent call last):\n File "/data/projects/fate/eggroll/python/eggroll/core/utils.py", line 187, in wrapper\n return func(*args, **kw)\n File "/data/projects/fate/eggroll/python/eggroll/roll_pair/egg_pair.py", line 245, in run_task\n shutil.rmtree(path)\n File "/opt/python3/lib/python3.8/shutil.py", line 718, in rmtree\n _rmtree_safe_fd(fd, path, onerror)\n File "/opt/python3/lib/python3.8/shutil.py", line 655, in _rmtree_safe_fd\n _rmtree_safe_fd(dirfd, fullname, onerror)\n File "/opt/python3/lib/python3.8/shutil.py", line 645, in _rmtree_safe_fd\n onerror(os.lstat, fullname, sys.exc_info())\n File "/opt/python3/lib/python3.8/shutil.py", line 642, in _rmtree_safe_fd\n orig_st = entry.stat(follow_symlinks=False)\nFileNotFoundError: [Errno 2] No such file or directory: '1'\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n File "/data/projects/fate/eggroll/python/eggroll/core/utils.py", line 187, in wrapper\n return func(*args, **kw)\n File "/data/projects/fate/eggroll/python/eggroll/core/command/command_service.py", line 30, in call\n call_result = CommandRouter.get_instance() \n File "/data/projects/fate/eggroll/python/eggroll/core/command/command_router.py", line 94, in dispatch\n raise e\n File "/data/projects/fate/eggroll/python/eggroll/core/command/command_router.py", line 91, in dispatch\n call_result = _method(_instance, deserialized_args)\n File "/data/projects/fate/eggroll/python/eggroll/core/utils.py", line 194, in wrapper\n raise RuntimeError(msg)\nRuntimeError: \n\n==== detail start, at 20221103.172208.546 ====\nTraceback (most recent call last):\n File "/data/projects/fate/eggroll/python/eggroll/core/utils.py", line 187, in wrapper\n return func(*args, **kw)\n File "/data/projects/fate/eggroll/python/eggroll/roll_pair/egg_pair.py", line 245, in run_task\n shutil.rmtree(path)\n File "/opt/python3/lib/python3.8/shutil.py", line 718, in rmtree\n _rmtree_safe_fd(fd, path, onerror)\n File "/opt/python3/lib/python3.8/shutil.py", line 655, in _rmtree_safe_fd\n _rmtree_safe_fd(dirfd, fullname, onerror)\n File "/opt/python3/lib/python3.8/shutil.py", line 645, in _rmtree_safe_fd\n onerror(os.lstat, fullname, sys.exc_info())\n File "/opt/python3/lib/python3.8/shutil.py", line 642, in _rmtree_safe_fd\n orig_st = entry.stat(follow_symlinks=False)\nFileNotFoundError: [Errno 2] No such file or directory: '1'\n\n==== detail end ====\n\n\n\n==== detail end ====\n\n","grpc_status":2}"

I installed KubeFATE with the YAML files below for parties 9999 and 10000.

Party_10000

name: fate-10000
namespace: fate-10000
chartName: fate
chartVersion: v1.9.0
partyId: 10000
registry: ""
pullPolicy:
imagePullSecrets:
- name: myregistrykey
persistence: false
istio:
  enabled: false
podSecurityPolicy:
  enabled: false
ingressClassName: nginx
modules:
  - rollsite
  - clustermanager
  - nodemanager
  - mysql
  - python
  - fateboard
  - client

computing: Eggroll
federation: Eggroll
storage: Eggroll
algorithm: Basic
device: IPCL

ingress:
  fateboard:
    hosts:
    - name: party10000.fateboard.example.com
  client:
    hosts:
    - name: party10000.notebook.example.com

rollsite:
  type: NodePort
  nodePort: 30101
  exchange:
    ip: 192.168.122.20
    port: 30000
  partyList:
  - partyId: 1
    partyIp: 192.168.122.20
    partyPort: 30000
  - partyId: 9999
    partyIp: 192.168.122.20
    partyPort: 30091

python:
  type: NodePort
  httpNodePort: 30107
  grpcNodePort: 30102
  logLevel: INFO

servingIp: 192.168.122.20
servingPort: 30105

Party_9999

name: fate-9999
namespace: fate-9999
chartName: fate
chartVersion: v1.9.0
partyId: 9999
registry: ""
pullPolicy:
imagePullSecrets:
- name: myregistrykey
persistence: false
istio:
  enabled: false
podSecurityPolicy:
  enabled: false
ingressClassName: nginx
modules:
  - rollsite
  - clustermanager
  - nodemanager
  - mysql
  - python
  - fateboard
  - client

computing: Eggroll
federation: Eggroll
storage: Eggroll
algorithm: Basic
device: IPCL

ingress:
  fateboard:
    hosts:
    - name: party9999.fateboard.example.com
  client:
    hosts:
    - name: party9999.notebook.example.com

rollsite:
  type: NodePort
  nodePort: 30091
  exchange:
    ip: 192.168.122.20
    port: 30000
  partyList:
  - partyId: 1
    partyIp: 192.168.122.20
    partyPort: 30000
  - partyId: 10000
    partyIp: 192.168.122.20
    partyPort: 30101

python:
  type: NodePort
  httpNodePort: 30097
  grpcNodePort: 30092
  logLevel: INFO

servingIp: 192.168.122.20
servingPort: 30095

Exchange

name: fate-exchange
namespace: fate-exchange
chartName: fate-exchange
chartVersion: v1.9.0
partyId: 1
registry: ""
pullPolicy:
imagePullSecrets:
- name: myregistrykey
persistence: false
istio:
  enabled: false
podSecurityPolicy:
  enabled: false
modules:
  - rollsite

rollsite:
  type: NodePort
  nodePort: 30000
  enableTLS: false
  partyList:
  - partyId: 10000
    partyIp: 192.168.122.20
    partyPort: 30101
  - partyId: 9999
    partyIp: 192.168.122.20
    partyPort: 30091

Could you please explain what the issue could be?

Thanks.

owlet42 commented 2 years ago

Remove the rollsite.partyList section from the cluster configs of parties 9999 and 10000.
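
As a rough sketch of what that change might look like, here is the rollsite section of the party 9999 file posted above with partyList dropped and every other value left as it was. The reply does not give a reason; the inline comment about routing through the exchange is an assumption, not something confirmed in this thread.

```yaml
# Party 9999 cluster file, rollsite section only (all other keys unchanged).
rollsite:
  type: NodePort
  nodePort: 30091
  exchange:
    ip: 192.168.122.20
    port: 30000
  # partyList removed, as suggested in the reply above.
  # Assumption: with an exchange configured, traffic to other parties
  # is expected to go through the exchange rather than direct entries.
```

The same edit would apply to the party 10000 file, where nodePort stays 30101. The exchange file keeps its own partyList, since the reply only mentions 9999 and 10000.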