FederatedAI / KubeFATE

Manage federated learning workload using cloud native technologies.
Apache License 2.0

No module named 'fate_llm' #915

Closed hari5g900 closed 10 months ago

hari5g900 commented 10 months ago

What deployment mode are you using? Kubernetes.

What KubeFATE and FATE versions are you using? KubeFATE 1.4.5, FATE 1.11.2.

What OS are you using for docker-compose or Kubernetes? Please also state the OS version.

To Reproduce

Two parties (9999 and 10000) with an exchange, across 3 VMs. Trying to run the ResNet pipeline example. The provided hetero-secureboost pipeline example works. Downloaded the CIFAR10 dataset from the link provided and unpacked it in "../../../../examples/data/".

What happened?

FATEBoard error for the job:

[ERROR] [2023-10-12 12:16:15,517] [202310121215148135290] [1127:139735548200768] - [task_executor._run_] [line:266]: No module named 'fate_llm'
Traceback (most recent call last):
  File "/data/projects/fate/fate/python/federatedml/nn/backend/torch/cust.py", line 75, in get_pytorch_model
    return get_class(
  File "/data/projects/fate/fate/python/federatedml/nn/backend/torch/cust.py", line 21, in get_class
    nn_modules = importlib.import_module(
  File "/opt/python3/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 973, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'federatedml.nn.model_zoo.resnet'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/projects/fate/fate/python/federatedml/nn/backend/torch/serialization.py", line 80, in recover_sequential_from_dict
    layer, class_name = recover_layer_from_dict(nn_define_dict[k], nn_dict)
  File "/data/projects/fate/fate/python/federatedml/nn/backend/torch/serialization.py", line 45, in recover_layer_from_dict
    layer = layer.get_pytorch_model()
  File "/data/projects/fate/fate/python/federatedml/nn/backend/torch/cust.py", line 81, in get_pytorch_model
    return get_class(
  File "/data/projects/fate/fate/python/federatedml/nn/backend/torch/cust.py", line 21, in get_class
    nn_modules = importlib.import_module(
  File "/opt/python3/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 961, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 961, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 973, in _find_and_load_unlocked

During handling of the above exception, another exception occurred:

[...]

  File "/data/projects/fate/fate/python/federatedml/nn/homo/client.py", line 297, in fit
    self.trainer_inst, model, optimizer, loss_fn, extra_data = self.init()
  File "/data/projects/fate/fate/python/federatedml/nn/homo/client.py", line 211, in init
    model = s.recover_sequential_from_dict(self.nn_define)


Additional context

The ResNet model IS present in the model-zoo folder. I have no idea why fate_llm is even being checked for; it is not referenced in the example code.
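For context, the lookup order implied by the traceback can be sketched as follows. This is a reconstruction, not the actual cust.py code: the search-package list and function body here are assumptions based on the chained exceptions, which show an import attempt from federatedml.nn.model_zoo first and fate_llm.model_zoo second. It explains why the error surfaced to FATEBoard mentions fate_llm even when fate_llm is unrelated: the last package tried is the one that appears in the final ModuleNotFoundError.

```python
import importlib

# Hypothetical reconstruction of the lookup order suggested by the
# traceback; the real logic lives in federatedml/nn/backend/torch/cust.py.
SEARCH_PACKAGES = ["federatedml.nn.model_zoo", "fate_llm.model_zoo"]


def get_class(module_name: str):
    """Try each package in order; re-raise the last import failure."""
    last_err = None
    for pkg in SEARCH_PACKAGES:
        try:
            return importlib.import_module(f"{pkg}.{module_name}")
        except ModuleNotFoundError as e:
            last_err = e  # fall through to the next package
    # The error that propagates is the *last* one tried, so 'fate_llm'
    # shows up in the log even if the real problem is a missing model file.
    raise last_err


try:
    get_class("resnet")
except ModuleNotFoundError as e:
    print(e)  # mentions 'fate_llm' when neither package has the module
```

So a "No module named 'fate_llm...'" error on a worker usually just means the custom model file was not found under federatedml/nn/model_zoo on that worker either.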

owlet42 commented 10 months ago

Please ensure that cluster.yaml uses the following configuration:

algorithm: ALL
device: GPU
python:
  resources:
    requests:
      nvidia.com/gpu: 1
    limits:
      nvidia.com/gpu: 1
hari5g900 commented 10 months ago

EDIT: I get a similar module-not-found error for other examples that use custom models.

Current cluster configuration. Party 9999:

name: fate-9999
namespace: fate-9999
chartName: fate
chartVersion: v1.11.2
partyId: 9999
registry: ""
pullPolicy:
imagePullSecrets: 
- name: myregistrykey
persistence: false
istio:
  enabled: false
podSecurityPolicy:
  enabled: false
ingressClassName: nginx
modules:
  - rollsite
  - clustermanager
  - mysql
  - python
  - nodemanager
  - fateboard
  - client

computing: Eggroll
federation: Eggroll
storage: Eggroll
algorithm: ALL
device: GPU

ingress:
  fateboard: 
    hosts:
    - name: party9999.fateboard.example.com
  client:  
    hosts:
    - name: party9999.notebook.example.com

rollsite: 
  type: NodePort
  nodePort: 30091
  exchange:
    ip: 10.68.107.146
    port: 30000

python:
  type: NodePort
  httpNodePort: 30097
  grpcNodePort: 30092
  logLevel: INFO
  resources:
    requests:
      nvidia.com/gpu: 1
    limits:
      nvidia.com/gpu: 1

nodemanager:
  replicas: 1
  sessionProcessorsPerNode: 1
  resources:
    requests:
      cpu: "1"
      memory: "1Gi"
    limits:
      cpu: "2"
      memory: "4Gi"

Party-10000

name: fate-10000
namespace: fate-10000
chartName: fate
chartVersion: v1.11.2
partyId: 10000
registry: ""
pullPolicy:
imagePullSecrets: 
- name: myregistrykey
persistence: false
istio:
  enabled: false
podSecurityPolicy:
  enabled: false
ingressClassName: nginx
modules:
  - rollsite
  - clustermanager
  - mysql
  - python
  - nodemanager
  - fateboard
  - client

computing: Eggroll
federation: Eggroll
storage: Eggroll
algorithm: ALL
device: GPU

ingress:
  fateboard: 
    hosts:
    - name: party10000.fateboard.example.com
  client:  
    hosts:
    - name: party10000.notebook.example.com

rollsite: 
  type: NodePort
  nodePort: 30101
  exchange:
    ip: 10.68.107.146
    port: 30000

python:
  type: NodePort
  httpNodePort: 30107
  grpcNodePort: 30102
  logLevel: INFO
  resources:
    requests:
      nvidia.com/gpu: 1
    limits:
      nvidia.com/gpu: 1

nodemanager:
  replicas: 1
  sessionProcessorsPerNode: 1
  resources:
    requests:
      cpu: "1"
      memory: "1Gi"
    limits:
      cpu: "2"
      memory: "4Gi"

exchange:

name: fate-exchange
namespace: fate-exchange
chartName: fate-exchange
chartVersion: v1.11.2
partyId: 1
registry: ""
pullPolicy:
imagePullSecrets: 
- name: myregistrykey
persistence: false
istio:
  enabled: false
podSecurityPolicy:
  enabled: false
modules:
  - rollsite

rollsite: 
  type: NodePort
  nodePort: 30000
  enableTLS: false
  partyList:
  - partyId: 9999
    partyIp: 10.68.107.148
    partyPort: 30091
  - partyId: 10000
    partyIp: 10.68.107.106
    partyPort: 30101

The ResNet example still does not work. Error in the notebook:

2023-10-19 11:23:49.997 | INFO     | pipeline.utils.invoker.job_submitter:monitor_job_status:83 - Job id is 202310191123487920170

2023-10-19 11:23:50.022 | ERROR    | __main__:<module>:1 - An error has been caught in function '<module>', process 'MainProcess' (167), thread 'MainThread' (139873412183872):
Traceback (most recent call last):

  File "/opt/python3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
           │         │     └ {'__name__': '__main__', '__doc__': 'Entry point for launching an IPython kernel.\n\nThis is separate from the ipykernel pack...
           │         └ <code object <module> at 0x7f36cfc13710, file "/data/projects/python/venv/lib/python3.8/site-packages/ipykernel_launcher.py",...
           └ <function _run_code at 0x7f36cfbf3040>
  File "/opt/python3/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
         │     └ {'__name__': '__main__', '__doc__': 'Entry point for launching an IPython kernel.\n\nThis is separate from the ipykernel pack...
         └ <code object <module> at 0x7f36cfc13710, file "/data/projects/python/venv/lib/python3.8/site-packages/ipykernel_launcher.py",...
  File "/data/projects/python/venv/lib/python3.8/site-packages/ipykernel_launcher.py", line 17, in <module>
    app.launch_new_instance()
    │   └ <bound method Application.launch_instance of <class 'ipykernel.kernelapp.IPKernelApp'>>
    └ <module 'ipykernel.kernelapp' from '/data/projects/python/venv/lib/python3.8/site-packages/ipykernel/kernelapp.py'>
  File "/data/projects/python/venv/lib/python3.8/site-packages/traitlets/config/application.py", line 1043, in launch_instance
    app.start()
    │   └ <function IPKernelApp.start at 0x7f36c4c529d0>
    └ <ipykernel.kernelapp.IPKernelApp object at 0x7f36cfccdcd0>
  File "/data/projects/python/venv/lib/python3.8/site-packages/ipykernel/kernelapp.py", line 728, in start
    self.io_loop.start()
    │    │       └ <function BaseAsyncIOLoop.start at 0x7f36c4c0e670>
    │    └ <tornado.platform.asyncio.AsyncIOMainLoop object at 0x7f36c298d5b0>
    └ <ipykernel.kernelapp.IPKernelApp object at 0x7f36cfccdcd0>
  File "/data/projects/python/venv/lib/python3.8/site-packages/tornado/platform/asyncio.py", line 195, in start
    self.asyncio_loop.run_forever()
    │    │            └ <function BaseEventLoop.run_forever at 0x7f36c963aaf0>
    │    └ <_UnixSelectorEventLoop running=True closed=False debug=False>
    └ <tornado.platform.asyncio.AsyncIOMainLoop object at 0x7f36c298d5b0>
  File "/opt/python3/lib/python3.8/asyncio/base_events.py", line 570, in run_forever
    self._run_once()
    │    └ <function BaseEventLoop._run_once at 0x7f36c963d670>
    └ <_UnixSelectorEventLoop running=True closed=False debug=False>
  File "/opt/python3/lib/python3.8/asyncio/base_events.py", line 1859, in _run_once
    handle._run()
    │      └ <function Handle._run at 0x7f36c9c7e430>
    └ <Handle <TaskWakeupMethWrapper object at 0x7f3635f31910>(<Future finis...670>, ...],))>)>
  File "/opt/python3/lib/python3.8/asyncio/events.py", line 81, in _run
    self._context.run(self._callback, *self._args)
    │    │            │    │           │    └ <member '_args' of 'Handle' objects>
    │    │            │    │           └ <Handle <TaskWakeupMethWrapper object at 0x7f3635f31910>(<Future finis...670>, ...],))>)>
    │    │            │    └ <member '_callback' of 'Handle' objects>
    │    │            └ <Handle <TaskWakeupMethWrapper object at 0x7f3635f31910>(<Future finis...670>, ...],))>)>
    │    └ <member '_context' of 'Handle' objects>
    └ <Handle <TaskWakeupMethWrapper object at 0x7f3635f31910>(<Future finis...670>, ...],))>)>
  File "/data/projects/python/venv/lib/python3.8/site-packages/ipykernel/kernelbase.py", line 516, in dispatch_queue
    await self.process_one()
          │    └ <function Kernel.process_one at 0x7f36c55353a0>
          └ <ipykernel.ipkernel.IPythonKernel object at 0x7f36c298dc10>
  File "/data/projects/python/venv/lib/python3.8/site-packages/ipykernel/kernelbase.py", line 505, in process_one
    await dispatch(*args)
          │         └ ([<zmq.sugar.frame.Frame object at 0x7f36c0152d50>, <zmq.sugar.frame.Frame object at 0x7f36c0152ca0>, <zmq.sugar.frame.Frame ...
          └ <bound method Kernel.dispatch_shell of <ipykernel.ipkernel.IPythonKernel object at 0x7f36c298dc10>>
  File "/data/projects/python/venv/lib/python3.8/site-packages/ipykernel/kernelbase.py", line 412, in dispatch_shell
    await result
          └ <coroutine object Kernel.execute_request at 0x7f36c29b53c0>
  File "/data/projects/python/venv/lib/python3.8/site-packages/ipykernel/kernelbase.py", line 740, in execute_request
    reply_content = await reply_content
                          └ <coroutine object IPythonKernel.do_execute at 0x7f3635b23dc0>
  File "/data/projects/python/venv/lib/python3.8/site-packages/ipykernel/ipkernel.py", line 422, in do_execute
    res = shell.run_cell(
          │     └ <function ZMQInteractiveShell.run_cell at 0x7f36c4c430d0>
          └ <ipykernel.zmqshell.ZMQInteractiveShell object at 0x7f36c299b1c0>
  File "/data/projects/python/venv/lib/python3.8/site-packages/ipykernel/zmqshell.py", line 540, in run_cell
    return super().run_cell(*args, **kwargs)
                             │       └ {'store_history': True, 'silent': False, 'cell_id': None}
                             └ ('pipeline.fit() # submit pipeline here',)
  File "/data/projects/python/venv/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3009, in run_cell
    result = self._run_cell(
             │    └ <function InteractiveShell._run_cell at 0x7f36c6854c10>
             └ <ipykernel.zmqshell.ZMQInteractiveShell object at 0x7f36c299b1c0>
  File "/data/projects/python/venv/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3064, in _run_cell
    result = runner(coro)
             │      └ <coroutine object InteractiveShell.run_cell_async at 0x7f3635b23140>
             └ <function _pseudo_sync_runner at 0x7f36c68454c0>
  File "/data/projects/python/venv/lib/python3.8/site-packages/IPython/core/async_helpers.py", line 129, in _pseudo_sync_runner
    coro.send(None)
    │    └ <method 'send' of 'coroutine' objects>
    └ <coroutine object InteractiveShell.run_cell_async at 0x7f3635b23140>
  File "/data/projects/python/venv/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3269, in run_cell_async
    has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
                       │    │             │        │     └ '/tmp/ipykernel_167/4238494827.py'
                       │    │             │        └ [<_ast.Expr object at 0x7f364976b880>]
                       │    │             └ <_ast.Module object at 0x7f364976b7f0>
                       │    └ <function InteractiveShell.run_ast_nodes at 0x7f36c6854ee0>
                       └ <ipykernel.zmqshell.ZMQInteractiveShell object at 0x7f36c299b1c0>
  File "/data/projects/python/venv/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3448, in run_ast_nodes
    if await self.run_code(code, result, async_=asy):
             │    │        │     │              └ False
             │    │        │     └ <ExecutionResult object at 7f3649e3f100, execution_count=19 error_before_exec=None error_in_exec=None info=<ExecutionInfo obj...
             │    │        └ <code object <module> at 0x7f3649c177c0, file "/tmp/ipykernel_167/4238494827.py", line 1>
             │    └ <function InteractiveShell.run_code at 0x7f36c6854f70>
             └ <ipykernel.zmqshell.ZMQInteractiveShell object at 0x7f36c299b1c0>
  File "/data/projects/python/venv/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3508, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
         │         │    │               │    └ {'__name__': '__main__', '__doc__': 'Automatically created module for IPython interactive environment', '__package__': None, ...
         │         │    │               └ <ipykernel.zmqshell.ZMQInteractiveShell object at 0x7f36c299b1c0>
         │         │    └ <property object at 0x7f36c6844680>
         │         └ <ipykernel.zmqshell.ZMQInteractiveShell object at 0x7f36c299b1c0>
         └ <code object <module> at 0x7f3649c177c0, file "/tmp/ipykernel_167/4238494827.py", line 1>

> File "/tmp/ipykernel_167/4238494827.py", line 1, in <module>
    pipeline.fit() # submit pipeline here
    │        └ <function PipeLine.fit at 0x7f3649d20040>
    └ <pipeline.backend.pipeline.PipeLine object at 0x7f36214f2e50>

  File "/data/projects/fate/fate/python/fate_client/pipeline/backend/pipeline.py", line 585, in fit
    self._fit_status = self._job_invoker.monitor_job_status(self._train_job_id,
    │    │             │    │            │                  │    └ '202310191123487920170'
    │    │             │    │            │                  └ <pipeline.backend.pipeline.PipeLine object at 0x7f36214f2e50>
    │    │             │    │            └ <function JobInvoker.monitor_job_status at 0x7f3649d1a3a0>
    │    │             │    └ <pipeline.utils.invoker.job_submitter.JobInvoker object at 0x7f3649d23dc0>
    │    │             └ <pipeline.backend.pipeline.PipeLine object at 0x7f36214f2e50>
    │    └ None
    └ <pipeline.backend.pipeline.PipeLine object at 0x7f36214f2e50>

  File "/data/projects/fate/fate/python/fate_client/pipeline/utils/invoker/job_submitter.py", line 85, in monitor_job_status
    ret_code, ret_msg, data = self.query_job(job_id, role, party_id)
                              │    │         │       │     └ '10000'
                              │    │         │       └ 'guest'
                              │    │         └ '202310191123487920170'
                              │    └ <function JobInvoker.query_job at 0x7f3649d1a430>
                              └ <pipeline.utils.invoker.job_submitter.JobInvoker object at 0x7f3649d23dc0>

  File "/data/projects/fate/fate/python/fate_client/pipeline/utils/invoker/job_submitter.py", line 145, in query_job
    data = result["data"][0]
           └ {'data': [], 'retcode': 0, 'retmsg': 'no job could be found'}

IndexError: list index out of range

Error in FATEBoard of party 9999:

[ERROR] [2023-10-19 11:32:34,158] [202310191131369401360] [7397:140114593052480] - [task_executor._run_] [line:266]: No module named 'fate_llm.model_zoo.resnet'
Traceback (most recent call last):
  File "/data/projects/fate/fate/python/federatedml/nn/backend/torch/cust.py", line 75, in get_pytorch_model
    return get_class(
  File "/data/projects/fate/fate/python/federatedml/nn/backend/torch/cust.py", line 21, in get_class
    nn_modules = importlib.import_module(
  File "/opt/python3/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 973, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'federatedml.nn.model_zoo.resnet'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/projects/fate/fate/python/federatedml/nn/backend/torch/serialization.py", line 80, in recover_sequential_from_dict
    layer, class_name = recover_layer_from_dict(nn_define_dict[k], nn_dict)
  File "/data/projects/fate/fate/python/federatedml/nn/backend/torch/serialization.py", line 45, in recover_layer_from_dict
    layer = layer.get_pytorch_model()
  File "/data/projects/fate/fate/python/federatedml/nn/backend/torch/cust.py", line 81, in get_pytorch_model
    return get_class(
  File "/data/projects/fate/fate/python/federatedml/nn/backend/torch/cust.py", line 21, in get_class
    nn_modules = importlib.import_module(
  File "/opt/python3/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 973, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'fate_llm.model_zoo.resnet'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/projects/fate/fate/python/federatedml/nn/backend/torch/cust.py", line 75, in get_pytorch_model
    return get_class(
  File "/data/projects/fate/fate/python/federatedml/nn/backend/torch/cust.py", line 21, in get_class
    nn_modules = importlib.import_module(
  File "/opt/python3/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 973, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'federatedml.nn.model_zoo.resnet'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/projects/fate/fateflow/python/fate_flow/worker/task_executor.py", line 210, in _run_
    cpn_output = run_object.run(cpn_input)
  File "/data/projects/fate/fate/python/federatedml/model_base.py", line 239, in run
    self._run(cpn_input=cpn_input)
  File "/data/projects/fate/fate/python/federatedml/model_base.py", line 318, in _run
    this_data_output = func(*params)
  File "/data/projects/fate/fate/python/federatedml/nn/homo/client.py", line 297, in fit
    self.trainer_inst, model, optimizer, loss_fn, extra_data = self.init()
  File "/data/projects/fate/fate/python/federatedml/nn/homo/client.py", line 211, in init
    model = s.recover_sequential_from_dict(self.nn_define)
  File "/data/projects/fate/fate/python/federatedml/nn/backend/torch/serialization.py", line 86, in recover_sequential_from_dict
    layer, class_name = recover_layer_from_dict(v, nn_dict)
  File "/data/projects/fate/fate/python/federatedml/nn/backend/torch/serialization.py", line 45, in recover_layer_from_dict
    layer = layer.get_pytorch_model()
  File "/data/projects/fate/fate/python/federatedml/nn/backend/torch/cust.py", line 81, in get_pytorch_model
    return get_class(
  File "/data/projects/fate/fate/python/federatedml/nn/backend/torch/cust.py", line 21, in get_class
    nn_modules = importlib.import_module(
  File "/opt/python3/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 973, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'fate_llm.model_zoo.resnet'

No error logs in party 10000. The hetero-secureboost example works.

hari5g900 commented 10 months ago

Closing the issue as I found a fix.

All files created in the Jupyter notebooks are stored in the client-0 pod. However, jobs (pipeline.fit()) run in the fateflow container of the python-0 pod. The fateflow container has its own copies of models and datasets, so any newly created model files or downloaded data files must be copied into the python-0 pod manually. This is neither intuitive nor efficient.
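As a sketch, the manual copy can be done with kubectl cp. The pod names (client-0, python-0) and the target model-zoo path come from this thread's configs and tracebacks, but the container name and the source path inside the client pod are assumptions; verify them for your deployment with kubectl get pods and kubectl describe pod.

```shell
# Copy a custom model file from the client pod (where the notebook saved it)
# into the fateflow container of the python pod, for both parties.
# SRC is illustrative: point it at wherever your notebook wrote the file.
NS=fate-9999
SRC=/path/in/client/pod/resnet.py   # hypothetical source path
DST=/data/projects/fate/fate/python/federatedml/nn/model_zoo/resnet.py

kubectl -n "$NS" cp "client-0:$SRC" ./resnet.py
kubectl -n "$NS" cp ./resnet.py "python-0:$DST" -c python   # container name may differ
```

Downloaded datasets need the same treatment: they must land under examples/data inside the fateflow container, not just in the notebook pod, before pipeline.fit() can find them.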