FederatedAI / KubeFATE

Manage federated learning workload using cloud native technologies.
Apache License 2.0

No module named 'fate_llm' #915

Closed hari5g900 closed 10 months ago

hari5g900 commented 10 months ago

What deployment mode are you using? Kubernetes.

What KubeFATE and FATE versions are you using? KubeFATE 1.4.5, FATE 1.11.2.

What OS are you using for docker-compose or Kubernetes? Please also state the OS version.

To Reproduce

Two parties (9999 and 10000) with an exchange, across 3 VMs. Trying to run the ResNet pipeline example. The provided hetero-secureboost pipeline example works. Downloaded the CIFAR10 dataset from the link provided and unpacked it in "../../../../examples/data/".

What happened?

FATEBoard error for the job:

[ERROR] [2023-10-12 12:16:15,517] [202310121215148135290] [1127:139735548200768] - [task_executor._run_] [line:266]: No module named 'fate_llm'
Traceback (most recent call last):
  File "/data/projects/fate/fate/python/federatedml/nn/backend/torch/cust.py", line 75, in get_pytorch_model
    return get_class(
  File "/data/projects/fate/fate/python/federatedml/nn/backend/torch/cust.py", line 21, in get_class
    nn_modules = importlib.import_module(
  File "/opt/python3/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 973, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'federatedml.nn.model_zoo.resnet'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/projects/fate/fate/python/federatedml/nn/backend/torch/serialization.py", line 80, in recover_sequential_from_dict
    layer, class_name = recover_layer_from_dict(nn_define_dict[k], nn_dict)
  File "/data/projects/fate/fate/python/federatedml/nn/backend/torch/serialization.py", line 45, in recover_layer_from_dict
    layer = layer.get_pytorch_model()
  File "/data/projects/fate/fate/python/federatedml/nn/backend/torch/cust.py", line 81, in get_pytorch_model
    return get_class(
  File "/data/projects/fate/fate/python/federatedml/nn/backend/torch/cust.py", line 21, in get_class
    nn_modules = importlib.import_module(
  File "/opt/python3/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 961, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 961, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 973, in _find_and_load_unlocked

During handling of the above exception, another exception occurred:

[...]

  File "/data/projects/fate/fate/python/federatedml/nn/homo/client.py", line 297, in fit
    self.trainer_inst, model, optimizer, loss_fn, extra_data = self.init()
  File "/data/projects/fate/fate/python/federatedml/nn/homo/client.py", line 211, in init
    model = s.recover_sequential_from_dict(self.nn_define)


Additional context

The ResNet model IS present in the model-zoo folder. I have no idea why fate_llm is even being checked for; it is not referenced in the example code.
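For context, the lookup order implied by the traceback can be sketched as follows. This is a reconstruction, not the actual cust.py code: the search-package list and function body here are assumptions based on the chained exceptions, which show an import attempt from federatedml.nn.model_zoo first and fate_llm.model_zoo second. It explains why the error surfaced to FATEBoard mentions fate_llm even when fate_llm is unrelated: the last package tried is the one that appears in the final ModuleNotFoundError.

```python
import importlib

# Hypothetical reconstruction of the lookup order suggested by the
# traceback; the real logic lives in federatedml/nn/backend/torch/cust.py.
SEARCH_PACKAGES = ["federatedml.nn.model_zoo", "fate_llm.model_zoo"]


def get_class(module_name: str):
    """Try each package in order; re-raise the last import failure."""
    last_err = None
    for pkg in SEARCH_PACKAGES:
        try:
            return importlib.import_module(f"{pkg}.{module_name}")
        except ModuleNotFoundError as e:
            last_err = e  # fall through to the next package
    # The error that propagates is the *last* one tried, so 'fate_llm'
    # shows up in the log even if the real problem is a missing model file.
    raise last_err


try:
    get_class("resnet")
except ModuleNotFoundError as e:
    print(e)  # mentions 'fate_llm' when neither package has the module
```

So a "No module named 'fate_llm...'" error on a worker usually just means the custom model file was not found under federatedml/nn/model_zoo on that worker either.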

owlet42 commented 10 months ago

Please ensure that cluster.yaml uses the following configuration:

algorithm: ALL
device: GPU
python:
  resources:
    requests:
      nvidia.com/gpu: 1
    limits:
      nvidia.com/gpu: 1
hari5g900 commented 10 months ago

EDIT: I get a similar module-not-found error for other examples that use custom models.

Current cluster configuration. Party 9999:

name: fate-9999
namespace: fate-9999
chartName: fate
chartVersion: v1.11.2
partyId: 9999
registry: ""
pullPolicy:
imagePullSecrets: 
- name: myregistrykey
persistence: false
istio:
  enabled: false
podSecurityPolicy:
  enabled: false
ingressClassName: nginx
modules:
  - rollsite
  - clustermanager
  - mysql
  - python
  - nodemanager
  - fateboard
  - client

computing: Eggroll
federation: Eggroll
storage: Eggroll
algorithm: ALL
device: GPU

ingress:
  fateboard: 
    hosts:
    - name: party9999.fateboard.example.com
  client:  
    hosts:
    - name: party9999.notebook.example.com

rollsite: 
  type: NodePort
  nodePort: 30091
  exchange:
    ip: 10.68.107.146
    port: 30000

python:
  type: NodePort
  httpNodePort: 30097
  grpcNodePort: 30092
  logLevel: INFO
  resources:
    requests:
      nvidia.com/gpu: 1
    limits:
      nvidia.com/gpu: 1

nodemanager:
  replicas: 1
  sessionProcessorsPerNode: 1
  resources:
    requests:
      cpu: "1"
      memory: "1Gi"
    limits:
      cpu: "2"
      memory: "4Gi"

Party-10000

name: fate-10000
namespace: fate-10000
chartName: fate
chartVersion: v1.11.2
partyId: 10000
registry: ""
pullPolicy:
imagePullSecrets: 
- name: myregistrykey
persistence: false
istio:
  enabled: false
podSecurityPolicy:
  enabled: false
ingressClassName: nginx
modules:
  - rollsite
  - clustermanager
  - mysql
  - python
  - nodemanager
  - fateboard
  - client

computing: Eggroll
federation: Eggroll
storage: Eggroll
algorithm: ALL
device: GPU

ingress:
  fateboard: 
    hosts:
    - name: party10000.fateboard.example.com
  client:  
    hosts:
    - name: party10000.notebook.example.com

rollsite: 
  type: NodePort
  nodePort: 30101
  exchange:
    ip: 10.68.107.146
    port: 30000

python:
  type: NodePort
  httpNodePort: 30107
  grpcNodePort: 30102
  logLevel: INFO
  resources:
    requests:
      nvidia.com/gpu: 1
    limits:
      nvidia.com/gpu: 1

nodemanager:
  replicas: 1
  sessionProcessorsPerNode: 1
  resources:
    requests:
      cpu: "1"
      memory: "1Gi"
    limits:
      cpu: "2"
      memory: "4Gi"

exchange:

name: fate-exchange
namespace: fate-exchange
chartName: fate-exchange
chartVersion: v1.11.2
partyId: 1
registry: ""
pullPolicy:
imagePullSecrets: 
- name: myregistrykey
persistence: false
istio:
  enabled: false
podSecurityPolicy:
  enabled: false
modules:
  - rollsite

rollsite: 
  type: NodePort
  nodePort: 30000
  enableTLS: false
  partyList:
  - partyId: 9999
    partyIp: 10.68.107.148
    partyPort: 30091
  - partyId: 10000
    partyIp: 10.68.107.106
    partyPort: 30101

The ResNet example still does not work. Error in the notebook:

2023-10-19 11:23:49.997 | INFO     | pipeline.utils.invoker.job_submitter:monitor_job_status:83 - Job id is 202310191123487920170

2023-10-19 11:23:50.022 | ERROR    | __main__:<module>:1 - An error has been caught in function '<module>', process 'MainProcess' (167), thread 'MainThread' (139873412183872):
Traceback (most recent call last):

  File "/opt/python3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
           │         │     └ {'__name__': '__main__', '__doc__': 'Entry point for launching an IPython kernel.\n\nThis is separate from the ipykernel pack...
           │         └ <code object <module> at 0x7f36cfc13710, file "/data/projects/python/venv/lib/python3.8/site-packages/ipykernel_launcher.py",...
           └ <function _run_code at 0x7f36cfbf3040>
  File "/opt/python3/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
         │     └ {'__name__': '__main__', '__doc__': 'Entry point for launching an IPython kernel.\n\nThis is separate from the ipykernel pack...
         └ <code object <module> at 0x7f36cfc13710, file "/data/projects/python/venv/lib/python3.8/site-packages/ipykernel_launcher.py",...
  File "/data/projects/python/venv/lib/python3.8/site-packages/ipykernel_launcher.py", line 17, in <module>
    app.launch_new_instance()
    │   └ <bound method Application.launch_instance of <class 'ipykernel.kernelapp.IPKernelApp'>>
    └ <module 'ipykernel.kernelapp' from '/data/projects/python/venv/lib/python3.8/site-packages/ipykernel/kernelapp.py'>
  File "/data/projects/python/venv/lib/python3.8/site-packages/traitlets/config/application.py", line 1043, in launch_instance
    app.start()
    │   └ <function IPKernelApp.start at 0x7f36c4c529d0>
    └ <ipykernel.kernelapp.IPKernelApp object at 0x7f36cfccdcd0>
  File "/data/projects/python/venv/lib/python3.8/site-packages/ipykernel/kernelapp.py", line 728, in start
    self.io_loop.start()
    │    │       └ <function BaseAsyncIOLoop.start at 0x7f36c4c0e670>
    │    └ <tornado.platform.asyncio.AsyncIOMainLoop object at 0x7f36c298d5b0>
    └ <ipykernel.kernelapp.IPKernelApp object at 0x7f36cfccdcd0>
  File "/data/projects/python/venv/lib/python3.8/site-packages/tornado/platform/asyncio.py", line 195, in start
    self.asyncio_loop.run_forever()
    │    │            └ <function BaseEventLoop.run_forever at 0x7f36c963aaf0>
    │    └ <_UnixSelectorEventLoop running=True closed=False debug=False>
    └ <tornado.platform.asyncio.AsyncIOMainLoop object at 0x7f36c298d5b0>
  File "/opt/python3/lib/python3.8/asyncio/base_events.py", line 570, in run_forever
    self._run_once()
    │    └ <function BaseEventLoop._run_once at 0x7f36c963d670>
    └ <_UnixSelectorEventLoop running=True closed=False debug=False>
  File "/opt/python3/lib/python3.8/asyncio/base_events.py", line 1859, in _run_once
    handle._run()
    │      └ <function Handle._run at 0x7f36c9c7e430>
    └ <Handle <TaskWakeupMethWrapper object at 0x7f3635f31910>(<Future finis...670>, ...],))>)>
  File "/opt/python3/lib/python3.8/asyncio/events.py", line 81, in _run
    self._context.run(self._callback, *self._args)
    │    │            │    │           │    └ <member '_args' of 'Handle' objects>
    │    │            │    │           └ <Handle <TaskWakeupMethWrapper object at 0x7f3635f31910>(<Future finis...670>, ...],))>)>
    │    │            │    └ <member '_callback' of 'Handle' objects>
    │    │            └ <Handle <TaskWakeupMethWrapper object at 0x7f3635f31910>(<Future finis...670>, ...],))>)>
    │    └ <member '_context' of 'Handle' objects>
    └ <Handle <TaskWakeupMethWrapper object at 0x7f3635f31910>(<Future finis...670>, ...],))>)>
  File "/data/projects/python/venv/lib/python3.8/site-packages/ipykernel/kernelbase.py", line 516, in dispatch_queue
    await self.process_one()
          │    └ <function Kernel.process_one at 0x7f36c55353a0>
          └ <ipykernel.ipkernel.IPythonKernel object at 0x7f36c298dc10>
  File "/data/projects/python/venv/lib/python3.8/site-packages/ipykernel/kernelbase.py", line 505, in process_one
    await dispatch(*args)
          │         └ ([<zmq.sugar.frame.Frame object at 0x7f36c0152d50>, <zmq.sugar.frame.Frame object at 0x7f36c0152ca0>, <zmq.sugar.frame.Frame ...
          └ <bound method Kernel.dispatch_shell of <ipykernel.ipkernel.IPythonKernel object at 0x7f36c298dc10>>
  File "/data/projects/python/venv/lib/python3.8/site-packages/ipykernel/kernelbase.py", line 412, in dispatch_shell
    await result
          └ <coroutine object Kernel.execute_request at 0x7f36c29b53c0>
  File "/data/projects/python/venv/lib/python3.8/site-packages/ipykernel/kernelbase.py", line 740, in execute_request
    reply_content = await reply_content
                          └ <coroutine object IPythonKernel.do_execute at 0x7f3635b23dc0>
  File "/data/projects/python/venv/lib/python3.8/site-packages/ipykernel/ipkernel.py", line 422, in do_execute
    res = shell.run_cell(
          │     └ <function ZMQInteractiveShell.run_cell at 0x7f36c4c430d0>
          └ <ipykernel.zmqshell.ZMQInteractiveShell object at 0x7f36c299b1c0>
  File "/data/projects/python/venv/lib/python3.8/site-packages/ipykernel/zmqshell.py", line 540, in run_cell
    return super().run_cell(*args, **kwargs)
                             │       └ {'store_history': True, 'silent': False, 'cell_id': None}
                             └ ('pipeline.fit() # submit pipeline here',)
  File "/data/projects/python/venv/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3009, in run_cell
    result = self._run_cell(
             │    └ <function InteractiveShell._run_cell at 0x7f36c6854c10>
             └ <ipykernel.zmqshell.ZMQInteractiveShell object at 0x7f36c299b1c0>
  File "/data/projects/python/venv/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3064, in _run_cell
    result = runner(coro)
             │      └ <coroutine object InteractiveShell.run_cell_async at 0x7f3635b23140>
             └ <function _pseudo_sync_runner at 0x7f36c68454c0>
  File "/data/projects/python/venv/lib/python3.8/site-packages/IPython/core/async_helpers.py", line 129, in _pseudo_sync_runner
    coro.send(None)
    │    └ <method 'send' of 'coroutine' objects>
    └ <coroutine object InteractiveShell.run_cell_async at 0x7f3635b23140>
  File "/data/projects/python/venv/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3269, in run_cell_async
    has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
                       │    │             │        │     └ '/tmp/ipykernel_167/4238494827.py'
                       │    │             │        └ [<_ast.Expr object at 0x7f364976b880>]
                       │    │             └ <_ast.Module object at 0x7f364976b7f0>
                       │    └ <function InteractiveShell.run_ast_nodes at 0x7f36c6854ee0>
                       └ <ipykernel.zmqshell.ZMQInteractiveShell object at 0x7f36c299b1c0>
  File "/data/projects/python/venv/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3448, in run_ast_nodes
    if await self.run_code(code, result, async_=asy):
             │    │        │     │              └ False
             │    │        │     └ <ExecutionResult object at 7f3649e3f100, execution_count=19 error_before_exec=None error_in_exec=None info=<ExecutionInfo obj...
             │    │        └ <code object <module> at 0x7f3649c177c0, file "/tmp/ipykernel_167/4238494827.py", line 1>
             │    └ <function InteractiveShell.run_code at 0x7f36c6854f70>
             └ <ipykernel.zmqshell.ZMQInteractiveShell object at 0x7f36c299b1c0>
  File "/data/projects/python/venv/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3508, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
         │         │    │               │    └ {'__name__': '__main__', '__doc__': 'Automatically created module for IPython interactive environment', '__package__': None, ...
         │         │    │               └ <ipykernel.zmqshell.ZMQInteractiveShell object at 0x7f36c299b1c0>
         │         │    └ <property object at 0x7f36c6844680>
         │         └ <ipykernel.zmqshell.ZMQInteractiveShell object at 0x7f36c299b1c0>
         └ <code object <module> at 0x7f3649c177c0, file "/tmp/ipykernel_167/4238494827.py", line 1>

> File "/tmp/ipykernel_167/4238494827.py", line 1, in <module>
    pipeline.fit() # submit pipeline here
    │        └ <function PipeLine.fit at 0x7f3649d20040>
    └ <pipeline.backend.pipeline.PipeLine object at 0x7f36214f2e50>

  File "/data/projects/fate/fate/python/fate_client/pipeline/backend/pipeline.py", line 585, in fit
    self._fit_status = self._job_invoker.monitor_job_status(self._train_job_id,
    │    │             │    │            │                  │    └ '202310191123487920170'
    │    │             │    │            │                  └ <pipeline.backend.pipeline.PipeLine object at 0x7f36214f2e50>
    │    │             │    │            └ <function JobInvoker.monitor_job_status at 0x7f3649d1a3a0>
    │    │             │    └ <pipeline.utils.invoker.job_submitter.JobInvoker object at 0x7f3649d23dc0>
    │    │             └ <pipeline.backend.pipeline.PipeLine object at 0x7f36214f2e50>
    │    └ None
    └ <pipeline.backend.pipeline.PipeLine object at 0x7f36214f2e50>

  File "/data/projects/fate/fate/python/fate_client/pipeline/utils/invoker/job_submitter.py", line 85, in monitor_job_status
    ret_code, ret_msg, data = self.query_job(job_id, role, party_id)
                              │    │         │       │     └ '10000'
                              │    │         │       └ 'guest'
                              │    │         └ '202310191123487920170'
                              │    └ <function JobInvoker.query_job at 0x7f3649d1a430>
                              └ <pipeline.utils.invoker.job_submitter.JobInvoker object at 0x7f3649d23dc0>

  File "/data/projects/fate/fate/python/fate_client/pipeline/utils/invoker/job_submitter.py", line 145, in query_job
    data = result["data"][0]
           └ {'data': [], 'retcode': 0, 'retmsg': 'no job could be found'}

IndexError: list index out of range

Error in FATEBoard of party 9999:

[ERROR] [2023-10-19 11:32:34,158] [202310191131369401360] [7397:140114593052480] - [task_executor._run_] [line:266]: No module named 'fate_llm.model_zoo.resnet'
Traceback (most recent call last):
  File "/data/projects/fate/fate/python/federatedml/nn/backend/torch/cust.py", line 75, in get_pytorch_model
    return get_class(
  File "/data/projects/fate/fate/python/federatedml/nn/backend/torch/cust.py", line 21, in get_class
    nn_modules = importlib.import_module(
  File "/opt/python3/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 973, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'federatedml.nn.model_zoo.resnet'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/projects/fate/fate/python/federatedml/nn/backend/torch/serialization.py", line 80, in recover_sequential_from_dict
    layer, class_name = recover_layer_from_dict(nn_define_dict[k], nn_dict)
  File "/data/projects/fate/fate/python/federatedml/nn/backend/torch/serialization.py", line 45, in recover_layer_from_dict
    layer = layer.get_pytorch_model()
  File "/data/projects/fate/fate/python/federatedml/nn/backend/torch/cust.py", line 81, in get_pytorch_model
    return get_class(
  File "/data/projects/fate/fate/python/federatedml/nn/backend/torch/cust.py", line 21, in get_class
    nn_modules = importlib.import_module(
  File "/opt/python3/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 973, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'fate_llm.model_zoo.resnet'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/projects/fate/fate/python/federatedml/nn/backend/torch/cust.py", line 75, in get_pytorch_model
    return get_class(
  File "/data/projects/fate/fate/python/federatedml/nn/backend/torch/cust.py", line 21, in get_class
    nn_modules = importlib.import_module(
  File "/opt/python3/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 973, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'federatedml.nn.model_zoo.resnet'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/projects/fate/fateflow/python/fate_flow/worker/task_executor.py", line 210, in _run_
    cpn_output = run_object.run(cpn_input)
  File "/data/projects/fate/fate/python/federatedml/model_base.py", line 239, in run
    self._run(cpn_input=cpn_input)
  File "/data/projects/fate/fate/python/federatedml/model_base.py", line 318, in _run
    this_data_output = func(*params)
  File "/data/projects/fate/fate/python/federatedml/nn/homo/client.py", line 297, in fit
    self.trainer_inst, model, optimizer, loss_fn, extra_data = self.init()
  File "/data/projects/fate/fate/python/federatedml/nn/homo/client.py", line 211, in init
    model = s.recover_sequential_from_dict(self.nn_define)
  File "/data/projects/fate/fate/python/federatedml/nn/backend/torch/serialization.py", line 86, in recover_sequential_from_dict
    layer, class_name = recover_layer_from_dict(v, nn_dict)
  File "/data/projects/fate/fate/python/federatedml/nn/backend/torch/serialization.py", line 45, in recover_layer_from_dict
    layer = layer.get_pytorch_model()
  File "/data/projects/fate/fate/python/federatedml/nn/backend/torch/cust.py", line 81, in get_pytorch_model
    return get_class(
  File "/data/projects/fate/fate/python/federatedml/nn/backend/torch/cust.py", line 21, in get_class
    nn_modules = importlib.import_module(
  File "/opt/python3/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 973, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'fate_llm.model_zoo.resnet'

No error logs in party 10000. The hetero-secureboost example works.

hari5g900 commented 10 months ago

Closing the issue as I found a fix.

All files created in the Jupyter notebooks are stored in the client-0 pod. However, jobs (pipeline.fit()) run in the fateflow container of the python-0 pod. The fateflow container has its own copies of models and datasets, so any newly created model files or downloaded data files must be copied into the python-0 pod manually. This is neither intuitive nor efficient.
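As a sketch, the manual copy can be done with kubectl cp. The pod names (client-0, python-0) and the target model-zoo path come from this thread's configs and tracebacks, but the container name and the source path inside the client pod are assumptions; verify them for your deployment with kubectl get pods and kubectl describe pod.

```shell
# Copy a custom model file from the client pod (where the notebook saved it)
# into the fateflow container of the python pod, for both parties.
# SRC is illustrative: point it at wherever your notebook wrote the file.
NS=fate-9999
SRC=/path/in/client/pod/resnet.py   # hypothetical source path
DST=/data/projects/fate/fate/python/federatedml/nn/model_zoo/resnet.py

kubectl -n "$NS" cp "client-0:$SRC" ./resnet.py
kubectl -n "$NS" cp ./resnet.py "python-0:$DST" -c python   # container name may differ
```

Downloaded datasets need the same treatment: they must land under examples/data inside the fateflow container, not just in the notebook pod, before pipeline.fit() can find them.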