FedML-AI / FedML

FEDML - The unified and scalable ML library for large-scale distributed training, model serving, and federated learning. FEDML Launch, a cross-cloud scheduler, further enables running any AI jobs on any GPU cloud or on-premise cluster. Built on this library, TensorOpera AI (https://TensorOpera.ai) is your generative AI platform at scale.
Apache License 2.0

New bug encountered when running a server in Cross_silo #812

Open 35MAJN opened 1 year ago

35MAJN commented 1 year ago
[FedML-Server(0) @device-id-0] [Mon, 13 Mar 2023 21:38:39] [ERROR] [mlops_runtime_log.py:34:handle_exception] Uncaught exception
Traceback (most recent call last):
  File "/content/FedML/python/examples/cross_silo/mqtt_s3_fedavg_hierarchical_mnist_lr_example/main_fedml_cross_silo_hi.py", line 17, in <module>
    fedml_runner = FedMLRunner(args, device, dataset, model)
  File "/usr/local/lib/python3.9/dist-packages/fedml/runner.py", line 40, in __init__
    self.runner = init_runner_func(
  File "/usr/local/lib/python3.9/dist-packages/fedml/runner.py", line 98, in _init_cross_silo_runner
    runner = Server(
  File "/usr/local/lib/python3.9/dist-packages/fedml/cross_silo/fedml_server.py", line 19, in __init__
    server_initializer.init_server(
  File "/usr/local/lib/python3.9/dist-packages/fedml/cross_silo/server/server_initializer.py", line 42, in init_server
    server_manager.run()
  File "/usr/local/lib/python3.9/dist-packages/fedml/cross_silo/server/fedml_server_manager.py", line 32, in run
    super().run()
  File "/usr/local/lib/python3.9/dist-packages/fedml/core/distributed/fedml_comm_manager.py", line 28, in run
    self.com_manager.handle_receive_message()
  File "/usr/local/lib/python3.9/dist-packages/fedml/core/distributed/communication/mqtt_s3/mqtt_s3_multi_clients_comm_manager.py", line 374, in handle_receive_message
    self.run_loop_forever()
  File "/usr/local/lib/python3.9/dist-packages/fedml/core/distributed/communication/mqtt_s3/mqtt_s3_multi_clients_comm_manager.py", line 128, in run_loop_forever
    self.mqtt_mgr.loop_forever()
  File "/usr/local/lib/python3.9/dist-packages/fedml/core/distributed/communication/mqtt/mqtt_manager.py", line 83, in loop_forever
    self._client.loop_forever(retry_first_connection=True)
  File "/usr/local/lib/python3.9/dist-packages/paho/mqtt/client.py", line 1756, in loop_forever
    rc = self._loop(timeout)
  File "/usr/local/lib/python3.9/dist-packages/paho/mqtt/client.py", line 1164, in _loop
    rc = self.loop_read()
  File "/usr/local/lib/python3.9/dist-packages/paho/mqtt/client.py", line 1556, in loop_read
    rc = self._packet_read()
  File "/usr/local/lib/python3.9/dist-packages/paho/mqtt/client.py", line 2439, in _packet_read
    rc = self._packet_handle()
  File "/usr/local/lib/python3.9/dist-packages/paho/mqtt/client.py", line 3037, in _packet_handle
    return self._handle_pubrel()
  File "/usr/local/lib/python3.9/dist-packages/paho/mqtt/client.py", line 3356, in _handle_pubrel
    self._handle_on_message(message)
  File "/usr/local/lib/python3.9/dist-packages/paho/mqtt/client.py", line 3570, in _handle_on_message
    on_message(self, self._userdata, message)
  File "/usr/local/lib/python3.9/dist-packages/fedml/core/distributed/communication/mqtt/mqtt_manager.py", line 134, in on_message
    passthrough_listener(msg)
  File "/usr/local/lib/python3.9/dist-packages/fedml/core/distributed/communication/mqtt_s3/mqtt_s3_multi_clients_comm_manager.py", line 268, in _on_message
    self._on_message_impl(msg)
  File "/usr/local/lib/python3.9/dist-packages/fedml/core/distributed/communication/mqtt_s3/mqtt_s3_multi_clients_comm_manager.py", line 265, in _on_message_impl
    self._notify(payload_obj)
  File "/usr/local/lib/python3.9/dist-packages/fedml/core/distributed/communication/mqtt_s3/mqtt_s3_multi_clients_comm_manager.py", line 217, in _notify
    observer.receive_message(msg_type, msg_params)
  File "/usr/local/lib/python3.9/dist-packages/fedml/core/distributed/fedml_comm_manager.py", line 45, in receive_message
    handler_callback_func(msg_params)
  File "/usr/local/lib/python3.9/dist-packages/fedml/cross_silo/server/fedml_server_manager.py", line 120, in handle_message_receive_model_from_client
    self.aggregator.add_local_trained_result(
  File "/usr/local/lib/python3.9/dist-packages/fedml/cross_silo/server/fedml_aggregator.py", line 62, in add_local_trained_result
    model_params = ml_engine_adapter.model_params_to_device(self.args, model_params, self.device)
  File "/usr/local/lib/python3.9/dist-packages/fedml/ml/engine/ml_engine_adapter.py", line 251, in model_params_to_device
    for key in params_obj.keys():
AttributeError: 'tuple' object has no attribute 'keys'

fedml-dimitris commented 1 year ago

@35MAJN Can you provide some pointers to reproduce the error?
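
As a starting point for narrowing this down: the last frame of the traceback shows `model_params_to_device` calling `params_obj.keys()`, so the object that reached the aggregator was a tuple rather than a dict-like set of model parameters. Why a tuple arrived is not visible from the log. The sketch below is not part of FedML; `describe_params` and `require_mapping` are hypothetical helper names, and it assumes the parameters are expected to be a dict-like object such as a PyTorch `state_dict`. Logging the result on the client just before the model is sent, or on the server just before aggregation, should show what the tuple actually contains.

```python
from collections.abc import Mapping


def describe_params(params_obj):
    """Return a short, loggable description of a model-parameter payload."""
    if isinstance(params_obj, Mapping):
        return f"mapping with {len(params_obj)} keys"
    if isinstance(params_obj, (tuple, list)):
        # List the element types so it is clear what was bundled together.
        inner = ", ".join(type(item).__name__ for item in params_obj)
        return f"{type(params_obj).__name__} of ({inner})"
    return type(params_obj).__name__


def require_mapping(params_obj):
    """Fail early with a descriptive TypeError instead of a bare AttributeError.

    Hypothetical check, assuming parameters should be dict-like.
    """
    if not isinstance(params_obj, Mapping):
        raise TypeError(
            "expected a dict-like of model parameters, got "
            + describe_params(params_obj)
        )
    return params_obj


if __name__ == "__main__":
    # A dict-like payload passes through unchanged.
    print(describe_params(require_mapping({"linear.weight": [0.0], "linear.bias": [0.0]})))

    # A tuple payload is reported with its element types.
    try:
        require_mapping(({"linear.weight": [0.0]}, 10))
    except TypeError as err:
        print(err)
```

If the description shows a tuple whose first element is a mapping, sharing the client-side code that builds the message, together with the config file and the FedML version, would make the error much easier to reproduce.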