FedML-AI / FedML

FEDML - The unified and scalable ML library for large-scale distributed training, model serving, and federated learning. FEDML Launch, a cross-cloud scheduler, further enables running any AI jobs on any GPU cloud or on-premise cluster. Built on this library, TensorOpera AI (https://TensorOpera.ai) is your generative AI platform at scale.
Apache License 2.0

macOS fails to run the MNIST one-line example #648

Open pizhn opened 1 year ago

pizhn commented 1 year ago

I'm trying to run python/examples/simulation/sp_fedavg_mnist_lr_example/torch_fedavg_mnist_lr_one_line_example.py. It loops forever in server_runner.py:bind_account_and_device_id, because the request to https://open.fedml.ai/fedmlOpsServer/edges/binding in client_runner.py:bind_account_and_device_id keeps returning a DATA_NO_EXIST_ERROR status code. Here is the request JSON sent to https://open.fedml.ai/fedmlOpsServer/edges/binding (personal information replaced with asterisks):

{
   "accountid":"f5b88f5dca344e6faf17809139b89c48",
   "deviceid":"0xC02YP6S1LVCF@MacOS.Edge.Simulator",
   "type":"MacOS",
   "processor":"x86_64",
   "core_type":"x86_64",
   "network":"",
   "role":"client",
   "os_ver":"macOS-12.5.1-x86_64-i386-64bit",
   "memory":"16.0G",
   "ip":"*.*.*.*",
   "extra_infos":{
      "fedml_ver":"0.7.344",
      "exec_path":"/Users/****/Library/Python/3.8/lib/python/site-packages/fedml/__init__.py",
      "os_ver":"macOS-12.5.1-x86_64-i386-64bit",
      "cpu_info":"x86_64",
      "python_ver":"3.8.9 (default, Apr 13 2022, 08:48:07) \n[Clang 13.1.6 (clang-1316.0.21.2.5)]",
      "torch_ver":"1.9.0",
      "mpi_installed":false,
      "cpu_sage":"17%",
      "available_mem":"5.6 G",
      "total_mem":"16.0G"
   },
   "gpu":"None"
}
pizhn commented 1 year ago

The cause seems to be the MLOps ID in your example fedml_config.yaml, which is not valid. At the very least this should raise an error instead of retrying forever; as it stands, it took me hours to locate the issue.
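
To illustrate the behavior I would expect, here is a minimal sketch of a binding loop that fails fast. It is not FedML's actual code: `bind_with_retry`, `BindingError`, and the `send_request` callable are hypothetical names, and I am assuming the response JSON carries the status in a `"code"` field with `"SUCCESS"` as the success value. Only `DATA_NO_EXIST_ERROR` is taken from what I observed.

```python
import time


class BindingError(RuntimeError):
    """Hypothetical error raised when device binding cannot succeed."""


def bind_with_retry(send_request, max_attempts=3, delay_s=1.0, sleep=time.sleep):
    """Call send_request() until it succeeds, with a bounded number of attempts.

    send_request is a hypothetical callable that performs the binding request
    and returns the decoded JSON response as a dict. The "code" field and the
    "SUCCESS" value are assumptions about the response format.
    """
    last_code = None
    for attempt in range(1, max_attempts + 1):
        response = send_request()
        last_code = response.get("code")
        if last_code == "SUCCESS":
            return response
        if last_code == "DATA_NO_EXIST_ERROR":
            # Retrying cannot help if the account ID does not exist,
            # so surface the problem immediately.
            raise BindingError(
                "binding rejected with DATA_NO_EXIST_ERROR; "
                "check the MLOps account ID in fedml_config.yaml"
            )
        if attempt < max_attempts:
            sleep(delay_s)  # wait before retrying a possibly transient failure
    raise BindingError(
        f"binding failed after {max_attempts} attempts (last code: {last_code})"
    )
```

With something like this, an invalid ID stops the run on the first response with a message pointing at the config, while other failures are still retried a few times before giving up.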