FedML-AI / FedML

FEDML - The unified and scalable ML library for large-scale distributed training, model serving, and federated learning. FEDML Launch, a cross-cloud scheduler, further enables running any AI jobs on any GPU cloud or on-premise cluster. Built on this library, TensorOpera AI (https://TensorOpera.ai) is your generative AI platform at scale.
Apache License 2.0

Error when running fedml.run_simulation() #2213

Closed: DuanYuFi closed this issue 3 months ago

DuanYuFi commented 4 months ago

Hi, I am trying to run the example, and there seems to be a PyTorch error in the function convert_numpy_to_torch_data_format in fedml/ml/engine/ml_engine_adapter.py#L9.

(fedml) root@iZ2ze7drohj43seb1x4uc2Z:~/codes/FedML# python main.py --cf fedml_config.yaml 
[FedML-Client @device-id-0] [Wed, 17 Jul 2024 02:11:18.889220714] [INFO] [__init__.py:164:init] args.rank = 0, args.worker_num = 10
[FedML-Client @device-id-0] [Wed, 17 Jul 2024 02:11:18.889699697] [INFO] [ml_engine_adapter.py:147:get_torch_device] args = <fedml.arguments.Arguments object at 0x7f71b045a070>, using_gpu = False, device_id = 0, device_type = cpu
[FedML-Client @device-id-0] [Wed, 17 Jul 2024 02:11:18.889811515] [INFO] [device.py:49:get_device] device = cpu
[FedML-Client @device-id-0] [Wed, 17 Jul 2024 02:11:18.889888048] [INFO] [data_loader.py:21:download_mnist] ./data/mnist/MNIST.zip
[FedML-Client @device-id-0] [Wed, 17 Jul 2024 02:11:18.889940023] [INFO] [data_loader.py:264:load_synthetic_data] load_data. dataset_name = mnist
[FedML-Client @device-id-0] [Wed, 17 Jul 2024 02:11:41.326432704] [INFO] [data_loader.py:123:load_partition_data_mnist] loading data...
[FedML-Client @device-id-0] [Wed, 17 Jul 2024 02:11:41.328559875] [ERROR] [mlops_runtime_log.py:125:handle_exception] Uncaught exception
Traceback (most recent call last):
  File "/root/codes/FedML/main.py", line 4, in <module>
    fedml.run_simulation()
  File "/root/anaconda3/envs/fedml/lib/python3.9/site-packages/fedml/launch_simulation.py", line 22, in run_simulation
    dataset, output_dim = fedml.data.load(args)
  File "/root/anaconda3/envs/fedml/lib/python3.9/site-packages/fedml/data/data_loader.py", line 235, in load
    return load_synthetic_data(args)
  File "/root/anaconda3/envs/fedml/lib/python3.9/site-packages/fedml/data/data_loader.py", line 275, in load_synthetic_data
    ) = load_partition_data_mnist(
  File "/root/anaconda3/envs/fedml/lib/python3.9/site-packages/fedml/data/MNIST/data_loader.py", line 132, in load_partition_data_mnist
    train_batch = batch_data(args, train_data[u], batch_size)
  File "/root/anaconda3/envs/fedml/lib/python3.9/site-packages/fedml/data/MNIST/data_loader.py", line 96, in batch_data
    batched_x, batched_y = ml_engine_adapter.convert_numpy_to_ml_engine_data_format(args, batched_x, batched_y)
  File "/root/anaconda3/envs/fedml/lib/python3.9/site-packages/fedml/ml/engine/ml_engine_adapter.py", line 74, in convert_numpy_to_ml_engine_data_format
    return convert_numpy_to_torch_data_format(args, batched_x, batched_y)
  File "/root/anaconda3/envs/fedml/lib/python3.9/site-packages/fedml/ml/engine/ml_engine_adapter.py", line 16, in convert_numpy_to_torch_data_format
    batched_x = torch.from_numpy(np.asarray(batched_x)).float()  # LR_MINST or other
TypeError: expected np.ndarray (got numpy.ndarray)

Could this be related to the numpy version (mine is 1.26.4)? I tried the solutions here, but they didn't work.

BTW, FedML cannot run with numpy 2.0, because somewhere the code uses numpy.float_, which was removed in numpy 2.0.
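For anyone hitting the numpy.float_ problem, here is a minimal sketch of a fallback that works on both NumPy generations. It is not FedML's code: resolve_float_alias is a hypothetical helper, and the two SimpleNamespace objects are stand-ins for a NumPy 1.x and a NumPy 2.x module so the snippet runs without NumPy installed.

```python
from types import SimpleNamespace


def resolve_float_alias(np_module):
    """Return np_module.float_ where it still exists, else np_module.float64."""
    # NumPy 1.x exposes float_ as an alias of float64. NumPy 2.0 removed it,
    # so the attribute lookup fails and getattr falls back to the default.
    return getattr(np_module, "float_", np_module.float64)


# Stand-ins for the two NumPy generations (hypothetical objects, not real NumPy).
numpy1_like = SimpleNamespace(float_="float64", float64="float64")
numpy2_like = SimpleNamespace(float64="float64")

alias_on_numpy1 = resolve_float_alias(numpy1_like)
alias_on_numpy2 = resolve_float_alias(numpy2_like)
```

With real NumPy imported as np, the same helper would be called as resolve_float_alias(np). Simply replacing numpy.float_ with numpy.float64 at the call site is the more direct change.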

DuanYuFi commented 3 months ago

Downgrading PyTorch to 2.0.1 solved it.
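To compare an environment against the combination reported in this thread (numpy 1.26.4 with PyTorch 2.0.1), a small diagnostic like the following may help. It is only a sketch using the standard library: installed_versions and major_version are hypothetical helper names, and the package list assumes the usual distribution names numpy, torch, and fedml.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("numpy", "torch", "fedml")):
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


def major_version(version_string):
    """Extract the leading major number from a version such as '2.0.1+cu118'."""
    return int(version_string.split(".")[0])


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```

Running it before and after changing the PyTorch version shows which pair of versions is actually active in the environment. This thread does not establish why the downgrade helps, so treat the output as a starting point for comparison only.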