FedML-AI / FedML


A bug encountered when using fed_cifar100 in centralized settings. #805

Open KuanKuanQAQ opened 1 year ago

KuanKuanQAQ commented 1 year ago

When using fed_cifar100 in centralized settings, I encountered a bug. In the file fedml/data/data_loader.py, line 559 constructs the test_data_local_dict in the following way:

test_data_local_dict = {
    0: [batch for cid in sorted(test_data_local_dict.keys()) for batch in test_data_local_dict[cid]]
}

However, in the file fedml/data/fed_cifar100/data_loader.py, only 100 clients have a local test set, while 500 clients have local training sets:

DEFAULT_TRAIN_CLIENTS_NUM = 500
DEFAULT_TEST_CLIENTS_NUM = 100

As a result, the dataloaders for every client beyond the first 100 (client IDs 100 to 499, zero-indexed) in the test_data_local_dict dictionary are None. Iterating over None in the list comprehension raises a TypeError.
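The failure can be reproduced without FedML. The sketch below uses plain lists in place of DataLoaders. The five-client dictionary and the helper name merge_all are made up for illustration, and only the shape of the comprehension is taken from the snippet above.

```python
# Stand-in for the per-client test loaders: only the first 2 of 5
# hypothetical clients have test data; the others are None.
test_data_local_dict = {0: ["b0", "b1"], 1: ["b2"], 2: None, 3: None, 4: None}


def merge_all(loaders):
    # Same shape as the comprehension quoted from fedml/data/data_loader.py.
    return [batch for cid in sorted(loaders.keys()) for batch in loaders[cid]]


try:
    merge_all(test_data_local_dict)
    error_message = None
except TypeError as exc:
    # Iterating over None is what fails inside the comprehension.
    error_message = str(exc)

print(error_message)
```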

Here is a temporary workaround, though a proper fix may be needed:

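# Merge the batches of all clients into a single entry under key 0,
# skipping clients that have no local test set.
tmp = {0: []}
for cid in sorted(test_data_local_dict.keys()):
    if test_data_local_dict[cid] is not None:
        for batch in test_data_local_dict[cid]:
            tmp[0].append(batch)
test_data_local_dict = tmp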
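The same idea can be written as a small reusable helper. This is only a sketch: merge_client_batches is a hypothetical name, not part of FedML, and plain lists stand in for the real DataLoader objects.

```python
from itertools import chain
from typing import Dict, Iterable, List, Optional, TypeVar

Batch = TypeVar("Batch")


def merge_client_batches(loaders: Dict[int, Optional[Iterable[Batch]]]) -> List[Batch]:
    """Concatenate batches in ascending client-ID order, skipping None loaders."""
    present = (loaders[cid] for cid in sorted(loaders) if loaders[cid] is not None)
    return list(chain.from_iterable(present))


# Hypothetical per-client data: clients 2 and 3 have no test set.
per_client = {0: ["b0", "b1"], 2: None, 1: ["b2"], 3: None}
merged = {0: merge_client_batches(per_client)}
print(merged)  # {0: ['b0', 'b1', 'b2']}
```

Filtering with `is not None` rather than a truthiness check avoids calling `len()` on a loader, which not every iterable supports.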

GivralNguyen commented 1 year ago

Hi, how were you able to run without the mqtt problem? Thanks.

KuanKuanQAQ commented 1 year ago

> Hi, how were you able to run without the mqtt problem? Thanks.

I don't need the MQTT-related functionality, and my GPU server can't stay online, so I removed the network connection check that runs when the program starts.

To do that, I commented out everything after line 86 in fedml/cli/env/collect_env.py.

Hope this helps.