TsingZ0 / PFLlib

37 traditional FL (tFL) or personalized FL (pFL) algorithms, 3 scenarios, and 20 datasets.
GNU General Public License v2.0
1.45k stars 300 forks

Question about speeding up training #199

Closed JACKSON1077 closed 2 months ago

JACKSON1077 commented 3 months ago

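Many thanks to TsingZ for building and open-sourcing PFLlib; I know a lot of work has already gone into optimizing its resource usage. How can I speed up my model training? Besides changing local_batch_size=10, setting the DataLoader's num_workers, and using torch.nn.DataParallel(), are there any other methods or hyperparameters I can change?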

By the way, on a multi-GPU server, os.environ["CUDA_VISIBLE_DEVICES"] = args.device_id in main.py does not switch to my other GPUs. Even when I specify args.device_id=1, it still runs on GPU 0 by default. Have you run into this problem?
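
One thing worth checking here, offered as a general CUDA behavior rather than a confirmed diagnosis for this issue: CUDA_VISIBLE_DEVICES is read once, when CUDA is first initialized in the process, so it has to be set (as a string) before anything touches torch.cuda. The visible cards are then renumbered from 0 inside the process, so a job restricted to physical GPU 1 still reports its device as cuda:0. The sketch below uses only the standard library; select_gpus and parse_args are hypothetical helpers, and the --device_id long option name is an assumption modeled on the -did flag used in this thread.

```python
import argparse
import os


def select_gpus(device_id: str) -> str:
    """Restrict this process to the given GPU id(s), e.g. "1" or "0,1"."""
    # The value must be a string; CUDA reads it only when it is first initialized.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(device_id)
    return os.environ["CUDA_VISIBLE_DEVICES"]


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Hypothetical stand-in for the -did option seen in this thread.
    parser.add_argument("-did", "--device_id", type=str, default="0")
    return parser.parse_args(argv)


args = parse_args(["-did", "1"])
visible = select_gpus(args.device_id)

# Import torch, and any module that calls torch.cuda, only after this point.
# Inside the process the selected card is then addressed as "cuda:0".
print(visible)
```

Checking nvidia-smi while the job runs shows which physical card is actually in use.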

chenzhigang00 commented 3 months ago

For the first question, one approach is to set local_batch_size to train_samples (or test_samples for evaluation), which corresponds to $B=\infty$ in the paper “Communication-Efficient Learning of Deep Networks from Decentralized Data” (“for MNIST processing all 600 client examples as a single batch per round”).

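Each client then trains and tests on all of its own samples as a single batch, which is much faster. The same effect can be achieved by running the following command:

python main.py -data MNIST -m cnn -algo FedAvg -gr 1000 -did 0 -lr 0.1 -ls 1 -jr 0.1 -nc 100 -lbs -1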

Code changes: in clientbase.py, define a global variable flag = False and modify the following three functions so that passing -1 for -lbs (local batch size) on the command line speeds up training:

  1. __init__ function
    def __init__(self, args, id, train_samples, test_samples, **kwargs):
        global flag

        torch.manual_seed(0)
        self.model = copy.deepcopy(args.model)
        self.algorithm = args.algorithm
        self.dataset = args.dataset
        self.device = args.device
        self.id = id  # integer
        self.save_folder_name = args.save_folder_name

        self.num_classes = args.num_classes
        self.train_samples = train_samples
        self.test_samples = test_samples
        self.batch_size = args.batch_size
        if args.batch_size == -1:
            flag = True
        self.learning_rate = args.local_learning_rate
        self.local_epochs = args.local_epochs

        # check BatchNorm
        self.has_BatchNorm = False
        for layer in self.model.children():
            if isinstance(layer, nn.BatchNorm2d):
                self.has_BatchNorm = True
                break

        self.train_slow = kwargs['train_slow']
        self.send_slow = kwargs['send_slow']
        self.train_time_cost = {'num_rounds': 0, 'total_cost': 0.0}
        self.send_time_cost = {'num_rounds': 0, 'total_cost': 0.0}

        self.loss = nn.CrossEntropyLoss()
        self.optimizer = torch.optim.SGD(self.model.parameters(), lr=self.learning_rate)
        self.learning_rate_scheduler = torch.optim.lr_scheduler.ExponentialLR(
            optimizer=self.optimizer, 
            gamma=args.learning_rate_decay_gamma
        )
        self.learning_rate_decay = args.learning_rate_decay
  2. load_train_data function
    def load_train_data(self, batch_size=None):
        global flag
        if batch_size is None:
            batch_size = self.batch_size
        if flag:   # full-batch training: use all of this client's training samples
            batch_size = self.train_samples
        train_data = read_client_data(self.dataset, self.id, is_train=True)
        return DataLoader(train_data, batch_size, drop_last=True, shuffle=True)
  3. load_test_data function
    def load_test_data(self, batch_size=None):
        global flag
        if batch_size is None:
            batch_size = self.batch_size
        if flag:   # full-batch evaluation: use all of this client's test samples
            batch_size = self.test_samples
        test_data = read_client_data(self.dataset, self.id, is_train=False)
        return DataLoader(test_data, batch_size, drop_last=False, shuffle=True)

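
The three changes above route the -1 setting through a module-level flag that every client shares. As a possible alternative, and only a sketch rather than code from PFLlib, the same rule can be written as a small pure function that each loader calls with its own sample count. The name resolve_batch_size is hypothetical.

```python
def resolve_batch_size(requested: int, num_samples: int) -> int:
    """Return the batch size a client should use.

    A requested value of -1 means "use the whole local dataset as one batch"
    (the B = infinity setting); any other value must be a positive integer.
    """
    if requested == -1:
        return num_samples
    if requested <= 0:
        raise ValueError(f"batch size must be positive or -1, got {requested}")
    return requested


# A client holding 600 training samples:
print(resolve_batch_size(-1, 600))  # whole dataset in one batch
print(resolve_batch_size(10, 600))  # ordinary mini-batches
```

With this shape, load_train_data would pass self.train_samples and load_test_data would pass self.test_samples, and no global state is needed.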
chenzhigang00 commented 3 months ago

But I am not sure whether the global variable batch_size in generate_MNIST.py also needs to be modified.

JACKSON1077 commented 3 months ago

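Your method speeds things up by changing local_batch_size: it feeds each client's entire dataset into the model at once for training (or testing). That is no different from enlarging local_batch_size (e.g. local_batch_size=64 or 128), which also speeds up training, but a more commonly used setting is local_batch_size=10.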

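As for batch_size in generate_MNIST.py, my understanding is that it does not need to change. Its purpose there is to load the whole training and test sets (without splitting them into batches), merge them, and then call separate_data() to partition the data.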

916906445 commented 2 months ago

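> Many thanks to TsingZ for building and open-sourcing PFLlib; I know a lot of work has already gone into optimizing its resource usage. How can I speed up my model training? Besides changing local_batch_size=10, setting the DataLoader's num_workers, and using torch.nn.DataParallel(), are there any other methods or hyperparameters I can change?
>
> By the way, on a multi-GPU server, os.environ["CUDA_VISIBLE_DEVICES"] = args.device_id in main.py does not switch to my other GPUs. Even when I specify args.device_id=1, it still runs on GPU 0 by default. Have you run into this problem?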

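I'd like to ask about the second question: how can I modify the code to train on multiple GPUs at the same time? When I specify args.device_id=0,1, it still runs only on the first GPU (id 0). My case may differ a bit from yours: training runs on the single id I pass in, but never on multiple GPUs simultaneously. How can this be done?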

TsingZ0 commented 2 months ago

> But I am not sure whether the global variable batch_size in generate_MNIST.py also needs to be modified.

If you reduce the value of batch_size after data has already been assigned to clients, there's no need to modify the batch_size variable in utils/dataset_utils.py. Otherwise, it is recommended to modify the batch_size variable in utils/dataset_utils.py, as it determines the least_samples variable. If the data distributed to each client is sufficient, the least_samples value may not be a concern. However, if num_clients is set too large, least_samples becomes important, as we set drop_last=True for the trainloader. The trainloader might be empty if the client's data is smaller than a single batch.
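
To make the drop_last point concrete, here is a minimal sketch of how many batches a loader yields per epoch for a map-style dataset. It uses plain arithmetic instead of a real DataLoader, and num_batches is a hypothetical helper, not a PFLlib function.

```python
def num_batches(num_samples: int, batch_size: int, drop_last: bool) -> int:
    """Batches per epoch for num_samples items split into batch_size chunks."""
    full, remainder = divmod(num_samples, batch_size)
    if drop_last or remainder == 0:
        return full      # the trailing partial batch is discarded or absent
    return full + 1      # the trailing partial batch is kept


# A client with plenty of data: 600 samples, batch size 10.
print(num_batches(600, 10, drop_last=True))
# A client with fewer samples than one batch: the train loader is empty.
print(num_batches(8, 10, drop_last=True))
# The same client without drop_last keeps its single partial batch.
print(num_batches(8, 10, drop_last=False))
```

This is why least_samples matters when num_clients is large: a client that receives fewer samples than one batch would contribute no training steps at all.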

TsingZ0 commented 2 months ago

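> I'd like to ask about the second question: how can I modify the code to train on multiple GPUs at the same time? When I specify args.device_id=0,1, it still runs only on the first GPU (id 0). My case may differ a bit from yours: training runs on the single id I pass in, but never on multiple GPUs simultaneously. How can this be done?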

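Multi-GPU parallelism in PFLlib depends on whether the model you use supports multiple GPUs itself. If it does, for example Stable Diffusion from HuggingFace, you only need to set args.device_id=0,1 and PFLlib will train on multiple GPUs automatically.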

Put another way, you only need to modify the model's code so that the model itself supports parallelism.