Question about speeding up training

JACKSON1077 commented 3 months ago

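Many thanks to TsingZ for contributing and open-sourcing PFLlib. I know you have already done good work optimizing resource overhead. I'd like to ask how I can speed up my model training. Besides changing local_batch_size=10, setting the dataloader's num_workers, and using torch.nn.DataParallel(), are there any other methods or hyperparameters I can change?

By the way, on a multi-GPU server, os.environ["CUDA_VISIBLE_DEVICES"] = args.device_id in main.py does not switch to my other GPUs. Even when I specify args.device_id=1, it still runs on GPU 0 by default. Have you run into this problem?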

chenzhigang00 commented 3 months ago

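For the first question, one approach is to set local_batch_size to train_samples or test_samples, which corresponds to $B=\infty$ in the paper "Communication-Efficient Learning of Deep Networks from Decentralized Data" ("for MNIST processing all 600 client examples as a single batch per round").

Each client then trains/tests with its entire local dataset as a single batch, which is very fast. The same effect can be achieved by passing the following arguments (with the code changes described below): python main.py -data MNIST -m cnn -algo FedAvg -gr 1000 -did 0 -lr 0.1 -ls 1 -jr 0.1 -nc 100 -lbs -1

Code changes: in clientbase.py, define a module-level global variable flag = False, and modify the following three functions so that passing -1 for the -lbs (local batch size) command-line argument speeds up training:

__init__ function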

    def __init__(self, args, id, train_samples, test_samples, **kwargs):
        global flag

        torch.manual_seed(0)
        self.model = copy.deepcopy(args.model)
        self.algorithm = args.algorithm
        self.dataset = args.dataset
        self.device = args.device
        self.id = id  # integer
        self.save_folder_name = args.save_folder_name

        self.num_classes = args.num_classes
        self.train_samples = train_samples
        self.test_samples = test_samples
        self.batch_size = args.batch_size
        if args.batch_size == -1:
            flag = True
        self.learning_rate = args.local_learning_rate
        self.local_epochs = args.local_epochs

        # check BatchNorm
        self.has_BatchNorm = False
        for layer in self.model.children():
            if isinstance(layer, nn.BatchNorm2d):
                self.has_BatchNorm = True
                break

        self.train_slow = kwargs['train_slow']
        self.send_slow = kwargs['send_slow']
        self.train_time_cost = {'num_rounds': 0, 'total_cost': 0.0}
        self.send_time_cost = {'num_rounds': 0, 'total_cost': 0.0}

        self.loss = nn.CrossEntropyLoss()
        self.optimizer = torch.optim.SGD(self.model.parameters(), lr=self.learning_rate)
        self.learning_rate_scheduler = torch.optim.lr_scheduler.ExponentialLR(
            optimizer=self.optimizer, 
            gamma=args.learning_rate_decay_gamma
        )
        self.learning_rate_decay = args.learning_rate_decay

load_train_data function

    def load_train_data(self, batch_size=None):
        global flag
        if batch_size is None:
            batch_size = self.batch_size
        if flag:   # full-batch training: use the whole local dataset as one batch
            batch_size = self.train_samples
        train_data = read_client_data(self.dataset, self.id, is_train=True)
        return DataLoader(train_data, batch_size, drop_last=True, shuffle=True)

load_test_data function

    def load_test_data(self, batch_size=None):
        global flag
        if batch_size is None:
            batch_size = self.batch_size
        if flag:
            batch_size = self.test_samples
        test_data = read_client_data(self.dataset, self.id, is_train=False)
        return DataLoader(test_data, batch_size, drop_last=False, shuffle=True)
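
A possible alternative to the module-level flag, sketched under the assumption that -1 is the only sentinel value: resolve the batch size on each call from the requested value and the client's sample count. The helper name effective_batch_size is hypothetical and not part of PFLlib.

```python
def effective_batch_size(requested: int, num_samples: int) -> int:
    """Return the batch size to hand to a DataLoader.

    requested == -1 is treated as "use the whole local dataset as one batch"
    (the B = infinity setting); any other value must be a positive integer.
    """
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    if requested == -1:
        return num_samples
    if requested <= 0:
        raise ValueError("batch size must be positive or -1")
    return requested


# Hypothetical client with 600 training samples.
full_batch = effective_batch_size(-1, 600)  # whole dataset in one batch
mini_batch = effective_batch_size(10, 600)  # ordinary mini-batch
```

Inside load_train_data this could replace the flag check, for example batch_size = effective_batch_size(batch_size, self.train_samples), so that no state is shared between clients.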

chenzhigang00 commented 3 months ago

But I am not quite sure whether to modify the global variable batch_size in the file generate_MNIST.py.

JACKSON1077 commented 3 months ago

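Your method speeds things up by changing local_batch_size: it feeds each client's entire dataset into the model at once for training (or testing). This is no different from simply increasing local_batch_size (for example, local_batch_size=64 or 128); both speed up training. However, the more common setting is local_batch_size=10.

As for batch_size in generate_MNIST.py, my understanding is that it does not need to be changed. Its purpose there is to load the whole training set and test set (without splitting them into batches), merge them, and then call separate_data() to partition the data.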
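
To make the comparison concrete, here is a small sketch of how many optimizer steps one local epoch takes at different batch sizes. It assumes a client with 600 training samples (the MNIST figure quoted earlier) and drop_last=True as in load_train_data; the function name steps_per_epoch is only illustrative.

```python
import math


def steps_per_epoch(num_samples: int, batch_size: int, drop_last: bool = True) -> int:
    """Number of batches a loader yields per epoch."""
    if drop_last:
        # The incomplete final batch is discarded.
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)


num_samples = 600  # assumed client dataset size
steps = {bs: steps_per_epoch(num_samples, bs) for bs in (10, 64, 128, 600)}
# steps == {10: 60, 64: 9, 128: 4, 600: 1}
```

Fewer, larger steps usually mean less per-batch overhead on a GPU, but changing the batch size can also change convergence behaviour, so results may not be directly comparable with the common local_batch_size=10 setting.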

916906445 commented 2 months ago

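> Many thanks to TsingZ for contributing and open-sourcing PFLlib. I know you have already done good work optimizing resource overhead. I'd like to ask how I can speed up my model training. Besides changing local_batch_size=10, setting the dataloader's num_workers, and using torch.nn.DataParallel(), are there any other methods or hyperparameters I can change?
>
> By the way, on a multi-GPU server, os.environ["CUDA_VISIBLE_DEVICES"] = args.device_id in main.py does not switch to my other GPUs. Even when I specify args.device_id=1, it still runs on GPU 0 by default. Have you run into this problem?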

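I'd like to ask about the second question: how can I modify the code to train on multiple GPUs at the same time? When I specify args.device_id=0,1, it still runs on the first GPU (GPU 0) by default. My case may differ a bit from yours: training runs on one of the IDs I pass in, but never on multiple GPUs simultaneously. How can this be achieved?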

TsingZ0 commented 2 months ago

> But I am not quite sure whether to modify the global variable batch_size in the file generate_MNIST.py

If you reduce the value of batch_size after data has already been assigned to clients, there's no need to modify the batch_size variable in utils/dataset_utils.py. Otherwise, it is recommended to modify the batch_size variable in utils/dataset_utils.py, as it determines the least_samples variable. If the data distributed to each client is sufficient, the least_samples value may not be a concern. However, if num_clients is set too large, least_samples becomes important, as we set drop_last=True for the trainloader. The trainloader might be empty if the client's data is smaller than a single batch.
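
The empty-trainloader case can be checked ahead of time. The sketch below assumes you already have the per-client training set sizes in a list; the helper name and the sizes are made up for illustration and are not taken from PFLlib.

```python
def clients_with_empty_trainloader(train_sizes, batch_size):
    """IDs of clients whose train loader yields no batch when drop_last=True.

    With drop_last=True a loader produces num_samples // batch_size batches,
    so any client holding fewer samples than one batch gets zero batches.
    """
    return [cid for cid, n in enumerate(train_sizes) if n // batch_size == 0]


train_sizes = [120, 7, 45, 9]  # made-up per-client training set sizes
empty = clients_with_empty_trainloader(train_sizes, batch_size=10)
# empty == [1, 3]: clients 1 and 3 would never take a training step
```

If this list is non-empty, either the batch size is too large for the current partition or the data should be regenerated with a larger minimum number of samples per client.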

TsingZ0 commented 2 months ago

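> I'd like to ask about the second question: how can I modify the code to train on multiple GPUs at the same time? When I specify args.device_id=0,1, it still runs on the first GPU (GPU 0) by default. My case may differ a bit from yours: training runs on one of the IDs I pass in, but never on multiple GPUs simultaneously. How can this be achieved?

Multi-GPU parallelism in PFLlib depends on whether the model you use supports multiple GPUs itself. If the model supports multi-GPU on its own, for example StableDiffusion from HuggingFace, you only need to set args.device_id=0,1 and PFLlib will support multi-GPU training automatically.

Put another way, you only need to modify the model's code so that the model supports parallel execution.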
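
For both device_id questions in this thread, it may help to recall how CUDA_VISIBLE_DEVICES behaves in general: it has to be set before CUDA is initialized in the process, and the GPUs it exposes are renumbered from 0 inside that process, so physical GPU 1 shows up as cuda:0. The sketch below illustrates only this masking and renumbering. The helper name mask_gpus is hypothetical, and this is not a description of how PFLlib itself handles devices.

```python
import os


def mask_gpus(device_id: str) -> int:
    """Expose only the listed physical GPUs to this process.

    Must be called before anything initializes CUDA; setting the variable
    afterwards has no effect on the already-initialized runtime.
    Returns the number of GPU IDs that were requested.
    """
    ids = [part.strip() for part in device_id.split(",") if part.strip()]
    if not ids:
        raise ValueError("device_id must list at least one GPU index")
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(ids)
    return len(ids)


# Request physical GPU 1 only.
n_visible = mask_gpus("1")
# Inside this process the visible GPUs are renumbered starting at 0,
# so physical GPU 1 is addressed as "cuda:0" from here on.
local_devices = [f"cuda:{i}" for i in range(n_visible)]
```

Exposing two GPUs this way (for example mask_gpus("0,1")) only makes them visible; as noted above, the model still has to be written to use more than one of them.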

TsingZ0 / PFLlib

Question about speeding up training #199