Closed JACKSON1077 closed 2 months ago
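For the first issue, one approach is to set local_batch_size = train_samples or test_samples, which corresponds to $B=\infty$ in the paper "Communication-Efficient Learning of Deep Networks from Decentralized Data" ("for MNIST processing all 600 client examples as a single batch per round"). Each client then trains/tests on all of its own data as a single batch, which is much faster.
The same effect can be achieved with the following arguments: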
python main.py -data MNIST -m cnn -algo FedAvg -gr 1000 -did 0 -lr 0.1 -ls 1 -jr 0.1 -nc 100 -lbs -1
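Code changes: in clientbase.py, define a module-level global variable flag = False, and modify the following three functions so that passing -1 for the -lbs (local batch size) argument on the command line speeds up training: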
    def __init__(self, args, id, train_samples, test_samples, **kwargs):
        global flag
        torch.manual_seed(0)
        self.model = copy.deepcopy(args.model)
        self.algorithm = args.algorithm
        self.dataset = args.dataset
        self.device = args.device
        self.id = id  # integer
        self.save_folder_name = args.save_folder_name
        self.num_classes = args.num_classes
        self.train_samples = train_samples
        self.test_samples = test_samples
        self.batch_size = args.batch_size
        if args.batch_size == -1:
            flag = True
        self.learning_rate = args.local_learning_rate
        self.local_epochs = args.local_epochs
        # check BatchNorm
        self.has_BatchNorm = False
        for layer in self.model.children():
            if isinstance(layer, nn.BatchNorm2d):
                self.has_BatchNorm = True
                break
        self.train_slow = kwargs['train_slow']
        self.send_slow = kwargs['send_slow']
        self.train_time_cost = {'num_rounds': 0, 'total_cost': 0.0}
        self.send_time_cost = {'num_rounds': 0, 'total_cost': 0.0}
        self.loss = nn.CrossEntropyLoss()
        self.optimizer = torch.optim.SGD(self.model.parameters(), lr=self.learning_rate)
        self.learning_rate_scheduler = torch.optim.lr_scheduler.ExponentialLR(
            optimizer=self.optimizer,
            gamma=args.learning_rate_decay_gamma
        )
        self.learning_rate_decay = args.learning_rate_decay
    def load_train_data(self, batch_size=None):
        global flag
        if batch_size is None:
            batch_size = self.batch_size
        if flag:  # full-batch training: use all local training samples as one batch
            batch_size = self.train_samples
        train_data = read_client_data(self.dataset, self.id, is_train=True)
        return DataLoader(train_data, batch_size, drop_last=True, shuffle=True)
    def load_test_data(self, batch_size=None):
        global flag
        if batch_size is None:
            batch_size = self.batch_size
        if flag:  # full-batch testing: use all local test samples as one batch
            batch_size = self.test_samples
        test_data = read_client_data(self.dataset, self.id, is_train=False)
        return DataLoader(test_data, batch_size, drop_last=False, shuffle=True)
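The global flag could also be avoided by resolving the effective batch size per client. The following is a minimal sketch of that idea, not PFLlib's code: resolve_batch_size is a hypothetical helper name, the sample counts are illustrative, and -1 is assumed to mean "use the whole local dataset", following the -lbs -1 convention above.

```python
def resolve_batch_size(requested: int, num_samples: int) -> int:
    """Map the sentinel -1 to the client's full dataset size (B = infinity)."""
    if requested == -1:
        return num_samples
    if requested <= 0:
        raise ValueError(f"batch size must be positive or -1, got {requested}")
    return requested


# Illustrative sample counts for one client (assumed, not taken from PFLlib).
train_samples, test_samples = 600, 100

for requested in (10, -1):
    train_bs = resolve_batch_size(requested, train_samples)
    test_bs = resolve_batch_size(requested, test_samples)
    print(requested, train_bs, test_bs)
# 10 10 10
# -1 600 100
```

The resolved value would then be passed as the batch_size argument of DataLoader in load_train_data and load_test_data, so no state is shared between clients.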
But I am not quite sure whether to modify the global variable batch_size in the file generate_MNIST.py.
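Your method speeds things up by changing local_batch_size: it feeds all of each client's data into the model at once for training (or testing). This is no different from simply enlarging local_batch_size (e.g. local_batch_size=64 or 128), which also speeds things up, but a more common setting is local_batch_size=10.
As for batch_size in generate_MNIST.py, my understanding is that it does not need to be changed. Its purpose is to load the entire training set and test set (without splitting them into batches), merge them, and then call separate_data() to partition the data.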
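Many thanks to TsingZ for contributing to and open-sourcing PFLlib. I also know that a lot of good optimization work has gone into resource overhead. How can I speed up my model training? Apart from these three methods (setting local_batch_size=10, setting the DataLoader's num_workers, and using torch.nn.DataParallel()), are there other methods or hyperparameters I can change?
By the way, on a multi-GPU server, setting os.environ["CUDA_VISIBLE_DEVICES"] = args.device_id in main.py does not switch to my other GPUs. Even when I specify args.device_id=1, it still runs on GPU 0 by default. Have you run into this problem?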
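I'd like to ask a second question: how can I modify the code to train on multiple GPUs at the same time? When I specify args.device_id=0,1, it still runs only on the first GPU (index 0). This may differ slightly from your problem: in my case it runs on one of the indices I pass in, but it never trains on multiple GPUs simultaneously. How can this be achieved?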
> But I am not quite sure whether to modify the global variable batch_size in the file generate_MNIST.py
If you reduce the value of batch_size after data has already been assigned to clients, there's no need to modify the batch_size variable in utils/dataset_utils.py. Otherwise, it is recommended to modify the batch_size variable in utils/dataset_utils.py, as it determines the least_samples variable. If the data distributed to each client is sufficient, the least_samples value may not be a concern. However, if num_clients is set too large, least_samples becomes important, as we set drop_last=True for the trainloader. The trainloader might be empty if the client's data is smaller than a single batch.
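The empty-trainloader case follows from how drop_last counts batches. Below is a small sketch of that arithmetic in plain Python; num_batches is a hypothetical helper and the client sizes are made up for illustration.

```python
def num_batches(num_samples: int, batch_size: int, drop_last: bool) -> int:
    """Batches yielded per epoch for a dataset of num_samples items."""
    if drop_last:
        return num_samples // batch_size   # incomplete final batch is discarded
    return -(-num_samples // batch_size)   # ceiling division keeps it


# Hypothetical client sizes after splitting the data across many clients.
for n in (8, 10, 25):
    print(n, num_batches(n, 10, drop_last=True), num_batches(n, 10, drop_last=False))
# 8 0 1
# 10 1 1
# 25 2 3
```

A client with 8 samples and a batch size of 10 yields no training batch at all when drop_last=True, which is why a lower bound on the per-client sample count matters.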
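> I'd like to ask a second question: how can I modify the code to train on multiple GPUs at the same time? When I specify args.device_id=0,1, it still runs only on the first GPU (index 0). This may differ slightly from your problem: in my case it runs on one of the indices I pass in, but it never trains on multiple GPUs simultaneously. How can this be achieved?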
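Multi-GPU parallelism in PFLlib depends on whether the model you use supports multiple GPUs itself. If the model natively supports multi-GPU execution, for example Stable Diffusion from HuggingFace, you only need to set args.device_id=0,1 and PFLlib will support multi-GPU training automatically.
Alternatively, you only need to modify the model's code so that the model itself supports parallel execution.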
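One way to make a model support parallelism in PyTorch is to wrap it in torch.nn.DataParallel, which splits each input batch across the visible GPUs. The sketch below is an assumption about how this could be done, not PFLlib's own code: maybe_parallelize is a hypothetical helper and nn.Linear stands in for a real model. It falls back to a single device when fewer than two GPUs are visible.

```python
import torch
import torch.nn as nn


def maybe_parallelize(model: nn.Module) -> nn.Module:
    """Wrap the model in DataParallel when at least two GPUs are visible."""
    if torch.cuda.is_available():
        model = model.to("cuda:0")
        if torch.cuda.device_count() > 1:
            # Replicates the model on every visible GPU and splits each batch.
            model = nn.DataParallel(model)
    return model


model = maybe_parallelize(nn.Linear(4, 2))  # placeholder for a real model
device = next(model.parameters()).device
out = model(torch.randn(8, 4, device=device))
print(type(model).__name__, tuple(out.shape))
```

With small local batches, the cost of scattering inputs and gathering outputs may outweigh the gain, so this is likely to help mainly when the local batch size is large. The PyTorch documentation recommends DistributedDataParallel over DataParallel for multi-GPU training, at the cost of a more involved setup.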
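> Many thanks to TsingZ for contributing to and open-sourcing PFLlib. I also know that a lot of good optimization work has gone into resource overhead. How can I speed up my model training? Apart from these three methods (setting local_batch_size=10, setting the DataLoader's num_workers, and using torch.nn.DataParallel()), are there other methods or hyperparameters I can change?
> By the way, on a multi-GPU server, setting os.environ["CUDA_VISIBLE_DEVICES"] = args.device_id in main.py does not switch to my other GPUs. Even when I specify args.device_id=1, it still runs on GPU 0 by default. Have you run into this problem?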
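Two CUDA behaviours are worth checking for the CUDA_VISIBLE_DEVICES symptom described above: the variable only takes effect if it is set before CUDA is initialized in the process, and the visible GPUs are renumbered from 0, so physical GPU 1 appears as cuda:0 inside the process. Here is a minimal sketch, assuming the device id arrives as a string; select_gpu is a hypothetical helper, and the torch import is made optional so the snippet also runs without PyTorch installed.

```python
import os


def select_gpu(device_id: str) -> None:
    """Restrict this process to the given physical GPU(s), e.g. "1" or "0,1".

    Must run before anything initializes CUDA; calling it before
    `import torch` is the safest ordering.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = device_id


select_gpu("1")

try:
    import torch  # imported only after the environment variable is set
except ImportError:
    torch = None

if torch is not None and torch.cuda.is_available():
    # Only the selected GPU is visible, and it is renumbered as device 0.
    print(torch.cuda.device_count(), torch.cuda.current_device())
else:
    print("CUDA not available; CUDA_VISIBLE_DEVICES =", os.environ["CUDA_VISIBLE_DEVICES"])
```

Tools such as nvidia-smi still report the physical index, so a process that shows up as cuda:0 inside PyTorch is not necessarily running on physical GPU 0.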