WwZzz / easyFL

An experimental platform for federated learning.
Apache License 2.0
521 stars 88 forks

Errors when running afl and qfedavg #11

Closed JiaxiangRen closed 2 years ago

JiaxiangRen commented 2 years ago

Hello, running the afl baseline produces the following error. cmd: `python main.py --task mnist_classification_cnum100_dist0_skew0_seed0 --model cnn --algorithm afl --num_rounds 2 --num_epochs 1 --learning_rate 0.215 --proportion 0.1 --batch_size 10 --eval_interval 1`

[screenshot: error traceback]

Running qffl also raises an error. cmd: `python main.py --task mnist_classification_cnum100_dist0_skew0_seed0 --model cnn --algorithm qfedavg --num_rounds 2 --num_epochs 1 --learning_rate 0.215 --proportion 0.1 --batch_size 10 --eval_interval 1`

[screenshot: error traceback]

This may be related to the output format of `communicate`. How can this bug be fixed? Thanks a lot!

WwZzz commented 2 years ago

Hello, and thank you very much for reporting the bug.

1) The afl problem comes from `fmodule._model_average` using the `if not list` idiom to test whether an array is empty, which raises an error in some cases. All code that used this idiom for emptiness checks has now been fixed, and afl has been re-tested. (Note: afl works with full client participation by default, so setting `proportion=0.1` has no effect, because no sampling is performed in its `iterate` function; see the original paper, Agnostic Federated Learning.)

2) Running qfedavg locally with the same command did not reproduce the bug and finished successfully. Could you share your local qfedavg.py file?
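The emptiness-check pitfall described above can be reproduced in a few lines. The helper names below are illustrative, not from the repo: `not x` on a NumPy array (or a PyTorch tensor) with more than one element raises an ambiguity error, whereas `len(x) == 0` behaves uniformly for lists and arrays.

```python
import numpy as np

def is_empty_bad(xs):
    # Raises ValueError for numpy arrays with more than one element:
    # "The truth value of an array with more than one element is ambiguous"
    return not xs

def is_empty_good(xs):
    # len() works the same way for lists and numpy arrays
    return len(xs) == 0
```

For a plain Python list both forms agree; the difference only surfaces once array-like objects flow through the same code path, which is why the bug appeared in `_model_average`.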


JiaxiangRen commented 2 years ago

Hello, thank you for your reply. The qfedavg file is as follows:

```python
from .fedbase import BasicServer, BasicClient
import numpy as np
from utils import fmodule

class Server(BasicServer):
    def __init__(self, option, model, clients, test_data=None):
        super(Server, self).__init__(option, model, clients, test_data)
        self.q = option['q']
        self.paras_name = ['q']

    def iterate(self, t):
        # sample clients
        self.selected_clients = self.sample()
        # training
        res = self.communicate(self.selected_clients)
        models, train_losses = res['model'], res['loss']
        # plug in the weight updates into the gradient
        grads = [(self.model - model) / self.lr for model in models]
        Deltas = [gi * np.float_power(li + 1e-10, self.q) for gi, li in zip(grads, train_losses)]
        # estimation of the local Lipschitz constant
        hs = [self.q * np.float_power(li + 1e-10, (self.q - 1)) * (gi.norm() ** 2)
              + 1.0 / self.lr * np.float_power(li + 1e-10, self.q)
              for gi, li in zip(grads, train_losses)]
        # aggregate
        self.model = self.aggregate(Deltas, hs)
        return

    def aggregate(self, Deltas, hs):
        denominator = np.sum(np.asarray(hs))
        scaled_deltas = [delta / denominator for delta in Deltas]
        updates = fmodule._model_sum(scaled_deltas)
        new_model = self.model - updates
        return new_model

class Client(BasicClient):
    def __init__(self, option, name='', train_data=None, valid_data=None):
        super(Client, self).__init__(option, name, train_data, valid_data)

    def reply(self, svr_pkg):
        model = self.unpack(svr_pkg)
        train_loss = self.test(model, 'train')
        self.train(model)
        cpkg = self.pack(model, train_loss)
        return cpkg

    def pack(self, model, loss):
        return {
            "model": model,
            "loss": loss,
        }
```
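As a sanity check on the aggregation logic in `iterate`/`aggregate` above, here is a scalar sketch of the same q-FedAvg update, with plain floats standing in for model objects (an illustrative simplification, not the repo's actual `fmodule` API):

```python
import numpy as np

def qfedavg_step(w, client_ws, client_losses, lr, q, eps=1e-10):
    # Scalar stand-in for the server-side q-FedAvg update:
    # recover pseudo-gradients from the weight differences
    grads = [(w - wk) / lr for wk in client_ws]
    # loss-weighted deltas, as in Server.iterate
    deltas = [g * np.float_power(l + eps, q)
              for g, l in zip(grads, client_losses)]
    # estimated local Lipschitz terms, as in Server.iterate
    hs = [q * np.float_power(l + eps, q - 1) * g ** 2
          + np.float_power(l + eps, q) / lr
          for g, l in zip(grads, client_losses)]
    # aggregate: subtract the h-normalized sum of deltas from the global model
    return w - sum(deltas) / sum(hs)
```

With q=0 the loss weights all become 1 and the step reduces to plain averaging of the client models; larger q up-weights clients with larger training loss, which is the fairness mechanism of q-FFL.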
WwZzz commented 2 years ago

Hello, the second line of your `Client.reply` differs from the current version: it should be `train_loss = self.test(model, 'train')['loss']`. This is because the return value of `test` is now wrapped in a dict (since metrics vary a lot across datasets). With this change it should run successfully.


JiaxiangRen commented 2 years ago

Thank you for your reply. I re-cloned the current code and the error is gone. Many thanks!