delta-mpc / crypten_vfl_demo

vertical federated learning demo with crypten
MIT License
42 stars, 6 forks

A question about running your project #1

Open qingqbaby opened 3 years ago

qingqbaby commented 3 years ago

Your README says that to try single-machine vertical federated learning, you open three terminals and run the following commands respectively:

RENDEZVOUS=file:///tmp/vfl && WORLD_SIZE=3 && RANK=0 python train_multi.py
RENDEZVOUS=file:///tmp/vfl && WORLD_SIZE=3 && RANK=1 python train_multi.py
RENDEZVOUS=file:///tmp/vfl && WORLD_SIZE=3 && RANK=2 python train_multi.py

I first ran the data preprocessing, then opened three terminals, activated the pytorch 1.8 environment in each, entered the commands above, and pressed Enter.

The first terminal reported the following error:

Traceback (most recent call last):
  File "train_multi.py", line 182, in <module>
    main()
  File "train_multi.py", line 165, in main
    train_dataloader = make_mpc_dataloader(train_filename, batch_size, shuffle=True, drop_last=False)
  File "train_multi.py", line 61, in make_mpc_dataloader
    mpc_tensor = load_encrypt_tensor(filename)
  File "train_multi.py", line 39, in load_encrypt_tensor
    tensor = crypten.cryptensor(dummy_tensor, src=i)
  File "/home/zhangziqing/anaconda3/envs/pytorch/lib/python3.8/site-packages/crypten/__init__.py", line 79, in cryptensor
    return backend.MPCTensor(*args, **kwargs)
  File "/home/zhangziqing/anaconda3/envs/pytorch/lib/python3.8/site-packages/crypten/mpc/mpc.py", line 71, in __init__
    self._tensor = tensor_name(input, *args, **kwargs)
  File "/home/zhangziqing/anaconda3/envs/pytorch/lib/python3.8/site-packages/crypten/mpc/primitives/arithmetic.py", line 39, in __init__
    assert (
AssertionError: invalid tensor source

The second and third terminals reported:

Traceback (most recent call last):
  File "train_multi.py", line 182, in <module>
    main()
  File "train_multi.py", line 158, in main
    crypten.init()
  File "/home/zhangziqing/anaconda3/envs/pytorch/lib/python3.8/site-packages/crypten/__init__.py", line 20, in init
    comm._init(use_threads=False, init_ttp=crypten.mpc.ttp_required())
  File "/home/zhangziqing/anaconda3/envs/pytorch/lib/python3.8/site-packages/crypten/communicator/__init__.py", line 32, in _init
    cls.initialize(rank, world_size, init_ttp=init_ttp)
  File "/home/zhangziqing/anaconda3/envs/pytorch/lib/python3.8/site-packages/crypten/communicator/distributed_communicator.py", line 87, in initialize
    cls.instance = DistributedCommunicator(init_ttp=init_ttp)
  File "/home/zhangziqing/anaconda3/envs/pytorch/lib/python3.8/site-packages/crypten/communicator/distributed_communicator.py", line 51, in __init__
    dist.init_process_group(
  File "/home/zhangziqing/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 400, in init_process_group
    _default_pg = _new_process_group_helper(
  File "/home/zhangziqing/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 471, in _new_process_group_helper
    pg = ProcessGroupGloo(
RuntimeError: [enforce fail at /pytorch/third_party/gloo/gloo/context.cc:25] rank < size. 2 vs 1

Do my commands need to be adjusted? What I typed was exactly `RENDEZVOUS=file:///tmp/vfl && WORLD_SIZE=3 && RANK=2 python train_multi.py`. Or should I give up on the dual-boot setup and use VMs to build three virtual machines instead? (That would obviously invite even more problems.)

mh739025250 commented 3 years ago

Change the launch commands by removing the `&&` (with `&&` each assignment is a separate command, so the variables never reach the python process as environment variables). Use:

RENDEZVOUS=tcp://192.168.1.100:2345 WORLD_SIZE=3 RANK=0 python train_multi.py
RENDEZVOUS=tcp://192.168.1.100:2345 WORLD_SIZE=3 RANK=1 python train_multi.py
RENDEZVOUS=tcp://192.168.1.100:2345 WORLD_SIZE=3 RANK=2 python train_multi.py

The README indeed had this wrong, sorry @qingqbaby

qingqbaby commented 3 years ago

Thank you very much for answering my previous question. I ran into three more issues in your code that I would like to ask about.

Question 1:

In the make_mpc_dataloader function of your train_multi.py there is this code:

seed = (crypten.mpc.MPCTensor.rand(1) * (2 ** 32)).get_plain_text().int().item()
generator = torch.Generator()
generator.manual_seed(seed)
dataloader = DataLoader(dataset, batch_size, shuffle=shuffle, drop_last=drop_last, collate_fn=crypten_collate, generator=generator)

My IDE complains that MPCTensor has no rand method and that generator has no manual_seed.

Question 2:

In the for loop of train_mpc in train_multi.py there is this line: loss_val = loss(out, ys). Why does my IDE suggest that ys should be removed from this call? I went looking for the definition of loss but could not find it.

Question 3:

In the main function of your train_multi.py there is this line: mpc_loss = crypten.nn.BCEWithLogitsLoss(). It errors out saying this function does not exist. Did you later change the functions involved in these three questions?

mh739025250 commented 3 years ago

Regarding question 1: this is a CrypTen version issue. The crypten 0.1 release installed via pip indeed has no MPCTensor.rand method; it does exist if you install CrypTen from source (master branch). If you don't want to install from source, you can use crypten.mpc.rand instead. generator does have a manual_seed method, so that is probably just an IDE warning.
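
For reference, a minimal sketch of that shared-seed construction with the suggested fallback; it mirrors the snippet quoted above, but the try/except fallback is my own addition rather than code from the repo:

```python
import crypten
import crypten.mpc
import torch

# Assumes crypten.init() has already been called, as train_multi.py does.
# All parties jointly generate one secret-shared random value and then reveal it,
# so every rank derives the same shuffling seed.
try:
    rand_val = crypten.mpc.MPCTensor.rand(1)   # available when CrypTen is installed from source (master)
except AttributeError:
    rand_val = crypten.mpc.rand(1)             # fallback for the pip-installed crypten 0.1
seed = int((rand_val * (2 ** 32)).get_plain_text().item())

generator = torch.Generator()
generator.manual_seed(seed)   # torch.Generator does expose manual_seed; the IDE warning can be ignored
```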

Regarding question 2: that is just an IDE hint; it shouldn't be an actual error, right?

Regarding question 3: also a CrypTen version issue. The crypten 0.1 release installed via pip indeed has no BCEWithLogitsLoss; it does exist when CrypTen is installed from source (master branch). If you don't want to install from source, you can change that line to mpc_loss = crypten.nn.BCELoss() and modify the MLP model by appending a Sigmoid layer at the end, which achieves the same effect.
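
A sketch of that workaround; the MLP layer sizes below are placeholders rather than the ones in the repo:

```python
import torch.nn as nn
import crypten

# Hypothetical plaintext MLP; appending a Sigmoid layer makes the model output
# probabilities, so crypten.nn.BCELoss can stand in for crypten.nn.BCEWithLogitsLoss.
mlp = nn.Sequential(
    nn.Linear(20, 16),
    nn.ReLU(),
    nn.Linear(16, 1),
    nn.Sigmoid(),        # added layer
)

mpc_loss = crypten.nn.BCELoss()  # expects probabilities rather than logits
```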

qingqbaby commented 3 years ago

Thanks, the problem has been solved~

xierongpytorch commented 3 years ago

Hello! Did you run into this problem?

Traceback (most recent call last):
  File "train_multi.py", line 188, in <module>
    main()
  File "train_multi.py", line 163, in main
    crypten.init()
  File "/mnt/DataDisk/conda/envs/syft/lib/python3.7/site-packages/crypten/__init__.py", line 20, in init
    comm._init(use_threads=False, init_ttp=crypten.mpc.ttp_required())
  File "/mnt/DataDisk/conda/envs/syft/lib/python3.7/site-packages/crypten/communicator/__init__.py", line 32, in _init
    cls.initialize(rank, world_size, init_ttp=init_ttp)
  File "/mnt/DataDisk/conda/envs/syft/lib/python3.7/site-packages/crypten/communicator/distributed_communicator.py", line 87, in initialize
    cls.instance = DistributedCommunicator(init_ttp=init_ttp)
  File "/mnt/DataDisk/conda/envs/syft/lib/python3.7/site-packages/crypten/communicator/distributed_communicator.py", line 55, in __init__
    rank=self.rank,
  File "/mnt/DataDisk/conda/envs/syft/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 407, in init_process_group
    timeout=timeout)
  File "/mnt/DataDisk/conda/envs/syft/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 475, in _new_process_group_helper
    timeout=timeout)
RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:761] connect [127.0.1.1]:10644: Connection refused

Sprinter1999 commented 3 years ago

> Hello! Did you run into this problem? Traceback (most recent call last): ... RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:761] connect [127.0.1.1]:10644: Connection refused

It seems you just need to run it again.

Sprinter1999 commented 3 years ago

> Regarding question 1: this is a CrypTen version issue ... you can use crypten.mpc.rand instead ...
>
> Regarding question 2: that is just an IDE hint; it shouldn't be an actual error, right?
>
> Regarding question 3: also a CrypTen version issue ... you can change that line to mpc_loss = crypten.nn.BCELoss() and append a Sigmoid layer to the MLP model for the same effect.

Hello, I ran into the same series of problems, but I got stuck on the last step. CrypTen does not support calling Sigmoid, so putting the torch model into the crypten model errors out. Does that mean installing CrypTen from source is the only option? If so, could you walk me through how to download the source and set it up on a server?

qingqbaby commented 3 years ago

> Hello! Did you run into this problem? Traceback (most recent call last): ... RuntimeError: ... connect [127.0.1.1]:10644: Connection refused

Did you maybe type the command wrong in the terminal?

Sprinter1999 commented 3 years ago

> Thanks, the problem has been solved~

Did you solve it by installing from source?

Sprinter1999 commented 3 years ago

> Hello! Did you run into this problem? Traceback (most recent call last): ... RuntimeError: ... connect [127.0.1.1]:10644: Connection refused
>
> Did you maybe type the command wrong in the terminal?

Probably, yes; just re-enter the correct commands and run it again.

qingqbaby commented 3 years ago

> Thanks, the problem has been solved~
>
> Did you solve it by installing from source?

Yes. You need to download the source from GitHub and unpack CrypTen-master:

cd /home/qingqing/PycharmProjects/CrypTen-master
python setup.py install

After that, every time before running I do the setup step first and then switch over to run train_multi.py.

xierongpytorch commented 3 years ago

> Hello! Did you run into this problem? Traceback (most recent call last): ... RuntimeError: ... connect [127.0.1.1]:10644: Connection refused
>
> Did you maybe type the command wrong in the terminal?
>
> Probably, yes; just re-enter the correct commands and run it again.

It really does work now; it took several failed attempts before it succeeded. Thanks for the help~

Sprinter1999 commented 3 years ago

> Thanks, the problem has been solved~
>
> Did you solve it by installing from source?
>
> Yes. You need to download the source from GitHub and unpack CrypTen-master:
> cd /home/qingqing/PycharmProjects/CrypTen-master
> python setup.py install
> After that, every time before running I do the setup step first and then switch over to run train_multi.py.

Thank you! Much appreciated!

xierongpytorch commented 3 years ago

> Thanks, the problem has been solved~
>
> Did you solve it by installing from source?
>
> Yes. You need to download the source from GitHub and unpack CrypTen-master: cd /home/qingqing/PycharmProjects/CrypTen-master, python setup.py install. After that, every time before running I do the setup step first and then switch over to run train_multi.py.
>
> Thank you! Much appreciated!

Sorry to bother you again! I have a few things I would like to verify and ask about.

1. The results I reproduce differ a lot from yours. Is that normal?

2. When validating on another dataset I ran into a puzzling problem:

Traceback (most recent call last):
  File "train_multi.py", line 236, in <module>
    main()
  File "train_multi.py", line 231, in main
    validate_loss, p, r, acc, f1 = validate_mpc(test_dataloader, mpc_model, mpc_loss)
  File "train_multi.py", line 169, in validate_mpc
    return total_loss / count, precision_score(true_ys, pred_ys), recall_score(true_ys, pred_ys), \
  File "/mnt/DataDisk/conda/envs/syft/lib/python3.7/site-packages/scikit_learn-0.24.2-py3.7-linux-x86_64.egg/sklearn/utils/validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "/mnt/DataDisk/conda/envs/syft/lib/python3.7/site-packages/scikit_learn-0.24.2-py3.7-linux-x86_64.egg/sklearn/metrics/_classification.py", line 1662, in precision_score
    zero_division=zero_division)
  File "/mnt/DataDisk/conda/envs/syft/lib/python3.7/site-packages/scikit_learn-0.24.2-py3.7-linux-x86_64.egg/sklearn/utils/validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "/mnt/DataDisk/conda/envs/syft/lib/python3.7/site-packages/scikit_learn-0.24.2-py3.7-linux-x86_64.egg/sklearn/metrics/_classification.py", line 1465, in precision_recall_fscore_support
    pos_label)
  File "/mnt/DataDisk/conda/envs/syft/lib/python3.7/site-packages/scikit_learn-0.24.2-py3.7-linux-x86_64.egg/sklearn/metrics/_classification.py", line 1277, in _check_set_wise_labels
    y_type, y_true, y_pred = _check_targets(y_true, y_pred)
  File "/mnt/DataDisk/conda/envs/syft/lib/python3.7/site-packages/scikit_learn-0.24.2-py3.7-linux-x86_64.egg/sklearn/metrics/_classification.py", line 93, in _check_targets
    "and {1} targets".format(type_true, type_pred))
ValueError: Classification metrics can't handle a mix of continuous and binary targets

So I printed pred_ys and true_ys:

type(pred_ys) = <class 'list'>
type(true_ys) = <class 'list'>
len(pred_ys) = 17688
len(true_ys) = 17688
pred_ys[0:10] = [1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0]
true_ys[0:10] = [3.769989013671875, 0.0, 0.0, 0.0, 0.0, 0.0, 1.279998779296875, 0.0, 16.439987182617188, 0.0]

In theory my ground-truth labels are also 0 and 1. Is this a decryption problem?

mh739025250 commented 3 years ago

That is a strange one. The gap in true_ys is far too large to be a precision issue from decryption. Take a look at the make_mpc_dataloader function: there the label is assumed to be the last column of the input. Is your dataset in that format? You may need to adapt the code to your own dataset and try again.
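
A small sketch of the column split being described; the tensor here is a plain stand-in, and the actual slicing in make_mpc_dataloader may differ in detail:

```python
import torch

# Dummy stand-in for the loaded data; in train_multi.py this is an encrypted MPC tensor.
data = torch.tensor([[0.1, 0.2, 1.0],
                     [0.3, 0.4, 0.0]])

# Layout assumed by make_mpc_dataloader: the label is the LAST column.
feature, label = data[:, :-1], data[:, -1]

# If a dataset stores the label in the FIRST column instead, the split must change:
feature_alt, label_alt = data[:, 1:], data[:, 0]
```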

Sprinter1999 commented 3 years ago

> Sorry to bother you again! ... When validating on another dataset I ran into a puzzling problem: ... ValueError: Classification metrics can't handle a mix of continuous and binary targets ... In theory my ground-truth labels are also 0 and 1. Is this a decryption problem?

I ran into a similar problem as well; it makes the loss value very large. I haven't had time to look into where the problem is yet.

xierongpytorch commented 3 years ago

> That is a strange one. The gap in true_ys is far too large to be a precision issue from decryption. Take a look at the make_mpc_dataloader function ... You may need to adapt the code to your own dataset and try again.

Thanks for your reply. I considered that and made the following changes:

1. feature, label = mpc_tensor[:,1:], mpc_tensor[:, 0:1]
2. label = label.squeeze()

Still no luck, so as a last resort I went further and added true_ys = torch.where(true_ys > 0, 1.0, 0.0) in validate_mpc. With that it runs, but is that a sound thing to do?

xierongpytorch commented 3 years ago

> I ran into a similar problem as well; it makes the loss value very large. I haven't had time to look into where the problem is yet.

Indeed, I don't know where the problem lies either. In my previous reply I proposed a not-very-sound workaround; I'd be glad to discuss it further with you~

mh739025250 commented 3 years ago

> Thanks for your reply. I considered that and made the following changes: 1. feature, label = mpc_tensor[:,1:], mpc_tensor[:, 0:1] 2. label = label.squeeze(). Still no luck, so as a last resort I added true_ys = torch.where(true_ys > 0, 1.0, 0.0) in validate_mpc. With that it runs, but is that a sound thing to do?

That doesn't feel very sound. As a first check, try simply encrypting and then decrypting a tensor on its own: do you really get a deviation that large?
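
A minimal round-trip check along those lines (the label values are only an example):

```python
import crypten
import torch

crypten.init()  # in train_multi.py this is already called as part of the multi-party setup

labels = torch.tensor([0.0, 1.0, 1.0, 0.0])
enc = crypten.cryptensor(labels)   # encrypt (secret-share) the tensor
dec = enc.get_plain_text()         # decrypt it back

# The fixed-point encoding only introduces tiny error (on the order of 1e-4),
# nothing like the 3.77 or 16.44 values seen in true_ys above.
print((dec - labels).abs().max())
```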

Sprinter1999 commented 3 years ago

> Indeed, I don't know where the problem lies either. In my previous reply I proposed a not-very-sound workaround; I'd be glad to discuss it further with you~

I am using this codebase with the same dataset, but the loss values are very abnormal, on the order of tens of thousands. I don't think the workaround above is appropriate, and I'm also still looking into where the problem is. Did you run into this when you ran the original project?

xierongpytorch commented 3 years ago

> Did you run into this when you ran the original project?

Yes, I hit the same thing when reproducing it, so my accuracy never reached what the author describes in the README.

xierongpytorch commented 3 years ago

> That doesn't feel very sound. As a first check, try simply encrypting and then decrypting a tensor on its own: do you really get a deviation that large?

I printed the decrypted label list in the original project and it comes out as clean 0s and 1s. But with the same processing applied to my own data, decryption fails like this...

Sprinter1999 commented 3 years ago

> That doesn't feel very sound. As a first check, try simply encrypting and then decrypting a tensor on its own: do you really get a deviation that large?

When you run train_multi in this project, roughly what does the loss look like? On my side the AUC never gets close to the numbers in the README, and the loss values are also enormous :(

clevertension commented 3 years ago

I see the same thing: the loss is very large, on the order of hundreds of thousands.

Hiramdu commented 2 years ago

On my side, after 50 epochs the AUC stays around 0.49-0.5.

chx-Github commented 2 years ago

@mh739025250, could you share an archive of the source? I can no longer find CrypTen-master, only CrypTen-main (https://github.com/facebookresearch/CrypTen/tree/main), and running the main branch still doesn't work. Thanks a lot!

chx-Github commented 2 years ago

> Yes. You need to download the source from GitHub and unpack CrypTen-master: cd /home/qingqing/PycharmProjects/CrypTen-master, python setup.py install. After that, every time before running I do the setup step first and then switch over to run train_multi.py.

@qingqbaby, could you share an archive of the source? I can no longer find CrypTen-master, only CrypTen-main (https://github.com/facebookresearch/CrypTen/tree/main), and running the main branch still doesn't work. Thanks a lot!