On performance issues - GitHub Issues

RayWang99 commented 1 month ago

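The figure above is the official performance chart. In our own tests we compared FATE 1.9 against 2.0. After largely ruling out the runtime environment as a factor, 2.0 is generally somewhat slower than 1.9. For example:
Vertical (hetero) linear regression, 50,000 rows: 23 min 3 s on 1.9, 25 min 13 s on 2.0.
Vertical (hetero) logistic regression, 50,000 rows: 23 min 52 s on 1.9, 33 min 30 s on 2.0.

The measurements are not very precise, but the jobs ran on the same 2-3 machines. From what we observed, PSI does not appear to be the bottleneck either. So we would like to ask: what are the most likely causes of this kind of performance gap?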

RayWang99 commented 1 month ago

Our hardware is fairly limited.
Guest: CentOS Linux release 7.6.1810 (Core), 4 cores, 32 GB RAM, 500 GB disk
Host1: CentOS Linux release 7.6.1810 (Core), 8 cores, 32 GB RAM, 500 GB disk
Host2: Ubuntu 20.04.6 LTS, 4 cores, 32 GB RAM, 500 GB disk

jiejielu-0309 commented 1 month ago

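Indeed, I have noticed the same problem. I tested FATE 1.11.4 and 2.1.0. Test environment: WSL installed on a local Windows machine, cluster deployment, single-party test, a dataset of 4,000 rows. I bound the data with both flow table bind and flow data upload, and in every case the submitted job finished sooner on 1.11 than on 2.1.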

dylan-fan commented 1 month ago

A question: is that the time for the whole job? Could you look at the time per epoch? That would make the comparison easier.

RayWang99 commented 1 month ago

The 1.9 run is hard to retrieve because we have since switched environments, but we re-ran the 50,000-row job on 2.0.
Total: 25 min
reader 13s
psi 2min
linr_0 22min
evaluation 40s

[INFO][2024-09-23 17:01:26,667][29090][guest.fit_model][line:223]: self.optimizer set epoch 0
[INFO][2024-09-23 17:03:26,603][29090][guest.fit_model][line:223]: self.optimizer set epoch 1
[INFO][2024-09-23 17:05:26,614][29090][guest.fit_model][line:223]: self.optimizer set epoch 2
[INFO][2024-09-23 17:07:22,806][29090][guest.fit_model][line:223]: self.optimizer set epoch 3
[INFO][2024-09-23 17:09:21,924][29090][guest.fit_model][line:223]: self.optimizer set epoch 4
[INFO][2024-09-23 17:11:22,433][29090][guest.fit_model][line:223]: self.optimizer set epoch 5
[INFO][2024-09-23 17:13:20,808][29090][guest.fit_model][line:223]: self.optimizer set epoch 6
[INFO][2024-09-23 17:15:26,421][29090][guest.fit_model][line:223]: self.optimizer set epoch 7
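For readers who want the per-epoch times that were asked for, the gaps between consecutive "set epoch" timestamps can be computed with a few lines of Python. This is only a sketch: the function name `epoch_durations` and the `marker` parameter are made up for illustration, and the timestamp format is assumed to match the log lines in this comment.

```python
import re
from datetime import datetime

# Matches the timestamp field, e.g. "[2024-09-23 17:01:26,667]".
TS_PATTERN = re.compile(r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\]")


def epoch_durations(log_lines, marker="set epoch"):
    """Return the seconds elapsed between consecutive log lines containing `marker`."""
    stamps = []
    for line in log_lines:
        if marker not in line:
            continue
        match = TS_PATTERN.search(line)
        if match:
            stamps.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f"))
    return [(later - earlier).total_seconds() for earlier, later in zip(stamps, stamps[1:])]


# The first three lines of the guest log quoted in this comment.
sample = [
    "[INFO][2024-09-23 17:01:26,667][29090][guest.fit_model][line:223]: self.optimizer set epoch 0",
    "[INFO][2024-09-23 17:03:26,603][29090][guest.fit_model][line:223]: self.optimizer set epoch 1",
    "[INFO][2024-09-23 17:05:26,614][29090][guest.fit_model][line:223]: self.optimizer set epoch 2",
]
durations = epoch_durations(sample)
print([round(d, 3) for d in durations])
```

Applied to the full log, every epoch takes roughly two minutes. For logs that use a different message, pass another `marker` string.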

RayWang99 commented 1 month ago

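Because we run end-to-end tests, we cannot break things down that finely. All we can say is that it is the same algorithm, the same workflow and the same machines, and we tried to keep the version as the only variable. It is precisely because the performance is off that we need to find out what might cause it. If the algorithm itself is not the cause, then some other part must be, otherwise the job as a whole would not be slower.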

yx0090sh commented 1 month ago

Machine configuration:
CentOS 7.2 64-bit
CPU: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
Memory: 32 GB
4 cores used on each host

Data and deployment:
guest: 50,000 rows x 5 features
host: 50,000 rows x 300 features
Deployed as 1+1 (one guest, one host)

Job configuration:
batch_size: -1
penalty: L2
optimizer: sgd
learning_rate: 0.15
alpha: 0.01
task-core: 4

Job duration:
1.11.4 LR total (max_iter=8): 40min57s
2.1.0 LR total (epochs=8): 30min20s

2.1.0 breakdown:
reader 5s
psi 37s
scale 36s
lr (epoch=8) 28min44s
evaluation 18s

1.11.4 breakdown:
reader 7s
data_transform 12s
intersect 43s
scale 16s
lr (epoch=8) 39min28s
evaluation 11s
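The stage timings above can be cross-checked against the reported totals with a short script. This is a sketch for illustration only; `to_seconds` is a hypothetical helper, and the dictionaries simply restate the numbers in this comment.

```python
import re


def to_seconds(text):
    """Convert strings such as '40min57s' or '18s' to a number of seconds."""
    match = re.fullmatch(r"(?:(\d+)min)?(?:(\d+)s)?", text.strip())
    if not match or not any(match.groups()):
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes, seconds = (int(group) if group else 0 for group in match.groups())
    return minutes * 60 + seconds


# Per-stage timings as reported in this comment.
stages_2_1_0 = {"reader": "5s", "psi": "37s", "scale": "36s",
                "lr": "28min44s", "evaluation": "18s"}
stages_1_11_4 = {"reader": "7s", "data_transform": "12s", "intersect": "43s",
                 "scale": "16s", "lr": "39min28s", "evaluation": "11s"}

total_2_1_0 = sum(to_seconds(value) for value in stages_2_1_0.values())
total_1_11_4 = sum(to_seconds(value) for value in stages_1_11_4.values())
ratio = total_2_1_0 / total_1_11_4

print(total_2_1_0, total_1_11_4, round(ratio, 3))
```

The stage sums match the reported totals (30min20s and 40min57s), so in this environment the 2.1.0 job takes about three quarters of the 1.11.4 time.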

RayWang99 commented 1 month ago

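Our version is also 2.1 and your data is also 50,000 rows, and your run time is close to our LR test (about 30 minutes). Could it be that our 1.9 is faster than expected?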

dylan-fan commented 1 month ago

Was the job above run with one host or with two hosts?

RayWang99 commented 1 month ago

1 guest + 1 host

yx0090sh commented 1 month ago

Machine configuration: CentOS 7.2 64-bit, CPU Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz, 32 GB memory, 4 cores used on each host

Data and deployment: guest 50,000 rows x 5 features, host 50,000 rows x 300 features, deployed as 1+1
Job configuration: batch_size: -1, penalty: L2, optimizer: sgd, learning_rate: 0.15, alpha: 0.01, task-core: 4

1.9.2 total: 40min46s
reader 8s
data_transform 9s
intersect 20s
scale 13s
lr (epoch=8) 39min45s
evaluation 11s

1.9.2 and 1.11.4 are about the same.

RayWang99 commented 1 month ago

So, comparing the results, our 2.1 time is close to yours, but our 1.9 is much faster. That may be worth looking into.

mgqa34 commented 1 month ago

Could you tell us what model your hard disks are, and could you benchmark the disk read/write performance and share the numbers?

RayWang99 commented 1 month ago

-disk
description: SCSI Disk
product: Virtual disk
vendor: VMware
physical id: 0.0.0
bus info: scsi@2:0.0.0
logical name: /dev/sda
version: 1.0
size: 500GiB (536GB)
capabilities: 7200rpm partitioned partitioned:dos
configuration: ansiversion=2 logicalsectorsize=512 sectorsize=512 signature=000c9b7a

Is it possible that 2.1 handles mechanical (spinning) disks less well than 1.9?

dylan-fan commented 1 month ago

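It would also be best to check whether your 1.9 result is reproducible and how the concurrency parameters were set, as well as the exact numbers of samples and features on the guest and on the host. All of these are influencing variables.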

RayWang99 commented 1 month ago

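We will arrange a reproduction. As for the feature and sample counts: this is a performance test comparing two versions, so whatever the counts are, both versions should be slow together or fast together. It is the same two machines, the same network, the same data, the same algorithm and the same configuration. Our point is not that 2.1 is slow; we simply cannot see why 2.1 is slower than 1.9.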

RayWang99 commented 1 month ago

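The gap is even larger for algorithms such as deep learning and logistic regression. We picked vertical linear regression for this investigation only because it runs quickly, which makes results easy to obtain.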

RayWang99 commented 1 month ago

In reply to: "Please post your concurrency settings, the number of samples and each party's number of features, so that we can reproduce this internally. Otherwise the numbers are hard to match up."

This is the 1.9 run, measured today with the parameters unchanged, also 10 rounds.

reader 9s
data_transform_0 14s
intersection_0 18s
hetero_linr_0 20min8s
evaluation 12s

[INFO] [2024-09-25 13:35:58,938] - [hetero_linr_guest.fit] [line:85]: fit_intercept:True
[INFO] [2024-09-25 13:35:59,247] - [hetero_linr_guest.fit] [line:95]: iter:0
[INFO] [2024-09-25 13:37:58,830] - [hetero_linr_guest.fit] [line:118]: iter: 0, is_converged: False
[INFO] [2024-09-25 13:37:58,831] - [hetero_linr_guest.fit] [line:95]: iter:1
[INFO] [2024-09-25 13:39:55,861] - [hetero_linr_guest.fit] [line:118]: iter: 1, is_converged: False
[INFO] [2024-09-25 13:39:55,862] - [hetero_linr_guest.fit] [line:95]: iter:2
[INFO] [2024-09-25 13:41:52,882] - [hetero_linr_guest.fit] [line:118]: iter: 2, is_converged: False
[INFO] [2024-09-25 13:41:52,882] - [hetero_linr_guest.fit] [line:95]: iter:3
[INFO] [2024-09-25 13:43:51,288] - [hetero_linr_guest.fit] [line:118]: iter: 3, is_converged: False
[INFO] [2024-09-25 13:43:51,289] - [hetero_linr_guest.fit] [line:95]: iter:4
[INFO] [2024-09-25 13:45:49,691] - [hetero_linr_guest.fit] [line:118]: iter: 4, is_converged: False
[INFO] [2024-09-25 13:45:49,691] - [hetero_linr_guest.fit] [line:95]: iter:5

RayWang99 commented 1 month ago

The gap for boosted trees is even larger. We are arranging a reproduction now and will post that data as well.

mgqa34 commented 1 month ago

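In the 2.x design, operators are invoked more frequently, and on eggroll operator calls show up as disk I/O. So please provide the disk data and the feature dimensions. Also, after a job finishes, the time spent in each operator is printed at the end of the log. With that information we can better assess what is causing the slowdown. For example, if the feature dimension is very small, disk I/O may be the bottleneck, and the gains from the compute optimizations will not show.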

yx0090sh commented 1 month ago

I would like to control the variables: let us use the same data and see whether we can reproduce each other's results.

Data: 12,600 rows x 22 columns, linr
1.9.2: reader 5s, data_transform 9s, intersect 11s, linr 2min34s, eval 9s
2.1.0: reader 4s, psi 17s, linr 60s, eval 14s

Data: guest_train_reg_normal.csv, host_train_reg_normal.csv

1.9.2 configuration files: test_hetero_linr_train_job_conf.json, test_hetero_linr_train_job_dsl.json

2.1.0 pipeline

linr_0 = CoordinatedLinR(
    "linr_0",
    epochs=10,
    batch_size=None,
    early_stop="weight_diff",
    learning_rate_scheduler={"method": "linear", "scheduler_params": {"start_factor": 1.0}},
    optimizer={"method": "sgd", "optimizer_params": {"lr": 0.15}, "alpha": 0.01},
    init_param={"fit_intercept": False},
    train_data=psi_0.outputs["output_data"],
)

yx0090sh commented 1 month ago

2.1.0 pipeline (rename the .txt file to .py): test_linr.txt

RayWang99 commented 1 month ago

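One question: for the same algorithm, the default parameters in 1.9 and 2.1 have not changed, have they? For convenience, some parameters in our tests are left at their defaults or not passed at all, so what matters is whether the two versions apply the same defaults at execution time.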

yx0090sh commented 1 month ago

The way the parameters are expressed differs somewhat, but they are essentially the same.

RayWang99 commented 1 month ago

In reply to mgqa34: "Could you tell us what model your hard disks are, and could you benchmark the disk read/write performance and share the numbers?"

Starting 10 processes
Jobs: 1 (f=0): [_____/] [-.-% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 01m:04s]
random-read: (groupid=0, jobs=4): err= 0: pid=59857: Thu Sep 26 14:49:56 2024
  read : io=4096.0MB, bw=1254.6MB/s, iops=321156, runt= 3265msec
    slat (usec): min=3, max=5455, avg= 5.84, stdev=10.44
    clat percentiles (usec):
     | 1.00th=[ 0], 5.00th=[ 0], 10.00th=[ 0], 20.00th=[ 0],
     | 30.00th=[ 0], 40.00th=[ 0], 50.00th=[ 0], 60.00th=[ 0],
     | 70.00th=[ 0], 80.00th=[ 0], 90.00th=[ 0], 95.00th=[ 0],
     | 99.00th=[ 0], 99.50th=[ 0], 99.90th=[ 0], 99.95th=[ 0],
     | 99.99th=[ 0]
  cpu : usr=28.65%, sys=69.10%, ctx=1268, majf=0, minf=357
  IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued : total=r=1048576/w=0/d=0, short=r=0/w=0/d=0
     latency : target=0, window=0, percentile=100.00%, depth=64
random-write: (groupid=1, jobs=4): err= 0: pid=59872: Thu Sep 26 14:49:56 2024
  write: io=4096.0MB, bw=931240KB/s, iops=232809, runt= 4504msec
    slat (usec): min=3, max=19329, avg=10.92, stdev=108.40
    clat percentiles (usec):
     | 1.00th=[ 0], 5.00th=[ 0], 10.00th=[ 0], 20.00th=[ 0],
     | 30.00th=[ 0], 40.00th=[ 0], 50.00th=[ 0], 60.00th=[ 0],
     | 70.00th=[ 0], 80.00th=[ 0], 90.00th=[ 0], 95.00th=[ 0],
     | 99.00th=[ 0], 99.50th=[ 0], 99.90th=[ 0], 99.95th=[ 0],
     | 99.99th=[ 0]
  cpu : usr=22.80%, sys=63.32%, ctx=14721, majf=0, minf=97
  IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0
     latency : target=0, window=0, percentile=100.00%, depth=64
read: (groupid=2, jobs=1): err= 0: pid=59898: Thu Sep 26 14:49:56 2024
  read : io=1024.0MB, bw=865876KB/s, iops=6764, runt= 1211msec
    slat (usec): min=75, max=1289, avg=95.07, stdev=25.86
    clat percentiles (usec):
     | 1.00th=[ 0], 5.00th=[ 0], 10.00th=[ 0], 20.00th=[ 0],
     | 30.00th=[ 0], 40.00th=[ 0], 50.00th=[ 0], 60.00th=[ 0],
     | 70.00th=[ 0], 80.00th=[ 0], 90.00th=[ 0], 95.00th=[ 0],
     | 99.00th=[ 0], 99.50th=[ 0], 99.90th=[ 0], 99.95th=[ 0],
     | 99.99th=[ 0]
  cpu : usr=3.06%, sys=96.69%, ctx=8, majf=0, minf=1561
  IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=0.4%, >=64=99.2%
     submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued : total=r=8192/w=0/d=0, short=r=0/w=0/d=0
     latency : target=0, window=0, percentile=100.00%, depth=64
write: (groupid=3, jobs=1): err= 0: pid=59912: Thu Sep 26 14:49:56 2024
  write: io=1024.0MB, bw=758738KB/s, iops=5927, runt= 1382msec
    slat (usec): min=98, max=505, avg=116.62, stdev=14.14
    clat percentiles (usec):
     | 1.00th=[ 0], 5.00th=[ 0], 10.00th=[ 0], 20.00th=[ 0],
     | 30.00th=[ 0], 40.00th=[ 0], 50.00th=[ 0], 60.00th=[ 0],
     | 70.00th=[ 0], 80.00th=[ 0], 90.00th=[ 0], 95.00th=[ 0],
     | 99.00th=[ 0], 99.50th=[ 0], 99.90th=[ 0], 99.95th=[ 0],
     | 99.99th=[ 0]
  cpu : usr=12.74%, sys=87.18%, ctx=3, majf=0, minf=24
  IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=0.4%, >=64=99.2%
     submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued : total=r=0/w=8192/d=0, short=r=0/w=0/d=0
     latency : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs): READ: io=4096.0MB, aggrb=1254.6MB/s, minb=1254.6MB/s, maxb=1254.6MB/s, mint=3265msec, maxt=3265msec

Run status group 1 (all jobs): WRITE: io=4096.0MB, aggrb=931239KB/s, minb=931239KB/s, maxb=931239KB/s, mint=4504msec, maxt=4504msec

Run status group 2 (all jobs): READ: io=1024.0MB, aggrb=865876KB/s, minb=865876KB/s, maxb=865876KB/s, mint=1211msec, maxt=1211msec

Run status group 3 (all jobs): WRITE: io=1024.0MB, aggrb=758738KB/s, minb=758738KB/s, maxb=758738KB/s, mint=1382msec, maxt=1382msec

This is the output of the disk benchmark.
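The report mixes MB/s and KB/s, which makes the four groups hard to compare at a glance. As a rough cross-check, the bandwidth of each group can be recomputed in MB/s from the `io=` and `runt=` fields of the output. The sketch below does only that arithmetic; the helper name `bandwidth_mb_per_s` is made up for illustration.

```python
def bandwidth_mb_per_s(io_mb, runtime_ms):
    """Throughput in MB/s from the amount transferred (MB) and the runtime (ms)."""
    return io_mb / (runtime_ms / 1000.0)


# (io in MB, runt in ms) for each group, copied from the benchmark output.
groups = {
    "random-read": (4096.0, 3265),
    "random-write": (4096.0, 4504),
    "read": (1024.0, 1211),
    "write": (1024.0, 1382),
}

bandwidths = {name: bandwidth_mb_per_s(io_mb, runt) for name, (io_mb, runt) in groups.items()}
for name, value in bandwidths.items():
    print(f"{name}: {value:.1f} MB/s")
```

The recomputed random-read figure agrees with the reported 1254.6MB/s, and the other three come out at roughly 909, 846 and 741 MB/s, which matches the KB/s figures divided by 1024.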

RayWang99 commented 1 month ago

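The figure above shows the boosted tree test on 2.1: 1 hour overall, of which 51 minutes is the algorithm itself.

[image: 1.9 boosted tree] This figure shows the boosted tree run on 1.9.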

yx0090sh commented 1 month ago

May I ask what your job's data looks like: how many rows, how many feature dimensions, and how is the job configured?

RayWang99 commented 1 month ago

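The data structure is as shown in the figure, 50,000 rows in total. The boosted tree model parameters are as shown in the figure and are identical for the two versions.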

yx0090sh commented 1 month ago

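Judging from the figures you posted, your workflows are not the same. In the SecureBoost example the workflows differ, so the two runs probably cannot be compared directly. 2.1 is reader -> psi -> sbt -> evaluation, whereas 1.9 is reader -> data_transform -> sample -> scale -> binning -> one-hot -> selection -> sbt -> evaluation. The uncertain factors are:
a. How much data remains after sample in 1.9?
b. 1.9 uses binning + one-hot, so how many bins does binning use, and what is the final feature dimension?
c. Inside SecureBoost, the data is binned and then histograms are computed. In 1.9, because of one-hot, the values are only 0/1, so each feature has just 2 bins. The 2.1 workflow has no one-hot step and bins the floating-point values directly, which gives 32 bins. The gradient histograms therefore differ in size.
d. There is also selection. By that point the 1.9 features are only 0/1, so the two runs are not comparable.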
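To make point c concrete, here is a toy illustration of how the number of bins per feature changes the size of a node's gradient histogram. It is a plain-Python sketch, not FATE code: both function names are hypothetical, and the 10-feature figure is only an example.

```python
def histogram_cells(num_features, bins_per_feature):
    """Number of (feature, bin) cells in one node's gradient histogram."""
    return num_features * bins_per_feature


def build_histogram(binned_rows, gradients, num_features, bins_per_feature):
    """Sum the gradients that fall into each (feature, bin) cell."""
    hist = [[0.0] * bins_per_feature for _ in range(num_features)]
    for row, gradient in zip(binned_rows, gradients):
        for feature, bin_index in enumerate(row):
            hist[feature][bin_index] += gradient
    return hist


# Example with 10 features: 0/1 values after one-hot versus 32 bins for floats.
cells_one_hot = histogram_cells(10, 2)
cells_float = histogram_cells(10, 32)
print(cells_one_hot, cells_float)

# Tiny example: 3 samples, 2 features, 2 bins per feature.
toy_hist = build_histogram([[0, 1], [1, 1], [0, 0]], [0.5, -0.25, 1.0], 2, 2)
print(toy_hist)
```

With the same number of features, the 32-bin histogram has 16 times as many cells to accumulate and transfer as the 0/1 one, which is why the two workflows are hard to compare directly.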

RayWang99 commented 1 month ago

The greyed-out components are all skipped and not executed.

yx0090sh commented 1 month ago

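I reproduced the 1.9.2 and 2.1.0 tree models on 50,000 x 10 data. At this feature dimension, 1.9.2 is indeed somewhat faster than 2.1.0 for tree models.

2.1 sbt pipeline:
sbt_0 = HeteroSecureBoost("sbt_0", num_trees=5, max_depth=3, gh_pack="false", hist_sub="true", train_data=psi_0.outputs["output_data"])

1.9.2, 50,000 x 10 columns: sbt 4min23s, total job time 5min7s
2.1.0, 50,000 x 10 columns: sbt 9min35s, total job time 10min28s

The design adopted in FATE 2.1 uses operators more frequently, so there are more I/O operations. Your benchmark jobs have very few features, so apart from encryption, the encrypted gradient histogram computation takes a relatively small share of the time, and the time is spent elsewhere. The bottleneck therefore shows up more as I/O-bound. Here you can set the gh_pack parameter to False to speed things up further and simplify the computation. In addition, you could increase the feature dimension for a more comprehensive evaluation; in practice, data with only 10 features is fairly rare.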
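For readers unfamiliar with gh_pack: the name suggests gradient/hessian packing, where each sample's gradient and hessian are combined into one integer so that a single encrypted value carries both. The sketch below illustrates that general idea only. It is not FATE's implementation: the function names, the 20-bit precision, and the value ranges (g in [-1, 1], h in [0, 1]) are assumptions, and plain integers stand in for ciphertexts.

```python
PRECISION_BITS = 20  # fixed-point precision; an illustrative choice


def hessian_bits(num_samples):
    """Bits reserved for the hessian part so a sum over all samples cannot overflow."""
    return (num_samples << PRECISION_BITS).bit_length()


def pack(g, h, h_bits):
    """Pack gradient g in [-1, 1] and hessian h in [0, 1] into one non-negative integer."""
    g_int = round((g + 1.0) * (1 << PRECISION_BITS))  # shift g so it is non-negative
    h_int = round(h * (1 << PRECISION_BITS))
    return (g_int << h_bits) | h_int


def unpack_sum(total, count, h_bits):
    """Recover (sum of g, sum of h) from the sum of `count` packed values."""
    h_sum = (total & ((1 << h_bits) - 1)) / (1 << PRECISION_BITS)
    g_sum = (total >> h_bits) / (1 << PRECISION_BITS) - count  # undo the +1 shift
    return g_sum, h_sum


gradients = [0.5, -0.25, 0.125]
hessians = [0.25, 0.5, 0.125]
bits = hessian_bits(len(gradients))

# With an additively homomorphic scheme this sum would be taken over ciphertexts.
packed_total = sum(pack(g, h, bits) for g, h in zip(gradients, hessians))
g_sum, h_sum = unpack_sum(packed_total, len(gradients), bits)
print(g_sum, h_sum)
```

Packing halves the number of values to encrypt and add, at the cost of the extra pack and unpack steps, so whether it helps on a given workload is something to measure rather than assume.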

RayWang99 commented 1 month ago

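Thanks for the analysis. One more question: at the hardware level, if we switch to SSDs and remove the I/O read bottleneck, would 2.1 then be faster than 1.9 even on the current 10-column test data? Or does this characteristic of 2.1 put it at a fixed disadvantage whenever there are few features?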

yx0090sh commented 1 month ago

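For tree models with few features, 1.9 is somewhat faster than 2.1. When we tested with 100,000 x 300 data, 2.1 was faster than 1.9. LR does not do much disk reading and writing; it is mainly compute. Have you run the data and configuration I provided above?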

RayWang99 commented 1 month ago

At the moment we use gh_pack="false", hist_sub="true" by default. As for dimensions, we can run more tests; we will try 50,000 x 300.

dylan-fan commented 1 month ago

It is best to set gh_pack="true", hist_sub="true". Users normally do not need to set these two parameters; the system default for both is true.
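As background on the second parameter: hist_sub presumably refers to histogram subtraction, a common trick in gradient-boosted trees where only one child's histogram is computed from data and the sibling's is derived as parent minus child. The snippet below is a generic sketch of that trick with made-up numbers, not FATE code; `subtract_histograms` is a hypothetical helper.

```python
def subtract_histograms(parent, child):
    """Derive the sibling's histogram as parent minus child, cell by cell."""
    return [
        [p_cell - c_cell for p_cell, c_cell in zip(p_row, c_row)]
        for p_row, c_row in zip(parent, child)
    ]


# Rows are features, columns are bins; values are summed gradients.
parent_hist = [[3.0, 1.0], [2.0, 2.0]]
left_hist = [[1.0, 0.5], [0.5, 1.5]]  # computed from the left child's samples

right_hist = subtract_histograms(parent_hist, left_hist)
print(right_hist)
```

The saving comes from scanning only the samples of one child per split instead of both.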

RayWang99 commented 1 month ago

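[image: 2.1 (5-100)] This is version 2.1 with 50,000 rows, 100 features on the guest and 100 features on the host.
[image: 1.9 (5-100)] This is version 1.9.
One caveat: we had not noticed gh_pack="false" earlier, and both jobs kept that parameter value. At 200 dimensions the 2.1 run time changes little while the 1.9 run time increases markedly, but there is still a large gap between the two.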

RayWang99 commented 2 weeks ago

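Would it be possible to share the 300-dimension data you used for testing? The more the better. We will test in stages to find out above how many dimensions 2.1 gains the advantage.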

FederatedAI / FATE

On performance issues #5711