PaddlePaddle / Paddle

PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the core framework of PaddlePaddle 『飞桨』: high-performance single-machine and distributed training and cross-platform deployment for deep learning and machine learning)
http://www.paddlepaddle.org/
Apache License 2.0

Multi-threading not working when running the CPU version locally #5280

Closed CAOYUHUI closed 6 years ago

CAOYUHUI commented 6 years ago

In the DSSM example code I set trainer_count=32, but the CPU usage during training shows no multi-threading. Training is very slow. Any help is appreciated.

peterzhang2029 commented 6 years ago

Hi, to help pin down the problem, please provide more information, including the actual training speed, the network configuration file, the training data format, etc. Thanks.

CAOYUHUI commented 6 years ago

![Uploading 2.jpg…]() I am using the DSSM example with the fc model for a classification task; the training data format is the same as in the provided sample files.

peterzhang2029 commented 6 years ago

Hi, the image here does not show up. Please paste the corresponding code and other information as text so it can be searched. Thanks.

CAOYUHUI commented 6 years ago

The CPU usage information is as follows:

Cpu0 : 98.3% us, 1.7% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu1 : 1.7% us, 5.3% sy, 0.0% ni, 93.0% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu2 : 2.0% us, 5.3% sy, 0.0% ni, 92.7% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu3 : 2.0% us, 5.6% sy, 0.0% ni, 92.4% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu4 : 1.7% us, 5.6% sy, 0.0% ni, 92.7% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu5 : 1.7% us, 5.6% sy, 0.0% ni, 92.7% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu6 : 2.0% us, 5.3% sy, 0.0% ni, 92.7% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu7 : 2.0% us, 5.3% sy, 0.0% ni, 92.7% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu8 : 3.0% us, 6.3% sy, 0.3% ni, 90.4% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu9 : 2.3% us, 7.6% sy, 0.0% ni, 90.0% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu10 : 2.0% us, 5.3% sy, 0.0% ni, 92.7% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu11 : 2.3% us, 5.0% sy, 0.0% ni, 92.7% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu12 : 2.0% us, 5.3% sy, 0.0% ni, 92.7% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu13 : 1.7% us, 5.6% sy, 0.0% ni, 92.7% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu14 : 2.0% us, 5.3% sy, 0.0% ni, 92.7% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu15 : 2.0% us, 5.3% sy, 0.0% ni, 92.7% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu16 : 0.0% us, 0.0% sy, 0.0% ni, 100.0% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu17 : 0.0% us, 0.0% sy, 0.0% ni, 100.0% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu18 : 0.0% us, 0.0% sy, 0.0% ni, 100.0% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu19 : 0.3% us, 0.3% sy, 0.0% ni, 99.3% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu20 : 0.0% us, 0.0% sy, 0.0% ni, 100.0% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu21 : 0.0% us, 0.0% sy, 0.0% ni, 100.0% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu22 : 0.0% us, 0.0% sy, 0.0% ni, 98.3% id, 1.7% wa, 0.0% hi, 0.0% si
Cpu23 : 0.0% us, 0.0% sy, 0.0% ni, 100.0% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu24 : 0.0% us, 0.0% sy, 0.0% ni, 100.0% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu25 : 0.0% us, 0.3% sy, 0.0% ni, 99.7% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu26 : 0.0% us, 0.3% sy, 0.0% ni, 99.7% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu27 : 0.0% us, 0.0% sy, 0.0% ni, 100.0% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu28 : 0.0% us, 0.0% sy, 0.3% ni, 99.7% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu29 : 0.0% us, 0.0% sy, 0.0% ni, 100.0% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu30 : 0.0% us, 0.0% sy, 0.3% ni, 99.7% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu31 : 0.7% us, 0.3% sy, 0.0% ni, 99.0% id, 0.0% wa, 0.0% hi, 0.0% si

The script used to run this example is:

${python} train.py -y 0 --model_arch 0 --class_num 2 \
    --train_data_dir './data/train/' \
    --test_data_path './data/test_data.txt' \
    --source_dic_path './data/dict' \
    --target_dic_path './data/dict' \
    --batch_size 1024 \
    --num_passes 50 \
    --dnn_dims '128,64,32' \
    --num_workers 10 \
    --model_output_prefix './models/' \
    --share_embed True \

windy444 commented 6 years ago

I ran into a similar problem on my side. It appeared today after I updated to the latest version. My code is modified from the nce example, and essentially only the input data reading part was changed: https://github.com/PaddlePaddle/models/tree/641554898bae59e68a909e299c84074a645d5464/nce_cost

paddle.init(use_gpu=False, trainer_count=24)
optimizer = paddle.optimizer.Adam(learning_rate=3e-2)
trainer.train(
    paddle.batch(
        paddle.reader.shuffle(
            lambda: reader.train_reader(train_data, word_dict, 5)(),
            buf_size=8000),
        6400),
    num_passes=1000,
    event_handler=event_handler)

Current CPU usage:

top - 15:55:19 up 423 days, 2:01, 11 users, load average: 17.92, 27.74, 27.97
Tasks: 609 total, 1 running, 608 sleeping, 0 stopped, 0 zombie
Cpu0 : 91.4%us, 7.9%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.7%si, 0.0%st
Cpu1 : 8.7%us, 41.7%sy, 0.0%ni, 49.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 : 6.3%us, 42.9%sy, 0.3%ni, 50.5%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3 : 7.0%us, 42.4%sy, 0.0%ni, 50.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4 : 7.6%us, 41.7%sy, 0.0%ni, 50.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5 : 8.3%us, 41.3%sy, 0.0%ni, 50.5%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6 : 7.6%us, 41.7%sy, 0.0%ni, 50.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7 : 7.0%us, 42.4%sy, 0.0%ni, 50.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu8 : 7.9%us, 42.7%sy, 0.3%ni, 49.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu9 : 8.3%us, 41.7%sy, 0.3%ni, 49.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu10 : 8.3%us, 41.5%sy, 0.0%ni, 50.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu11 : 8.3%us, 41.4%sy, 0.3%ni, 50.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu12 : 8.0%us, 42.0%sy, 0.0%ni, 50.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu13 : 8.4%us, 41.8%sy, 0.0%ni, 49.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu14 : 9.6%us, 40.9%sy, 0.0%ni, 49.5%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu15 : 7.6%us, 42.2%sy, 0.3%ni, 46.5%id, 3.3%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu16 : 0.3%us, 0.0%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu17 : 0.3%us, 0.0%sy, 0.0%ni, 99.0%id, 0.7%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu18 : 0.3%us, 0.3%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu19 : 0.3%us, 0.3%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu20 : 0.3%us, 0.3%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu21 : 0.0%us, 0.3%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu22 : 0.3%us, 0.3%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu23 : 1.0%us, 0.3%sy, 0.0%ni, 98.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu24 : 0.7%us, 1.3%sy, 0.0%ni, 98.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu25 : 1.0%us, 1.7%sy, 0.0%ni, 97.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu26 : 0.7%us, 1.0%sy, 0.3%ni, 98.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu27 : 1.7%us, 1.3%sy, 0.3%ni, 96.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu28 : 0.3%us, 1.7%sy, 0.7%ni, 97.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu29 : 0.7%us, 2.0%sy, 1.0%ni, 96.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu30 : 1.0%us, 1.7%sy, 1.7%ni, 95.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu31 : 1.0%us, 2.0%sy, 0.7%ni, 94.7%id, 1.7%wa, 0.0%hi, 0.0%si, 0.0%st

But with the previous version each core could reach roughly 50% utilization. The overall running time is now about 5x what it was before.

When installing the latest version I saw this error: "PaddlePaddle wasn't compiled to use avx instructions, but these are available on your machine". I fixed it later, but the CPU behavior and running time are similar before and after the fix.

Another question: I have never been able to observe any difference in running time between single-machine multi-threaded and single-machine single-threaded training. I adjusted batch_size and the learning rate, but never saw an effect. At least with the previous version the CPU utilization did go up.

typhoonzero commented 6 years ago

In both cases above you can see that CPU0 is already fully saturated: Cpu0 : 91.4%us, 7.9%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.7%si, 0.0%st. On Linux, CPU0 usually handles system interrupts, scheduling, and similar work, so it can become the bottleneck on SMP systems.

Suggested fix: use Linux taskset to set CPU affinity so that the paddle process does not use CPU0, or find the other processes occupying CPU0 and pin them to CPUs other than CPU0. Also, if NIC interrupt handling is slow, you can consider tools such as irqbalance.
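For example, a minimal sketch of the taskset approach (the CPU range 1-31 and the train.py entry point are assumptions for a 32-CPU box; adjust them to your machine):

# optional: check which CPU is handling the NIC interrupts
grep -i eth /proc/interrupts

# start training pinned to CPUs 1-31, keeping CPU0 free for interrupts and scheduling
taskset -c 1-31 python train.py ...

# or re-pin an already running process by PID
taskset -cp 1-31 <PID>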

References:

https://linux.die.net/man/1/taskset

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_MRG/1.3/html/Realtime_Reference_Guide/chap-Realtime_Reference_Guide-Affinity.html

https://unix.stackexchange.com/questions/73/how-can-i-set-the-processor-affinity-of-a-process-on-linux

windy444 commented 6 years ago

The one I posted doesn't count as completely saturated, does it? CPU0 is around 90%, 93% at most. @typhoonzero

Below is the situation with affinity set to 2. Other than this job, no other task is heavily using resources.

Cpu0 : 90.7%us, 2.3%sy, 0.0%ni, 6.6%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st Cpu1 : 4.0%us, 20.1%sy, 0.0%ni, 75.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu2 : 11.8%us, 19.1%sy, 0.3%ni, 68.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu3 : 4.3%us, 19.9%sy, 0.0%ni, 75.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu4 : 4.7%us, 19.6%sy, 0.7%ni, 75.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu5 : 4.7%us, 19.6%sy, 0.0%ni, 75.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu6 : 4.7%us, 19.3%sy, 0.0%ni, 76.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu7 : 4.0%us, 19.9%sy, 0.0%ni, 76.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu8 : 4.3%us, 20.3%sy, 0.3%ni, 75.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu9 : 4.0%us, 20.1%sy, 0.0%ni, 75.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu10 : 4.0%us, 20.6%sy, 0.0%ni, 75.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu11 : 5.0%us, 19.9%sy, 0.3%ni, 74.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu12 : 5.0%us, 19.6%sy, 0.0%ni, 75.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu13 : 5.0%us, 19.6%sy, 0.3%ni, 75.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu14 : 6.0%us, 19.0%sy, 0.3%ni, 74.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu15 : 5.0%us, 19.9%sy, 0.0%ni, 75.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu16 : 0.0%us, 1.0%sy, 0.3%ni, 98.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu17 : 0.0%us, 0.3%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu18 : 0.3%us, 0.0%sy, 0.3%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu19 : 0.3%us, 0.0%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu20 : 0.3%us, 0.3%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu21 : 0.0%us, 0.3%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu22 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu23 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu24 : 0.3%us, 0.3%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu25 : 0.7%us, 0.7%sy, 0.3%ni, 98.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu26 : 0.7%us, 1.0%sy, 0.0%ni, 97.4%id, 1.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu27 : 0.3%us, 1.0%sy, 0.0%ni, 98.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu28 : 0.3%us, 0.7%sy, 1.0%ni, 98.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu29 : 0.7%us, 1.3%sy, 0.3%ni, 97.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu30 : 0.3%us, 2.0%sy, 0.7%ni, 96.7%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st Cpu31 : 1.3%us, 2.3%sy, 0.3%ni, 96.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st

typhoonzero commented 6 years ago

@windy444 In Cpu0 : 91.4%us, 7.9%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.7%si, 0.0%st, the us, sy, and si fields are all usage: user-space CPU, kernel-space CPU, and soft interrupts, respectively. id is idle time, so you can see CPU0 is at 100%.

In the second CPU usage snapshot, CPU0 still shows very high utilization even though there are no other tasks.

You can also check whether the reader has become the training bottleneck: use paddle.v2.reader.buffered to buffer the reader data and improve throughput. See: http://doc.paddlepaddle.org/release/0.10.0/doc/api/v2/data.html#reader

Yancey1989 commented 6 years ago

Hi @windy444, could you check whether it is the reader part that consumes the CPU (the Python program itself only occupies one core)?

You can try Python's profiling tools: https://docs.python.org/2/library/profile.html
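For example, a minimal sketch with cProfile (train.py and the output file name are placeholders):

python -m cProfile -o train.prof train.py ...
# show the 20 entries with the largest cumulative time
python -c "import pstats; pstats.Stats('train.prof').sort_stats('cumulative').print_stats(20)"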

windy444 commented 6 years ago

@typhoonzero I tried buffered, but the actual running time is not much different. I'm not sure whether I used it incorrectly.

trainer.train(
    paddle.batch(
        paddle.reader.shuffle(
            lambda: paddle.reader.buffered(reader.train_reader(train_data, word_dict, 5), 1000)(),
            buf_size=2000),
        640),
    num_passes=1000,
    event_handler=event_handler)

Also, I found that even when I run with a single thread, many cores are occupied. With 24 cores the situation is about the same as this:

top - 14:59:56 up 427 days, 1:05, 9 users, load average: 10.70, 8.82, 6.54
Tasks: 609 total, 1 running, 608 sleeping, 0 stopped, 0 zombie
Cpu0 : 99.7%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu1 : 10.3%us, 45.5%sy, 0.3%ni, 43.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 : 10.6%us, 45.2%sy, 0.0%ni, 44.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3 : 11.6%us, 44.5%sy, 0.0%ni, 43.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4 : 10.6%us, 45.2%sy, 0.0%ni, 44.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5 : 10.0%us, 45.8%sy, 0.0%ni, 44.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6 : 9.9%us, 45.9%sy, 0.3%ni, 43.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7 : 11.6%us, 44.2%sy, 0.3%ni, 43.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu8 : 11.0%us, 46.2%sy, 0.7%ni, 42.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu9 : 10.3%us, 45.8%sy, 0.0%ni, 43.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu10 : 11.3%us, 45.7%sy, 0.0%ni, 43.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu11 : 10.6%us, 46.0%sy, 0.3%ni, 43.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu12 : 10.3%us, 46.2%sy, 0.3%ni, 43.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu13 : 10.6%us, 45.8%sy, 0.0%ni, 43.5%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu14 : 11.6%us, 44.7%sy, 0.3%ni, 43.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu15 : 11.6%us, 44.2%sy, 0.3%ni, 43.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu16 : 0.0%us, 3.3%sy, 1.0%ni, 95.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu17 : 0.3%us, 3.3%sy, 0.7%ni, 95.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu18 : 1.0%us, 3.7%sy, 0.7%ni, 94.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu19 : 0.0%us, 2.0%sy, 0.3%ni, 97.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu20 : 0.3%us, 1.3%sy, 0.3%ni, 98.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu21 : 0.0%us, 2.7%sy, 1.7%ni, 95.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu22 : 0.7%us, 4.7%sy, 1.7%ni, 93.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu23 : 0.7%us, 2.0%sy, 0.7%ni, 96.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu24 : 0.7%us, 7.0%sy, 2.0%ni, 90.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu25 : 0.3%us, 11.0%sy, 4.0%ni, 84.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu26 : 0.7%us, 6.4%sy, 2.0%ni, 91.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu27 : 2.0%us, 6.3%sy, 1.7%ni, 90.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu28 : 0.3%us, 16.8%sy, 1.7%ni, 80.5%id, 0.7%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu29 : 0.3%us, 8.0%sy, 2.0%ni, 89.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu30 : 0.0%us, 6.4%sy, 3.7%ni, 89.6%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu31 : 0.3%us, 9.0%sy, 1.7%ni, 89.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st

typhoonzero commented 6 years ago

@windy444 The usage is correct; you can set buf_size larger. Also, could you show the CPU utilization when the machine is idle?

windy444 commented 6 years ago

@typhoonzero CPU usage when idle:

top - 17:55:14 up 427 days, 4:00, 9 users, load average: 5.24, 7.62, 5.55
Tasks: 602 total, 1 running, 601 sleeping, 0 stopped, 0 zombie
Cpu0 : 0.0%us, 0.3%sy, 0.0%ni, 98.7%id, 1.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 0.3%us, 0.0%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 : 0.3%us, 0.3%sy, 0.3%ni, 99.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3 : 0.3%us, 0.3%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4 : 0.3%us, 0.0%sy, 0.3%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5 : 0.0%us, 0.3%sy, 0.3%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6 : 0.3%us, 0.3%sy, 0.7%ni, 98.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7 : 0.0%us, 0.7%sy, 0.7%ni, 98.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu8 : 0.3%us, 0.0%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu9 : 0.3%us, 0.3%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu10 : 0.0%us, 0.3%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu11 : 0.3%us, 0.3%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu12 : 0.3%us, 0.0%sy, 0.3%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu13 : 0.0%us, 0.7%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu14 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu15 : 0.3%us, 0.0%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu16 : 0.0%us, 0.7%sy, 1.4%ni, 98.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu17 : 0.3%us, 1.0%sy, 0.3%ni, 98.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu18 : 0.7%us, 1.3%sy, 0.0%ni, 98.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu19 : 0.0%us, 0.3%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu20 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu21 : 0.3%us, 0.0%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu22 : 0.3%us, 0.3%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu23 : 0.0%us, 0.3%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu24 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu25 : 0.3%us, 0.7%sy, 0.0%ni, 99.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu26 : 0.3%us, 0.7%sy, 0.3%ni, 98.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu27 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu28 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu29 : 0.3%us, 0.7%sy, 0.0%ni, 99.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu30 : 0.0%us, 0.3%sy, 0.3%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu31 : 0.0%us, 0.3%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st

After increasing buf_size (with --use_gpu=False --trainer_count=1):

trainer.train(
    paddle.batch(
        paddle.reader.shuffle(
            lambda: paddle.reader.buffered(reader.train_reader(train_data, word_dict, 5), 100000)(),
            buf_size=2000),
        640),
    num_passes=1000,
    event_handler=event_handler)

top - 17:58:02 up 427 days, 4:03, 10 users, load average: 5.02, 5.60, 5.06 Tasks: 609 total, 2 running, 607 sleeping, 0 stopped, 0 zombie Cpu0 : 99.3%us, 0.3%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st Cpu1 : 11.6%us, 45.2%sy, 0.3%ni, 42.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu2 : 11.2%us, 45.4%sy, 0.3%ni, 42.8%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st Cpu3 : 10.6%us, 48.3%sy, 0.0%ni, 41.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu4 : 10.0%us, 46.8%sy, 0.0%ni, 43.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu5 : 10.9%us, 45.7%sy, 0.3%ni, 43.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu6 : 11.9%us, 44.9%sy, 0.0%ni, 43.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu7 : 11.3%us, 45.5%sy, 0.3%ni, 42.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu8 : 10.6%us, 46.0%sy, 0.0%ni, 43.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu9 : 10.6%us, 46.2%sy, 0.0%ni, 43.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu10 : 11.3%us, 45.4%sy, 0.0%ni, 43.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu11 : 10.9%us, 46.2%sy, 0.0%ni, 42.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu12 : 9.2%us, 47.4%sy, 0.0%ni, 43.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu13 : 11.3%us, 45.4%sy, 0.0%ni, 43.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu14 : 11.5%us, 45.1%sy, 0.0%ni, 43.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu15 : 11.9%us, 44.9%sy, 0.0%ni, 43.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu16 : 0.3%us, 3.7%sy, 0.3%ni, 95.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu17 : 1.0%us, 4.3%sy, 0.7%ni, 94.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu18 : 0.7%us, 2.3%sy, 0.3%ni, 96.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu19 : 0.7%us, 1.3%sy, 0.7%ni, 97.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu20 : 0.3%us, 1.4%sy, 0.7%ni, 97.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu21 : 0.3%us, 1.3%sy, 2.0%ni, 96.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu22 : 0.3%us, 0.7%sy, 0.3%ni, 98.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu23 : 0.3%us, 1.3%sy, 1.7%ni, 96.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu24 : 0.3%us, 1.3%sy, 0.3%ni, 98.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu25 : 0.3%us, 1.0%sy, 0.0%ni, 98.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu26 : 0.3%us, 1.0%sy, 0.3%ni, 98.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu27 : 0.0%us, 5.0%sy, 0.3%ni, 94.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu28 : 97.7%us, 2.3%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu29 : 0.0%us, 0.3%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu30 : 0.3%us, 0.7%sy, 0.7%ni, 98.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu31 : 0.3%us, 2.6%sy, 0.3%ni, 96.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st

Yancey1989 commented 6 years ago

Hi @windy444, could you show the CPU utilization of the python process when trainer_count=1 and trainer_count=2?
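For example, one quick way to look at just the python training process (a sketch; the pgrep pattern is an assumption):

# one-shot CPU usage of the training process
top -b -n 1 -p $(pgrep -f train.py | head -1)
# or a one-line summary
ps -o pid,pcpu,pmem,cmd -p $(pgrep -f train.py | head -1)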

windy444 commented 6 years ago

@Yancey1989 First, I confirmed that when paddle is not running, overall CPU usage is basically 0.

trainer_count=1

top - 17:00:50 up 430 days, 3:06, 9 users, load average: 4.25, 1.64, 0.78
Tasks: 607 total, 1 running, 605 sleeping, 1 stopped, 0 zombie
Cpu0 :100.0%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 2.3%us, 10.7%sy, 0.7%ni, 86.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 : 2.3%us, 10.4%sy, 0.7%ni, 86.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3 : 3.0%us, 10.0%sy, 0.3%ni, 86.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4 : 2.7%us, 10.3%sy, 0.3%ni, 86.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5 : 2.3%us, 10.0%sy, 0.3%ni, 87.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6 : 3.0%us, 10.0%sy, 0.0%ni, 87.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7 : 3.0%us, 9.7%sy, 0.7%ni, 86.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu8 : 2.3%us, 10.0%sy, 0.0%ni, 87.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu9 : 3.0%us, 9.7%sy, 0.7%ni, 86.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu10 : 2.7%us, 10.0%sy, 0.0%ni, 87.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu11 : 2.3%us, 11.0%sy, 0.0%ni, 86.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu12 : 2.7%us, 10.1%sy, 0.0%ni, 87.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu13 : 3.0%us, 9.9%sy, 0.0%ni, 87.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu14 : 2.7%us, 9.7%sy, 0.3%ni, 87.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu15 : 1.7%us, 10.6%sy, 0.0%ni, 87.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu16 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu17 : 0.3%us, 2.3%sy, 1.0%ni, 96.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu18 : 0.7%us, 2.3%sy, 0.3%ni, 96.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu19 : 0.7%us, 1.7%sy, 0.3%ni, 97.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu20 : 0.3%us, 1.0%sy, 0.3%ni, 98.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu21 : 0.7%us, 0.7%sy, 0.0%ni, 98.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu22 : 0.3%us, 0.3%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu23 : 1.0%us, 1.3%sy, 1.0%ni, 96.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu24 : 0.0%us, 0.7%sy, 0.3%ni, 99.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu25 : 0.7%us, 1.7%sy, 0.3%ni, 97.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu26 : 0.3%us, 1.7%sy, 0.7%ni, 97.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu27 : 0.3%us, 0.3%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu28 : 0.3%us, 0.7%sy, 0.3%ni, 98.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu29 : 0.0%us, 0.7%sy, 0.3%ni, 99.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu30 : 0.0%us, 1.0%sy, 0.7%ni, 98.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu31 : 15.3%us, 54.5%sy, 0.0%ni, 30.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st

trainer_count=2 top - 17:04:25 up 430 days, 3:10, 9 users, load average: 2.99, 2.10, 1.14 Tasks: 608 total, 1 running, 606 sleeping, 1 stopped, 0 zombie Cpu0 : 99.7%us, 0.3%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu1 : 2.3%us, 10.7%sy, 0.3%ni, 86.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu2 : 2.3%us, 10.3%sy, 0.7%ni, 86.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu3 : 2.6%us, 10.3%sy, 0.3%ni, 86.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu4 : 2.3%us, 10.7%sy, 0.3%ni, 86.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu5 : 2.3%us, 10.3%sy, 0.0%ni, 87.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu6 : 2.3%us, 10.6%sy, 0.0%ni, 87.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu7 : 2.3%us, 10.3%sy, 0.3%ni, 87.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu8 : 2.7%us, 10.3%sy, 0.3%ni, 86.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu9 : 2.7%us, 10.3%sy, 0.0%ni, 87.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu10 : 3.0%us, 10.3%sy, 0.0%ni, 86.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu11 : 2.7%us, 10.7%sy, 0.0%ni, 86.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu12 : 2.3%us, 11.1%sy, 0.0%ni, 86.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu13 : 2.7%us, 10.4%sy, 0.3%ni, 86.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu14 : 2.7%us, 10.7%sy, 0.3%ni, 86.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu15 : 2.3%us, 11.0%sy, 0.0%ni, 86.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu16 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu17 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu18 : 0.3%us, 0.3%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu19 : 0.0%us, 0.7%sy, 0.3%ni, 99.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu20 : 0.3%us, 0.0%sy, 0.3%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu21 : 0.3%us, 0.3%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu22 : 0.0%us, 0.3%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu23 : 0.3%us, 0.3%sy, 0.3%ni, 99.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu24 : 0.3%us, 0.0%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu25 : 0.3%us, 0.3%sy, 0.3%ni, 99.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu26 : 0.3%us, 0.7%sy, 0.0%ni, 99.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu27 : 0.3%us, 0.7%sy, 0.3%ni, 98.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu28 : 0.7%us, 0.7%sy, 0.0%ni, 98.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu29 : 0.3%us, 1.7%sy, 0.7%ni, 97.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu30 : 0.0%us, 1.0%sy, 1.0%ni, 98.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu31 : 0.7%us, 1.0%sy, 0.0%ni, 98.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st

Yancey1989 commented 6 years ago

@windy444 Something like this:

trainer_count=1

%Cpu0  : 76.5 us,  1.3 sy,  0.0 ni, 20.2 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  0.0 us,  0.3 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  : 18.9 us,  0.3 sy,  0.0 ni, 80.1 id,  0.7 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu4  :  0.0 us,  0.3 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu5  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu6  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu7  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu8  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu9  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu10 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu11 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu12 :  4.3 us,  0.7 sy,  0.0 ni, 94.7 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu13 :  3.0 us,  0.7 sy,  0.0 ni, 96.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu14 :  0.7 us,  1.3 sy,  0.0 ni, 98.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu15 :  0.3 us,  0.7 sy,  0.0 ni, 99.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu16 :  0.7 us,  0.3 sy,  0.0 ni, 99.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu17 :  1.3 us,  0.7 sy,  0.0 ni, 98.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu18 : 13.0 us,  1.3 sy,  0.0 ni, 85.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu19 :  2.0 us,  2.3 sy,  0.0 ni, 95.3 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu20 : 13.6 us,  1.3 sy,  0.0 ni, 85.1 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu21 :  3.3 us,  1.7 sy,  0.0 ni, 95.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu22 :  2.6 us,  0.7 sy,  0.0 ni, 96.4 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu23 : 98.3 us,  0.0 sy,  0.0 ni,  1.3 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
KiB Mem : 26404276+total, 57028208 free, 25900816 used, 18111374+buff/cache
KiB Swap:   975868 total,        0 free,   975868 used. 23048297+avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
15099 root      20   0 2066412 651536  49528 R 100.0  0.2   0:07.46 python train.py -y 0 --model_arch 0 --class_num=2 --num_passes=100 --num_workers=1

trainer_count=2

%Cpu0  :  6.2 us,  6.2 sy,  0.0 ni, 87.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  : 93.8 us,  6.2 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  : 41.2 us,  5.9 sy,  0.0 ni, 52.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu4  : 93.3 us,  0.0 sy,  0.0 ni,  6.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu5  :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu6  : 93.8 us,  6.2 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu7  :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu8  : 50.0 us, 12.5 sy,  0.0 ni, 37.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu9  : 93.8 us,  6.2 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu10 : 93.8 us,  6.2 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu11 : 93.8 us,  6.2 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu12 :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu13 : 88.2 us,  5.9 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  5.9 si,  0.0 st
%Cpu14 : 93.8 us,  6.2 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu15 : 26.7 us,  6.7 sy,  0.0 ni, 66.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu16 : 93.3 us,  6.7 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu17 : 93.8 us,  6.2 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu18 : 93.8 us,  6.2 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu19 :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu20 : 93.8 us,  6.2 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu21 : 88.2 us,  5.9 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  5.9 si,  0.0 st
%Cpu22 : 56.2 us,  6.2 sy,  0.0 ni, 37.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu23 : 93.8 us,  6.2 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 26404276+total, 57017188 free, 25997252 used, 18102832+buff/cache
KiB Swap:   975868 total,        0 free,   975868 used. 23038672+avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
15303 root      20   0 3542044 660096  49088 R  1894  0.2   0:20.57 python train.py -y 0 --model_arch 0 --class_num=2 --num_passes=100 --num_workers=2

windy444 commented 6 years ago

@Yancey1989 trainer_count=1 `top - 17:48:28 up 430 days, 3:54, 9 users, load average: 2.65, 2.37, 1.61 Tasks: 1 total, 0 running, 1 sleeping, 0 stopped, 0 zombie Cpu0 :100.0%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu1 : 3.7%us, 13.0%sy, 0.3%ni, 83.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu2 : 3.7%us, 13.3%sy, 0.0%ni, 83.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu3 : 3.0%us, 13.7%sy, 0.0%ni, 83.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu4 : 3.7%us, 13.3%sy, 0.0%ni, 83.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu5 : 3.3%us, 13.4%sy, 0.0%ni, 83.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu6 : 2.7%us, 13.8%sy, 0.0%ni, 83.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu7 : 3.3%us, 13.4%sy, 0.0%ni, 83.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu8 : 4.0%us, 12.7%sy, 1.0%ni, 82.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu9 : 3.3%us, 13.4%sy, 0.0%ni, 83.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu10 : 3.7%us, 13.0%sy, 1.0%ni, 82.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu11 : 2.7%us, 14.0%sy, 0.0%ni, 83.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu12 : 3.3%us, 13.3%sy, 0.0%ni, 83.0%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st Cpu13 : 3.4%us, 13.6%sy, 0.7%ni, 82.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu14 : 3.7%us, 13.1%sy, 0.0%ni, 83.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu15 : 3.7%us, 13.0%sy, 0.0%ni, 83.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu16 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu17 : 0.3%us, 0.7%sy, 0.0%ni, 98.3%id, 0.7%wa, 0.0%hi, 0.0%si, 0.0%st Cpu18 : 0.3%us, 1.0%sy, 0.0%ni, 98.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu19 : 0.3%us, 0.3%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu20 : 0.0%us, 0.3%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu21 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu22 : 0.3%us, 0.3%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu23 : 0.7%us, 1.0%sy, 0.3%ni, 98.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu24 : 0.3%us, 0.3%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu25 : 0.3%us, 1.3%sy, 0.0%ni, 98.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu26 : 0.3%us, 1.0%sy, 1.0%ni, 97.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu27 : 0.3%us, 0.0%sy, 0.3%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu28 : 0.3%us, 0.7%sy, 0.0%ni, 99.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu29 : 0.3%us, 0.7%sy, 0.3%ni, 98.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu30 : 0.0%us, 0.3%sy, 0.7%ni, 99.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu31 : 0.7%us, 0.7%sy, 0.0%ni, 98.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Mem: 131836552k total, 125252140k used, 6584412k free, 412236k buffers Swap: 0k total, 0k used, 0k free, 63336072k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
18522 work 20 0 3644m 1.3g 22m S 344.0 1.0 3:54.04 python `

trainer_count=2 `top - 17:43:27 up 430 days, 3:49, 9 users, load average: 4.87, 2.75, 1.40 Tasks: 1 total, 0 running, 1 sleeping, 0 stopped, 0 zombie Cpu0 : 99.0%us, 0.3%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.7%si, 0.0%st Cpu1 : 2.7%us, 10.7%sy, 0.7%ni, 85.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu2 : 2.3%us, 11.4%sy, 0.3%ni, 85.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu3 : 2.0%us, 11.4%sy, 0.3%ni, 86.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu4 : 3.4%us, 10.4%sy, 0.7%ni, 85.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu5 : 2.3%us, 11.4%sy, 0.7%ni, 85.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu6 : 2.3%us, 11.4%sy, 0.0%ni, 86.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu7 : 2.7%us, 11.0%sy, 0.7%ni, 85.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu8 : 2.7%us, 11.0%sy, 0.0%ni, 86.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu9 : 2.3%us, 11.3%sy, 0.0%ni, 86.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu10 : 2.3%us, 11.4%sy, 0.0%ni, 86.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu11 : 2.0%us, 11.7%sy, 0.0%ni, 86.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu12 : 2.7%us, 11.0%sy, 0.0%ni, 86.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu13 : 2.3%us, 11.0%sy, 0.0%ni, 86.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu14 : 3.0%us, 10.9%sy, 0.0%ni, 86.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu15 : 2.0%us, 11.4%sy, 0.0%ni, 86.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu16 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu17 : 0.0%us, 0.0%sy, 0.3%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu18 : 0.3%us, 0.3%sy, 0.3%ni, 99.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu19 : 0.3%us, 0.3%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu20 : 0.0%us, 0.3%sy, 0.3%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu21 : 0.0%us, 0.3%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu22 : 0.0%us, 0.3%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu23 : 1.0%us, 1.4%sy, 0.3%ni, 97.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu24 : 0.0%us, 0.3%sy, 0.3%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu25 : 0.3%us, 0.3%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu26 : 0.0%us, 0.7%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu27 : 0.3%us, 1.0%sy, 0.0%ni, 98.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu28 : 0.3%us, 1.0%sy, 0.0%ni, 98.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu29 : 0.3%us, 0.7%sy, 0.0%ni, 99.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu30 : 0.0%us, 0.7%sy, 0.7%ni, 98.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu31 : 0.3%us, 1.0%sy, 0.0%ni, 98.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Mem: 131836552k total, 125839760k used, 5996792k free, 412228k buffers Swap: 0k total, 0k used, 0k free, 63329476k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
6429 work 20 0 4463m 1.9g 22m S 297.7 1.5 11:03.32 python `

luotao1 commented 6 years ago

But with the previous version each core could reach roughly 50% utilization. The overall running time is now about 5x what it was before.

What version were you on before? @windy444

Yancey1989 commented 6 years ago

From the gdb session, there are many iomp threads even with trainer_count=1:

(gdb) info thread
  Id   Target Id         Frame
* 1    Thread 0x7f915f7d8700 (LWP 51) "python" __memset_avx2 () at ../sysdeps/x86_64/multiarch/memset-avx2.S:161
  2    Thread 0x7f9147652700 (LWP 79) "python" runtime.futex () at /usr/local/go/src/runtime/sys_linux_amd64.s:423
  3    Thread 0x7f9147e53700 (LWP 80) "python" runtime.futex () at /usr/local/go/src/runtime/sys_linux_amd64.s:423
  4    Thread 0x7f9148654700 (LWP 81) "python" runtime.futex () at /usr/local/go/src/runtime/sys_linux_amd64.s:423
  5    Thread 0x7f9148e55700 (LWP 82) "python" runtime.futex () at /usr/local/go/src/runtime/sys_linux_amd64.s:423
  6    Thread 0x7f911b34b700 (LWP 83) "python" runtime.futex () at /usr/local/go/src/runtime/sys_linux_amd64.s:423
  7    Thread 0x7f9111254780 (LWP 84) "python" 0x00007f911b3e4bd6 in _INTERNAL_25_______src_kmp_barrier_cpp_34128d84::__kmp_hyper_barrier_release(barrier_type, kmp_info*, int, int, int, void*) () from /usr/local/lib/libiomp5.so
  8    Thread 0x7f9110e53800 (LWP 85) "python" 0x00007f911b3e4c61 in _INTERNAL_25_______src_kmp_barrier_cpp_34128d84::__kmp_hyper_barrier_release(barrier_type, kmp_info*, int, int, int, void*) () from /usr/local/lib/libiomp5.so
  9    Thread 0x7f9110a52880 (LWP 86) "python" 0x00007f915f0c07f7 in sched_yield () at ../sysdeps/unix/syscall-template.S:84
  10   Thread 0x7f9110651900 (LWP 87) "python" 0x00007f911b3e4c61 in _INTERNAL_25_______src_kmp_barrier_cpp_34128d84::__kmp_hyper_barrier_release(barrier_type, kmp_info*, int, int, int, void*) () from /usr/local/lib/libiomp5.so
  11   Thread 0x7f90e3ffc980 (LWP 88) "python" 0x00007f911b3e4c68 in _INTERNAL_25_______src_kmp_barrier_cpp_34128d84::__kmp_hyper_barrier_release(barrier_type, kmp_info*, int, int, int, void*) () from /usr/local/lib/libiomp5.so
  12   Thread 0x7f90e3bfba00 (LWP 89) "python" 0x00007f911b3e4cc2 in _INTERNAL_25_______src_kmp_barrier_cpp_34128d84::__kmp_hyper_barrier_release(barrier_type, kmp_info*, int, int, int, void*) () from /usr/local/lib/libiomp5.so
  13   Thread 0x7f90e37faa80 (LWP 90) "python" 0x00007f915f0c07f7 in sched_yield () at ../sysdeps/unix/syscall-template.S:84
  14   Thread 0x7f90e33f9b00 (LWP 91) "python" 0x00007f911b3e4c5c in _INTERNAL_25_______src_kmp_barrier_cpp_34128d84::__kmp_hyper_barrier_release(barrier_type, kmp_info*, int, int, int, void*) () from /usr/local/lib/libiomp5.so
...

After talking with @luotao1, I learned that MKL automatically uses all available CPUs to optimize compute performance, so regarding:

Also, I found that even when I run with a single thread, many cores are occupied. With 24 cores the situation is about the same.

This should be MKL's automatic performance optimization.

typhoonzero commented 6 years ago

If iomp is enabled by default, can we set the number of iomp threads to speed up training? Or how do we disable iomp and then use trainer_count for the speedup instead?

luotao1 commented 6 years ago

When accelerating with MKL, you need to set a few environment variables to get the best speedup:

unset OMP_NUM_THREADS MKL_NUM_THREADS
export OMP_DYNAMIC="FALSE"
export KMP_AFFINITY="granularity=fine,compact,0,0"   # if hyper-threading is NOT enabled on this machine
# export KMP_AFFINITY="granularity=fine,compact,1,0" # if hyper-threading IS enabled on this machine
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1

Yancey1989 commented 6 years ago

Do these environment variables need to match trainer_count? I found that with OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 and trainer_count=10, only one core is actually used.

tensor-tang commented 6 years ago

Putting together the issues from @CAOYUHUI and @windy444, it roughly looks like the cause is that cores are not being bound.

  1. First, the trainer_count=1 case. The fraction of time during which multiple cores are actually used is not fixed; it depends on the workload being run and on how the network is written. Since CPU utilization keeps changing, judge it roughly by the peaks: check whether there are moments when all cores are saturated. Also, with trainer_count=1, Paddle now automatically uses multiple cores, so in principle, as long as your hardware supports it, you should still see moments when all cores are busy (as said above, not all of the time). If you never see such a moment, you can suspect that cores are not bound. Hyper-threading is usually enabled, so bind cores with export KMP_AFFINITY="granularity=fine,compact,1,0" and look again. If you do not know whether hyper-threading is on, check with lscpu | grep "per core"; a value > 1 means it is on. In this case OMP_NUM_THREADS and MKL_NUM_THREADS should be unset, since machines normally do not set these two variables by default.

  2. Once trainer_count is decided, consider the case where it is > 1. Here you need to set OMP_NUM_THREADS to get the best performance (MKL_NUM_THREADS can generally be set to the same value); this keeps Paddle at a relatively high utilization all the time. The value of OMP_NUM_THREADS should be chosen according to how many hardware threads the machine actually has. Suppose the maximum number of CPUs available to the system is MAX_N; if this job is the only one on the system, then OMP_NUM_THREADS = int(MAX_N / trainer_count) gives the best performance. As for how many trainers give the best overall performance: performance does not scale linearly with the number of cores, so set it according to your situation. For example, with MAX_N=50 and batch_size=64, trainer_count=8 with OMP_NUM_THREADS=6 is fine (see the sketch below).
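A minimal sketch of those settings for the MAX_N=50, trainer_count=8 example above (the numbers and the launch command are only illustrative, following the commands used earlier in this thread):

# 50 usable logical CPUs / 8 trainers -> 6 OMP/MKL threads per trainer
export KMP_AFFINITY="granularity=fine,compact,1,0"   # assuming hyper-threading is on
export OMP_NUM_THREADS=6
export MKL_NUM_THREADS=6
python train.py --num_workers 8 ...                  # --num_workers is the trainer count flag used by the DSSM train.py in this thread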

luotao1 commented 6 years ago

Thanks @tensor-tang for the detailed answer. A few remaining questions:

  1. Can the core-binding step be packaged into the docker image?
  2. Regarding "suppose the maximum number of CPUs available to the system is MAX_N": is that the actual number of CPU cores? What command should be used to measure it?
tensor-tang commented 6 years ago

  1. That should be possible.
  2. MAX_N refers to the maximum number of CPUs shown in the CPU-usage listings posted above. It is not necessarily the number of physical cores. You can check it with top or lscpu.
luotao1 commented 6 years ago

@CAOYUHUI and @windy444: please first use the following script to bind cores and set the optimal MKL_NUM_THREADS and OMP_NUM_THREADS. It works both in a local Linux environment and inside docker:

#!/bin/bash 

logicalNumber=$(grep "processor" /proc/cpuinfo|sort -u|wc -l)
physicalNumber=$(grep "physical id" /proc/cpuinfo|sort -u|wc -l)
coreNumber=$(grep "cpu cores" /proc/cpuinfo|uniq|awk -F':' '{print $2}'|xargs)
HT=$((logicalNumber / (physicalNumber * coreNumber))) 

echo "****** CPU Information ******"
echo "Logical CPU Number  : ${logicalNumber}"
echo "Physical CPU Number : ${physicalNumber}"
echo "CPU Core Number     : ${coreNumber}"

if [ ${HT} -ne 1 ]; then
    echo "Hyper Threading(HT) : ON"
    export KMP_AFFINITY="granularity=fine,compact,1,0"
else
    echo "Hyper Threading(HT) : OFF"
    export KMP_AFFINITY="granularity=fine,compact,0,0"
fi

echo "********** Settings *********"
unset OMP_NUM_THREADS MKL_NUM_THREADS
trainerCount=$1
numThreads=$((logicalNumber / trainerCount))
export OMP_NUM_THREADS=${numThreads}
export MKL_NUM_THREADS=${numThreads}

echo "Trainer Count      : ${trainerCount}"
echo "OMP_NUM_THREADS    : ${OMP_NUM_THREADS}"
echo "MKL_NUM_THREADS    : ${MKL_NUM_THREADS}"

Save the script above as cpu_configure.sh; usage is as follows:

sh cpu_configure.sh TRAINER_COUNT

This is the result of running it on my server:

$ sh cpu_configure.sh 2
****** CPU Information ******
Logical CPU Number  : 12
Physical CPU Number : 2
CPU Core Number     : 6
Hyper Threading(HT) : OFF
********** Settings *********
Trainer Count      : 2
OMP_NUM_THREADS    : 6
MKL_NUM_THREADS    : 6

Later, @tensor-tang will add this functionality to the source code.

luotao1 commented 6 years ago

@CAOYUHUI and @windy444, please update your code. When using MKL, core binding and the optimal MKL_NUM_THREADS and OMP_NUM_THREADS are now set automatically.

Bella-Zhao commented 6 years ago

@luotao1 I tested the script above; in my script I run:

sh cpu_configure.sh ${TRAINER_COUNT}
python train.py \
    --train_data_path /home/work/zhaoyijin/video-recsys-model/dssm/train_data_dir/train/train \
    --test_data_path /home/work/zhaoyijin/video-recsys-model/dssm/test_data_dir/test/test \
    --dic_path /home/work/zhaoyijin/video-recsys-model/dssm/dict_data_dir/feature_dict \
    --batch_size 1000 \
    --num_passes 17 \
    --model_type 0 \
    --share_network_between_source_target FALSE \
    --share_embed FALSE \
    --dnn_dims 512,216,216,216,128 \
    --num_workers ${TRAINER_COUNT} \
    --use_gpu FALSE \
    --class_num 2 \
    --model_output_prefix ./output_model/ \
    --num_batches_to_log 1

With TRAINER_COUNT=24:

****** CPU Information ******
Logical CPU Number  : 32
Physical CPU Number : 2
CPU Core Number     : 8
Hyper Threading(HT) : ON
********** Settings *********
Trainer Count      : 24
OMP_NUM_THREADS    : 1
MKL_NUM_THREADS    : 1

top - 15:24:26 up 437 days,  1:30, 10 users,  load average: 4.59, 3.52, 3.02
Tasks: 615 total,   3 running, 612 sleeping,   0 stopped,   0 zombie
Cpu0  : 96.4%us,  3.6%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  : 13.6%us, 50.7%sy,  0.0%ni, 35.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  : 14.3%us, 49.7%sy,  0.0%ni, 36.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  : 14.9%us, 49.3%sy,  0.0%ni, 35.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  : 14.3%us, 49.7%sy,  0.3%ni, 35.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  : 14.2%us, 49.7%sy,  0.3%ni, 35.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  : 14.6%us, 49.3%sy,  0.0%ni, 36.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  : 14.6%us, 49.3%sy,  0.0%ni, 36.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu8  : 13.3%us, 50.8%sy,  0.0%ni, 35.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu9  : 14.0%us, 50.2%sy,  0.3%ni, 35.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu10 : 14.6%us, 49.8%sy,  0.0%ni, 35.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu11 : 14.9%us, 49.3%sy,  0.0%ni, 35.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu12 : 14.5%us, 49.8%sy,  0.0%ni, 35.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu13 : 14.2%us, 49.7%sy,  0.0%ni, 36.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu14 : 14.6%us, 49.5%sy,  0.0%ni, 35.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu15 : 15.3%us, 49.2%sy,  0.0%ni, 35.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu16 :  0.3%us,  0.3%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu17 :  0.3%us,  0.7%sy,  0.0%ni, 99.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu18 :  0.3%us,  1.3%sy,  0.7%ni, 97.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu19 :  0.3%us,  0.0%sy,  0.3%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu20 :  0.3%us,  0.7%sy,  0.7%ni, 98.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu21 :  0.3%us,  0.3%sy,  0.3%ni, 99.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu22 :  0.0%us,  0.3%sy,  0.7%ni, 99.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu23 :  0.3%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu24 :  1.0%us,  1.4%sy,  0.7%ni, 97.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu25 :  1.0%us,  1.7%sy,  0.3%ni, 97.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu26 :  0.3%us,  1.0%sy,  0.0%ni, 98.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu27 :  0.7%us,  1.0%sy,  0.0%ni, 98.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu28 :  0.3%us,  2.0%sy,  0.0%ni, 97.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu29 :  0.3%us,  1.0%sy,  0.0%ni, 98.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu30 :  0.3%us,  1.4%sy,  1.0%ni, 97.0%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu31 :  0.3%us,  1.0%sy,  0.0%ni, 98.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

With TRAINER_COUNT=1:

****** CPU Information ******
Logical CPU Number  : 32
Physical CPU Number : 2
CPU Core Number     : 8
Hyper Threading(HT) : ON
********** Settings *********
Trainer Count      : 1
OMP_NUM_THREADS    : 32
MKL_NUM_THREADS    : 32

top - 15:27:10 up 437 days,  1:32, 10 users,  load average: 2.46, 3.05, 2.93
Tasks: 613 total,   2 running, 611 sleeping,   0 stopped,   0 zombie
Cpu0  : 91.7%us,  7.9%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu1  : 15.9%us, 40.7%sy,  0.3%ni, 43.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  : 16.2%us, 40.9%sy,  0.0%ni, 42.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  : 15.7%us, 41.0%sy,  0.3%ni, 43.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  : 15.8%us, 40.9%sy,  0.0%ni, 43.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  : 15.9%us, 41.1%sy,  0.0%ni, 43.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  : 15.9%us, 41.4%sy,  0.3%ni, 42.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  : 15.4%us, 41.3%sy,  0.3%ni, 43.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu8  : 15.1%us, 41.8%sy,  0.3%ni, 42.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu9  : 14.8%us, 42.0%sy,  0.3%ni, 43.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu10 : 15.1%us, 41.8%sy,  0.3%ni, 42.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu11 : 14.8%us, 42.4%sy,  0.3%ni, 42.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu12 : 15.6%us, 41.7%sy,  0.0%ni, 42.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu13 : 15.2%us, 41.7%sy,  0.0%ni, 43.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu14 : 15.2%us, 42.1%sy,  0.0%ni, 42.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu15 : 15.9%us, 41.1%sy,  0.0%ni, 43.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu16 :  0.0%us,  0.7%sy,  1.0%ni, 98.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu17 :  0.7%us,  1.0%sy,  0.0%ni, 98.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu18 :  0.3%us,  3.0%sy,  1.3%ni, 95.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu19 :  0.3%us,  1.3%sy,  0.7%ni, 97.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu20 :  0.0%us,  0.3%sy,  0.3%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu21 :  0.0%us,  0.7%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu22 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu23 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu24 :  0.3%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu25 :  0.7%us,  2.3%sy,  0.7%ni, 96.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu26 :  0.3%us,  1.7%sy,  0.3%ni, 97.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu27 :  0.3%us,  1.7%sy,  0.7%ni, 97.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu28 :  0.7%us,  1.0%sy,  0.0%ni, 98.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu29 :  0.3%us,  1.0%sy,  0.0%ni, 98.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu30 :  0.0%us,  0.7%sy,  0.3%ni, 98.7%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu31 :  0.7%us,  1.0%sy,  0.0%ni, 98.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

There is no real difference in CPU usage between the two. Could you help check whether I am using this correctly? Thanks.

tensor-tang commented 6 years ago

It is probably because the variables you set did not take effect.

You can add echo $OMP_NUM_THREADS right before the paddle train command to confirm whether it took effect.

CAOYUHUI commented 6 years ago

@luotao1 Hi, I tried the cpu_configure.sh script. After running the script I started training; on the first attempt, with trainer_count=4, htop showed multiple CPUs in use, so multi-threading seemed to be working. But on subsequent training runs, htop shows that only the same single CPU is occupied. Output of running cpu_configure.sh:

[screenshot: cpu_configure.sh output, 2017-11-16]

htop output:

[screenshot: htop output, 2017-11-16]

Could you help take a look at what is going on? Many thanks!

tensor-tang commented 6 years ago

The echo output inside the script is normal, but what I meant was adding the echo right before your paddle train command. The environment variables in your screenshot do not necessarily take effect in your current script's environment (the top output will be there in any case). My guess is that if you add the echo as follows, it prints nothing:

sh cpu_configure.sh ${TRAINER_COUNT}
echo $OMP_NUM_THREADS
python train.py \

Please confirm this first. Thanks.

CAOYUHUI commented 6 years ago

@tensor-tang Hi, running echo $OMP_NUM_THREADS prints nothing.

tensor-tang commented 6 years ago

Thanks, that confirms the environment variables set inside the script really did not take effect.

You need to use source cpu_configure.sh ${TRAINER_COUNT}. After that, the variable should have a value.

Alternatively, you can build and install the latest paddle: the script's functionality has already been integrated, so you no longer need to configure this yourself.
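Concretely, a sketch of the intended invocation (using the same ${TRAINER_COUNT} placeholder as above):

source cpu_configure.sh ${TRAINER_COUNT}   # source, not sh, so the exported variables persist in the current shell
echo $OMP_NUM_THREADS                      # should now print the value computed by the script
python train.py --num_workers ${TRAINER_COUNT} ...   # then launch training as before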

luotao1 commented 6 years ago

But on subsequent training runs, htop shows that only the same single CPU is occupied.

For those subsequent training runs, was trainer_count still 4? @CAOYUHUI

CAOYUHUI commented 6 years ago

@luotao1 Later I set trainer_count=8. @tensor-tang With source, the echo now prints a value. But training still uses only one core, and the time per batch is the same as before.

luotao1 commented 6 years ago

What version of paddle are you using?

CAOYUHUI commented 6 years ago

@luotao1 The v2 API, installed via pip.

luotao1 commented 6 years ago

Is the pip-installed paddle the latest version, or 0.10.0?

CAOYUHUI commented 6 years ago

@luotao1 It is 0.10.0.

luotao1 commented 6 years ago

But on subsequent training runs, htop shows that only the same single CPU is occupied.

If you change trainer_count, you need to re-run the script. After re-running the script, does htop still show only one CPU in use?

peterzhang2029 commented 6 years ago

Closing due to low activity. Feel free to reopen it.