Open rkotimi opened 2 years ago
Also, a side question: does PaddlePaddle support mixed-precision training on DCU?
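@rkotimi Judging from the error message, it is reporting insufficient shared memory. A couple of questions:
- Are you training on GPU? The log does not print any GPU device, so it looks like training is running on CPU.
- How much shared memory does your machine have?
Also, mixed-precision training on DCU is supported.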
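For the shared-memory question, a minimal sketch of how the size of the shared-memory filesystem could be checked from Python. It assumes a Linux host where POSIX shared memory is mounted at `/dev/shm`; the helper name `shm_usage_mib` is hypothetical and not part of PaddlePaddle.

```python
import os
import shutil


def shm_usage_mib(path="/dev/shm"):
    """Return (total, used, free) of the filesystem at `path`, in MiB."""
    usage = shutil.disk_usage(path)
    mib = 1024 * 1024
    return usage.total // mib, usage.used // mib, usage.free // mib


if __name__ == "__main__":
    # /dev/shm is an assumption about the host; skip the check if it is absent.
    if os.path.isdir("/dev/shm"):
        total, used, free = shm_usage_mib()
        print(f"/dev/shm: total={total} MiB, used={used} MiB, free={free} MiB")
    else:
        print("/dev/shm not found on this system")
```

Inside a container the reported total is often much smaller than on the host, so it is worth running the check in the same environment as the training job.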
Thanks for your reply.
Running `free -m` gives the following output:
              total        used        free      shared  buff/cache   available
Mem:         257581       39613       31850       28317      186117      187894
Swap:          7811        4251        3560
Running `ipcs -m` gives the following output:
------ Shared Memory Segments --------
key shmid owner perms bytes nattch status
0x00000000 65536 root 600 524288 2 dest
0x00000000 98305 root 600 4194304 2 dest
0x00000000 393218 pioneer1_5 600 393216 2 dest
0x00000000 491523 pioneer1_5 600 524288 2 dest
0x00000000 589828 pioneer1_5 600 524288 2 dest
0x00000000 688133 pioneer1_5 600 524288 2 dest
0x00000000 786438 pioneer1_5 600 393216 2 dest
0x00000000 884743 pioneer1_5 600 393216 2 dest
0x00000000 1114120 pioneer1_5 600 393216 2 dest
0x00000000 1146889 pioneer1_5 600 393216 2 dest
0x00000000 1245194 pioneer1_5 600 393216 2 dest
0x00000000 1540107 pioneer1_5 600 2097152 2 dest
0x00000000 1572876 pioneer1_5 600 524288 2 dest
0x00000000 1605645 pioneer1_5 600 393216 2 dest
0x00000000 1180827662 pioneer1_5 600 524288 2 dest
0x00000000 1736719 pioneer1_5 600 524288 2 dest
0x00000000 1835024 pioneer1_5 600 524288 2 dest
0x00000000 1959362577 pioneer1_2 600 393216 2 dest
0x00000000 1959460882 pioneer1_2 600 524288 2 dest
0x00000000 1959559187 pioneer1_2 600 393216 2 dest
0x00000000 1959657492 pioneer1_2 600 524288 2 dest
0x00000000 1959755797 pioneer1_2 600 524288 2 dest
0x00000000 1959788566 pioneer1_2 600 393216 2 dest
0x00000000 1960017943 pioneer1_2 600 393216 2 dest
0x00000000 1960050712 pioneer1_2 600 393216 2 dest
0x00000000 1960083481 pioneer1_2 600 393216 2 dest
0x00000000 169345050 pioneer1_8 600 13967360 2 dest
0x00000000 770277403 pioneer1_4 606 11301120 2 dest
0x00000000 980484124 pioneer1_8 600 524288 2 dest
0x00000000 1960902685 pioneer1_2 600 393216 2 dest
0x00000000 73236510 pioneer1_2 600 20480 2 dest
0x00000000 54984735 pioneer1_4 600 393216 2 dest
0x00000000 6520864 pioneer1_8 600 393216 2 dest
0x00000000 6619169 pioneer1_8 600 524288 2 dest
0x00000000 6717474 pioneer1_8 600 393216 2 dest
0x00000000 6815779 pioneer1_8 600 393216 2 dest
0x00000000 6979620 pioneer1_8 600 524288 2 dest
0x00000000 7012389 pioneer1_8 600 524288 2 dest
0x00000000 7176230 pioneer1_8 600 393216 2 dest
0x00000000 247988263 pioneer1_8 600 393216 2 dest
0x00000000 7307304 pioneer1_8 600 393216 2 dest
0x00000000 104300585 pioneer1_7 600 393216 2 dest
0x00000000 7569450 pioneer1_8 600 524288 2 dest
0x00000000 666173483 pioneer1_5 600 524288 2 dest
0x00000000 8159276 pioneer1_8 600 524288 2 dest
0x00000000 237830189 pioneer1_8 600 294912 2 dest
0x00000000 7897134 pioneer1_8 600 524288 2 dest
0x00000000 8323119 pioneer1_8 600 7168000 2 dest
0x00000000 169312304 pioneer1_8 600 13967360 2 dest
0x00000000 237862961 pioneer1_8 600 294912 2 dest
0x00000000 8290354 pioneer1_8 600 7168000 2 dest
0x00000000 8749107 pioneer1_8 600 524288 2 dest
0x00000000 93356084 pioneer1_8 600 8089600 2 dest
0x00000000 151814197 pioneer1_8 600 237568 2 dest
0x00000000 287342646 pioneer1_8 600 20480 2 dest
0x00000000 9470007 pioneer1_8 600 393216 2 dest
0x00000000 9338936 pioneer1_8 600 1118208 2 dest
0x00000000 980680761 pioneer1_4 600 524288 2 dest
0x00000000 16973882 pioneer1_8 600 393216 2 dest
0x00000000 817135675 pioneer1_2 600 13967360 2 dest
0x00000000 42139708 pioneer1_4 600 393216 2 dest
0x00000000 94634045 pioneer1_8 600 139264 2 dest
0x00000000 19923006 pioneer1_8 600 1118208 2 dest
0x00000000 42238015 pioneer1_4 600 524288 2 dest
0x00000000 42336320 pioneer1_4 600 524288 2 dest
0x00000000 42434625 pioneer1_4 600 524288 2 dest
0x00000000 42598466 pioneer1_4 600 393216 2 dest
0x00000000 42631235 pioneer1_4 600 393216 2 dest
0x00000000 42795076 pioneer1_4 600 393216 2 dest
0x00000000 667680837 pioneer1_4 600 393216 2 dest
0x00000000 42860614 pioneer1_4 600 393216 2 dest
0x00000000 770244679 pioneer1_4 606 11301120 2 dest
0x00000000 813465672 pioneer1_4 600 393216 2 dest
0x00000000 43352137 pioneer1_4 600 393216 2 dest
0x00000000 43974730 pioneer1_4 600 524288 2 dest
0x00000000 43548747 pioneer1_4 600 524288 2 dest
0x00000000 43647052 pioneer1_4 600 524288 2 dest
0x00000000 1961132109 pioneer1_2 600 524288 2 dest
0x00000000 899022926 pioneer1_4 600 393216 2 dest
0x00000000 169279567 pioneer1_8 600 4194304 2 dest
0x00000000 44073040 pioneer1_4 606 2880000 2 dest
0x00000000 44105809 pioneer1_4 606 2880000 2 dest
0x00000000 55017554 pioneer1_4 600 524288 2 dest
0x00000000 44597331 pioneer1_4 600 524288 2 dest
0x00000000 238518356 pioneer1_8 600 524288 2 dest
0x00000000 88735829 pioneer1_7 600 524288 2 dest
0x00000000 316145750 amax 600 524288 2 dest
0x00000000 241926231 pioneer1_8 600 524288 2 dest
0x00000000 102400088 pioneer1_7 600 393216 2 dest
0x00000000 75333721 amax 600 393216 2 dest
0x00000000 75432026 amax 600 524288 2 dest
0x00000000 75530331 amax 600 524288 2 dest
0x00000000 75628636 amax 600 524288 2 dest
0x00000000 75726941 amax 600 393216 2 dest
0x00000000 75890782 amax 600 393216 2 dest
0x00000000 75989087 amax 600 393216 2 dest
0x00000000 76021856 amax 600 393216 2 dest
0x00000000 76054625 amax 600 393216 2 dest
0x00000000 77693026 amax 600 524288 2 dest
0x00000000 76382307 amax 600 393216 2 dest
0x00000000 76415076 amax 600 524288 2 dest
0x00000000 76873829 amax 600 524288 2 dest
0x00000000 76972134 amax 600 393216 2 dest
0x00000000 77922407 amax 606 10840320 2 dest
0x00000000 76841064 amax 600 524288 2 dest
0x00000000 77070441 amax 600 524288 2 dest
0x00000000 77201514 amax 600 524288 2 dest
0x00000000 77791339 amax 600 524288 2 dest
0x00000000 77955180 amax 606 10840320 2 dest
0x00000000 77987949 amax 606 2880000 2 dest
0x00000000 77889646 amax 600 2097152 2 dest
0x00000000 78020719 amax 606 2880000 2 dest
0x00000000 1961164912 pioneer1_2 600 393216 2 dest
0x00000000 287375473 pioneer1_8 600 20480 2 dest
0x00000000 1960706162 pioneer1_2 600 524288 2 dest
0x00000000 78446707 amax 600 393216 2 dest
0x00000000 93323380 pioneer1_8 600 8089600 2 dest
0x00000000 1960738933 pioneer1_2 600 393216 2 dest
0x00000000 1961197686 pioneer1_2 600 524288 2 dest
0x00000000 238551159 pioneer1_8 606 4718592 2 dest
0x00000000 93454456 pioneer1_8 600 90112 2 dest
0x00000000 93487225 pioneer1_8 600 90112 2 dest
0x00000000 1554153594 pioneer1_8 600 61440 2 dest
0x00000000 93847675 pioneer1_8 600 53248 2 dest
0x00000000 238583932 pioneer1_8 606 4718592 2 dest
0x00000000 93880445 pioneer1_8 600 53248 2 dest
0x00000000 102498430 pioneer1_7 600 524288 2 dest
0x00000000 102596735 pioneer1_7 600 393216 2 dest
0x00000000 102695040 pioneer1_7 600 524288 2 dest
0x00000000 94601345 pioneer1_8 600 139264 2 dest
0x00000000 102793346 pioneer1_7 600 524288 2 dest
0x00000000 151781507 pioneer1_8 600 237568 2 dest
0x00000000 102957188 pioneer1_7 600 393216 2 dest
0x00000000 102989957 pioneer1_7 600 393216 2 dest
0x00000000 103088262 pioneer1_7 600 393216 2 dest
0x00000000 238616711 pioneer1_8 606 2880000 2 dest
0x00000000 103219336 pioneer1_7 600 393216 2 dest
0x00000000 238649481 pioneer1_8 606 2880000 2 dest
0x00000000 2039611530 pioneer1_5 600 524288 2 dest
0x00000000 820412555 pioneer1_2 600 151552 2 dest
0x00000000 103678092 pioneer1_7 600 524288 2 dest
0x00000000 103841933 pioneer1_7 600 524288 2 dest
0x00000000 103874702 pioneer1_7 600 524288 2 dest
0x00000000 1180565647 pioneer1_5 600 393216 2 dest
0x00000000 292814992 pioneer1_8 600 229376 2 dest
0x00000000 288194705 pioneer1_8 600 425984 2 dest
0x00000000 288227474 pioneer1_8 600 425984 2 dest
0x00000000 248250515 pioneer1_8 600 393216 2 dest
0x00000000 74907796 pioneer1_2 600 139264 2 dest
0x00000000 2054324373 pioneer1_4 600 1048576 2 dest
0x00000000 1961296022 pioneer1_2 600 4194304 2 dest
0x00000000 1915093143 pioneer1_4 600 524288 2 dest
0x00000000 819363992 pioneer1_7 600 151552 2 dest
0x00000000 288489625 pioneer1_8 600 45056 2 dest
0x00000000 930185370 pioneer1_7 600 524288 2 dest
0x00000000 288456859 pioneer1_8 600 45056 2 dest
0x00000000 292782236 pioneer1_8 600 229376 2 dest
0x00000000 103088285 pioneer1_2 600 524288 2 dest
0x00000000 172163230 pioneer1_8 600 524288 2 dest
0x00000000 899154079 pioneer1_4 600 393216 2 dest
0x00000000 173441184 pioneer1_7 600 24576 2 dest
0x00000000 173768865 pioneer1_8 600 524288 2 dest
0x00000000 164495522 pioneer1_8 600 393216 2 dest
0x00000000 629833891 pioneer1_4 600 393216 2 dest
0x00000000 667943076 pioneer1_4 600 393216 2 dest
0x00000000 1198882981 pioneer1_5 600 1048576 2 dest
0x00000000 511049894 pioneer1_7 600 16384 2 dest
0x00000000 819396775 pioneer1_7 600 151552 2 dest
0x00000000 980582568 pioneer1_7 600 524288 2 dest
0x00000000 548372649 pioneer1_4 600 393216 2 dest
0x00000000 225083562 pioneer1_4 600 16777216 2 dest
0x00000000 88834219 pioneer1_7 600 4194304 2 dest
0x00000000 173473964 pioneer1_7 600 24576 2 dest
0x00000000 174129325 pioneer1_7 600 139264 2 dest
0x00000000 316178606 amax 600 4194304 2 dest
0x00000000 199491759 pioneer1_4 600 4194304 2 dest
0x00000000 820445360 pioneer1_2 600 151552 2 dest
0x00000000 373162161 pioneer1_7 600 13967360 2 dest
0x00000000 1963819186 pioneer1_2 600 524288 2 dest
0x00000000 1581711539 pioneer1_8 600 225280 2 dest
0x00000000 1538621620 pioneer1_8 600 24576 2 dest
0x00000000 2114584757 pioneer1_5 600 46312 2 dest
0x00000000 1358725302 pioneer1_8 600 393216 2 dest
0x00000000 1581678775 pioneer1_8 600 225280 2 dest
0x00000000 1558151352 pioneer1_8 600 151552 2 dest
0x00000000 233963705 pioneer1_8 600 18128 2 dest
0x00000000 1538654394 pioneer1_8 600 24576 2 dest
0x00000000 73269435 pioneer1_2 600 20480 2 dest
0x00000000 1557266620 pioneer1_8 600 28672 2 dest
0x00000000 1554186429 pioneer1_8 600 61440 2 dest
0x00000000 169705662 pioneer1_8 600 16384 2 dest
0x00000000 1555628223 pioneer1_8 600 1118208 2 dest
0x00000000 1555660992 pioneer1_8 600 1118208 2 dest
0x00000000 1557299393 pioneer1_8 600 28672 2 dest
0x00000000 1558118594 pioneer1_8 600 151552 2 dest
0x00000000 174162115 pioneer1_7 600 139264 2 dest
0x00000000 73400516 pioneer1_2 600 32768 2 dest
0x00000000 73433285 pioneer1_2 600 32768 2 dest
0x00000000 73564358 pioneer1_2 600 417792 2 dest
0x00000000 73597127 pioneer1_2 600 425984 2 dest
0x00000000 548405448 pioneer1_4 600 12288 2 dest
0x00000000 548438217 pioneer1_4 600 393216 2 dest
0x00000000 174817482 pioneer1_7 600 139264 2 dest
0x00000000 74875083 pioneer1_2 600 139264 2 dest
0x00000000 95158476 pioneer1_2 600 53248 2 dest
0x00000000 91324621 pioneer1_2 600 110592 2 dest
0x00000000 169083086 pioneer1_8 600 180224 2 dest
0x00000000 168886479 pioneer1_8 600 425984 2 dest
0x00000000 511082704 pioneer1_7 600 16384 2 dest
0x00000000 548536529 pioneer1_4 600 524288 2 dest
0x00000000 813433042 pioneer1_4 600 524288 2 dest
0x00000000 169672915 pioneer1_8 600 16384 2 dest
0x00000000 95191252 pioneer1_2 600 53248 2 dest
0x00000000 168919253 pioneer1_8 600 425984 2 dest
0x00000000 89325782 pioneer1_2 600 106496 2 dest
0x00000000 89358551 pioneer1_2 600 106496 2 dest
0x00000000 169050328 pioneer1_8 600 180224 2 dest
0x00000000 91291865 pioneer1_2 600 110592 2 dest
0x00000000 979828954 pioneer1_2 600 62464 2 dest
0x00000000 174784731 pioneer1_7 600 139264 2 dest
0x00000000 980779228 pioneer1_5 600 524288 2 dest
0x00000000 813564125 pioneer1_4 600 393216 2 dest
0x00000000 821526750 pioneer1_2 600 524288 2 dest
0x00000000 980386015 amax 600 524288 2 dest
0x00000000 509608160 pioneer1_2 600 24576 2 dest
0x00000000 509640929 pioneer1_2 600 24576 2 dest
0x00000000 817201378 pioneer1_2 600 13967360 2 dest
0x00000000 819069155 pioneer1_7 600 13967360 2 dest
0x00000000 965050597 pioneer1_2 600 524288 2 dest
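A listing this long is easier to read once it is totalled per owner. The following is a minimal sketch that sums the `bytes` column of `ipcs -m` output; the helper name `shm_bytes_by_owner` is hypothetical, and the sample text is just the first three rows of the listing above. Note that `ipcs -m` only lists System V shared memory segments, so POSIX shared memory files under `/dev/shm` do not appear in it.

```python
from collections import defaultdict


def shm_bytes_by_owner(ipcs_text):
    """Sum the `bytes` column of `ipcs -m` output per owner."""
    totals = defaultdict(int)
    for line in ipcs_text.splitlines():
        fields = line.split()
        # Data rows start with a hex key such as 0x00000000; skip header lines.
        if len(fields) < 5 or not fields[0].startswith("0x"):
            continue
        owner, size = fields[2], int(fields[4])
        totals[owner] += size
    return dict(totals)


SAMPLE = """\
------ Shared Memory Segments --------
key shmid owner perms bytes nattch status
0x00000000 65536 root 600 524288 2 dest
0x00000000 98305 root 600 4194304 2 dest
0x00000000 393218 pioneer1_5 600 393216 2 dest
"""

if __name__ == "__main__":
    for owner, total in sorted(shm_bytes_by_owner(SAMPLE).items()):
        print(f"{owner}: {total / 2**20:.2f} MiB")
```

To run it on a live system, the text could be obtained with `subprocess.run(["ipcs", "-m"], capture_output=True, text=True).stdout` and passed to the same function.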
@rkotimi Could you try running `df -h` and post the output?
I hit the same problem whenever num_workers > 0; training only runs normally with num_workers=0.
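As an illustration of the workaround described here, a minimal sketch that falls back to single-process loading when little shared memory is free. The function name `loader_kwargs` and the 1 GiB threshold are hypothetical choices for illustration; `num_workers` and `use_shared_memory` are parameters of `paddle.io.DataLoader`. The free-space figure could come from `shutil.disk_usage("/dev/shm").free`, assuming `/dev/shm` exists on the host.

```python
def loader_kwargs(shm_free_bytes, wanted_workers=4, min_free_bytes=1 << 30):
    """Pick DataLoader keyword arguments from the free shared memory.

    The threshold is an arbitrary illustrative value, not a Paddle default.
    """
    if shm_free_bytes < min_free_bytes:
        # Too little shared memory: load data in the main process,
        # which is the setting reported to work in this thread.
        return {"num_workers": 0}
    # Enough room: use worker processes that exchange batches via shared memory.
    return {"num_workers": wanted_workers, "use_shared_memory": True}


if __name__ == "__main__":
    print(loader_kwargs(64 * 1024 * 1024))
    print(loader_kwargs(8 << 30))
```

The result would be passed on as `paddle.io.DataLoader(dataset, **loader_kwargs(free_bytes))`. Keeping worker processes but setting `use_shared_memory=False` might be another option to try; whether it avoids this particular error has not been confirmed in this thread.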
@Luna2199 Could you please post your run command and the error log?
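I ran Paddle3D's SMOKE model on another dataset and wrote a dataset class modeled on KittiDetDataset. Partway through the run, it failed with a very strange error. With num_workers set to 0, the error does not occur.
After a long time I traced it to a specific cause: the error occurs whenever load_annotation returns np.array([]). I also verified that commenting out lines 172 to 176 of KittiDetDataset produces a similar error.
I believe this is a bug in Paddle3D or PaddlePaddle, and I hope a fix can be provided.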
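On the empty-annotation case: `np.array([])` has shape `(0,)` and dtype `float64`, which differs from the 2-D arrays returned for samples that do contain objects. Whether that mismatch is what triggers the multi-worker error is an assumption, not something confirmed in this thread. A minimal sketch of returning an empty array with a consistent shape instead follows; the helper name `normalize_boxes` and the field count of 7 are hypothetical and would need to match the actual annotation layout.

```python
import numpy as np


def normalize_boxes(boxes, num_fields=7, dtype=np.float32):
    """Return annotations as a 2-D array of shape (N, num_fields), even when N == 0.

    num_fields=7 is a placeholder; use the real number of values per object.
    """
    arr = np.asarray(boxes, dtype=dtype)
    if arr.size == 0:
        # Keep the second dimension so empty and non-empty samples agree in rank.
        return np.zeros((0, num_fields), dtype=dtype)
    return arr.reshape(-1, num_fields)


if __name__ == "__main__":
    print(np.array([]).shape, np.array([]).dtype)          # (0,) float64
    print(normalize_boxes([]).shape)                       # (0, 7)
    print(normalize_boxes([[1, 2, 3, 4, 5, 6, 7]]).shape)  # (1, 7)
```

Another option, if samples without objects are not needed for training, is to filter them out when the dataset index is built so that `load_annotation` never returns an empty array.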
Command:
Error message: