PaddlePaddle / Paddle

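PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (core framework of 『飞桨』 (PaddlePaddle): high-performance single-machine and distributed training and cross-platform deployment for deep learning & machine learning)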
http://www.paddlepaddle.org/
Apache License 2.0
22.27k stars 5.6k forks

eager_generator': corrupted double-linked list: 0x0000000006ee2200 *** #61834

Open WanwanLinLin opened 9 months ago

WanwanLinLin commented 9 months ago

Issue Description

I am trying to build locally on CentOS 7.9. My ldd --version reports 2.17, the version that ships with the system. These are my configure and build commands:

cmake .. -DPY_VERSION=3.9 -DWITH_GPU=OFF -DWITH_NCCL=OFF -DWITH_MKLDNN=OFF \
    -DWITH_RCCL=OFF -DCMAKE_INSTALL_PREFIX=/home/cproject/Paddle/install

make -j1

But every time the build fails with this error:

*** Error in `/home/cproject/Paddle/build/paddle/fluid/eager/auto_code_generator/eager_generator': corrupted double-linked list: 0x0000000006ee2200 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x8097f)[0x7f922a79f97f]
/lib64/libc.so.6(+0x8120e)[0x7f922a7a020e]
/home/cproject/Paddle/build/paddle/phi/libphi.so(_ZN3phi13KernelFactoryD1Ev+0x18a)[0x7f922cafcaca]
/lib64/libc.so.6(__cxa_finalize+0x9a)[0x7f922a75905a]
/home/cproject/Paddle/build/paddle/phi/libphi.so(+0xce3707)[0x7f922c5e9707]
======= Memory map: ========
00400000-004c1000 r--p 00000000 fd:02 79165216 /home/cproject/Paddle/build/paddle/fluid/eager/auto_code_generator/eager_generator
004c1000-048e2000 r-xp 000c1000 fd:02 79165216 /home/cproject/Paddle/build/paddle/fluid/eager/auto_code_generator/eager_generator
048e2000-05433000 r--p 044e2000 fd:02 79165216 /home/cproject/Paddle/build/paddle/fluid/eager/auto_code_generator/eager_generator
05434000-054e6000 r--p 05033000 fd:02 79165216 /home/cproject/Paddle/build/paddle/fluid/eager/auto_code_generator/eager_generator
054e6000-05520000 rw-p 050e5000 fd:02 79165216 /home/cproject/Paddle/build/paddle/fluid/eager/auto_code_generator/eager_generator
05520000-05557000 rw-p 00000000 00:00 0
06b55000-087c4000 rw-p 00000000 00:00 0 [heap]
7f9224000000-7f9224021000 rw-p 00000000 00:00 0
7f9224021000-7f9228000000 ---p 00000000 00:00 0
7f922a4f9000-7f922a51e000 r-xp 00000000 fd:00 33555743 /usr/lib64/libgomp.so.1.0.0
7f922a51e000-7f922a71d000 ---p 00025000 fd:00 33555743 /usr/lib64/libgomp.so.1.0.0
7f922a71d000-7f922a71e000 r--p 00024000 fd:00 33555743 /usr/lib64/libgomp.so.1.0.0
7f922a71e000-7f922a71f000 rw-p 00025000 fd:00 33555743 /usr/lib64/libgomp.so.1.0.0
7f922a71f000-7f922a8e3000 r-xp 00000000 fd:00 33555275 /usr/lib64/libc-2.17.so
7f922a8e3000-7f922aae2000 ---p 001c4000 fd:00 33555275 /usr/lib64/libc-2.17.so
7f922aae2000-7f922aae6000 r--p 001c3000 fd:00 33555275 /usr/lib64/libc-2.17.so
7f922aae6000-7f922aae8000 rw-p 001c7000 fd:00 33555275 /usr/lib64/libc-2.17.so
7f922aae8000-7f922aaed000 rw-p 00000000 00:00 0
7f922aaed000-7f922ab02000 r-xp 00000000 fd:00 33554508 /usr/lib64/libgcc_s-4.8.5-20150702.so.1
7f922ab02000-7f922ad01000 ---p 00015000 fd:00 33554508 /usr/lib64/libgcc_s-4.8.5-20150702.so.1
7f922ad01000-7f922ad02000 r--p 00014000 fd:00 33554508 /usr/lib64/libgcc_s-4.8.5-20150702.so.1
7f922ad02000-7f922ad03000 rw-p 00015000 fd:00 33554508 /usr/lib64/libgcc_s-4.8.5-20150702.so.1
7f922ad03000-7f922ae04000 r-xp 00000000 fd:00 33555286 /usr/lib64/libm-2.17.so
7f922ae04000-7f922b003000 ---p 00101000 fd:00 33555286 /usr/lib64/libm-2.17.so
7f922b003000-7f922b004000 r--p 00100000 fd:00 33555286 /usr/lib64/libm-2.17.so
7f922b004000-7f922b005000 rw-p 00101000 fd:00 33555286 /usr/lib64/libm-2.17.so
7f922b005000-7f922b0ee000 r-xp 00000000 fd:00 33555391 /usr/lib64/libstdc++.so.6.0.19
7f922b0ee000-7f922b2ee000 ---p 000e9000 fd:00 33555391 /usr/lib64/libstdc++.so.6.0.19
7f922b2ee000-7f922b2f6000 r--p 000e9000 fd:00 33555391 /usr/lib64/libstdc++.so.6.0.19
7f922b2f6000-7f922b2f8000 rw-p 000f1000 fd:00 33555391 /usr/lib64/libstdc++.so.6.0.19
7f922b2f8000-7f922b30d000 rw-p 00000000 00:00 0
7f922b30d000-7f922b30f000 r-xp 00000000 fd:00 33555283 /usr/lib64/libdl-2.17.so
7f922b30f000-7f922b50f000 ---p 00002000 fd:00 33555283 /usr/lib64/libdl-2.17.so
7f922b50f000-7f922b510000 r--p 00002000 fd:00 33555283 /usr/lib64/libdl-2.17.so
7f922b510000-7f922b511000 rw-p 00003000 fd:00 33555283 /usr/lib64/libdl-2.17.so
7f922b511000-7f922b6cb000 r-xp 00000000 fd:02 211392241 /home/cproject/Paddle/build/third_party/install/mklml/lib/libiomp5.so
7f922b6cb000-7f922b8ca000 ---p 001ba000 fd:02 211392241 /home/cproject/Paddle/build/third_party/install/mklml/lib/libiomp5.so
7f922b8ca000-7f922b8cd000 r--p 001b9000 fd:02 211392241 /home/cproject/Paddle/build/third_party/install/mklml/lib/libiomp5.so
7f922b8cd000-7f922b8d7000 rw-p 001bc000 fd:02 211392241 /home/cproject/Paddle/build/third_party/install/mklml/lib/libiomp5.so
7f922b8d7000-7f922b906000 rw-p 00000000 00:00 0
7f922b906000-7f922c17e000 r--p 00000000 fd:02 84662122 /home/cproject/Paddle/build/paddle/phi/libphi.so
7f922c17e000-7f922f9ed000 r-xp 00878000 fd:02 84662122 /home/cproject/Paddle/build/paddle/phi/libphi.so
7f922f9ed000-7f922ff02000 r--p 040e7000 fd:02 84662122 /home/cproject/Paddle/build/paddle/phi/libphi.so
7f922ff02000-7f922ff34000 r--p 045fb000 fd:02 84662122 /home/cproject/Paddle/build/paddle/phi/libphi.so
7f922ff34000-7f922ff6d000 rw-p 0462d000 fd:02 84662122 /home/cproject/Paddle/build/paddle/phi/libphi.so
7f922ff6d000-7f922ffd9000 rw-p 00000000 00:00 0
7f922ffd9000-7f922ffe0000 r-xp 00000000 fd:00 33555312 /usr/lib64/librt-2.17.so
7f922ffe0000-7f92301df000 ---p 00007000 fd:00 33555312 /usr/lib64/librt-2.17.so
7f92301df000-7f92301e0000 r--p 00006000 fd:00 33555312 /usr/lib64/librt-2.17.so
7f92301e0000-7f92301e1000 rw-p 00007000 fd:00 33555312 /usr/lib64/librt-2.17.so
7f92301e1000-7f92301f8000 r-xp 00000000 fd:00 33555307 /usr/lib64/libpthread-2.17.so
7f92301f8000-7f92303f7000 ---p 00017000 fd:00 33555307 /usr/lib64/libpthread-2.17.so
7f92303f7000-7f92303f8000 r--p 00016000 fd:00 33555307 /usr/lib64/libpthread-2.17.so
Subprocess aborted
make[2]: *** [paddle/fluid/eager/auto_code_generator/CMakeFiles/legacy_eager_codegen] Error 1
make[1]: *** [paddle/fluid/eager/auto_code_generator/CMakeFiles/legacy_eager_codegen.dir/all] Error 2
make: *** [all] Error 2
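
The mangled frame in the backtrace above is easier to read once demangled. What follows is a rough Python sketch, not part of the original report, that pulls the frames out of a glibc abort report and demangles only the simplest Itanium C++ names (nested names ending in a constructor or destructor). `parse_backtrace` and `demangle_simple` are hypothetical helper names chosen for illustration; for anything beyond this narrow case, `c++filt` is the appropriate tool.

```python
import re

# A glibc backtrace frame has the shape: /path/to/object(symbol+0xoff)[0xaddr]
FRAME_RE = re.compile(
    r"(?P<obj>/\S+?)\((?P<sym>[^)+]*)\+(?P<off>0x[0-9a-fA-F]+)\)\[0x[0-9a-fA-F]+\]"
)

# The five frames from the report, flattened onto one line as in the post.
SAMPLE = (
    "/lib64/libc.so.6(+0x8097f)[0x7f922a79f97f] "
    "/lib64/libc.so.6(+0x8120e)[0x7f922a7a020e] "
    "/home/cproject/Paddle/build/paddle/phi/libphi.so"
    "(_ZN3phi13KernelFactoryD1Ev+0x18a)[0x7f922cafcaca] "
    "/lib64/libc.so.6(__cxa_finalize+0x9a)[0x7f922a75905a] "
    "/home/cproject/Paddle/build/paddle/phi/libphi.so(+0xce3707)[0x7f922c5e9707]"
)


def demangle_simple(symbol):
    """Demangle only names of the form _ZN<len><id>...<C|D><n>E...

    Anything else (C symbols, templates, operators) is returned unchanged.
    """
    if not symbol.startswith("_ZN"):
        return symbol
    pos, parts = 3, []
    # Read the length-prefixed identifiers that make up the nested name.
    while pos < len(symbol) and symbol[pos].isdigit():
        end = pos
        while end < len(symbol) and symbol[end].isdigit():
            end += 1
        length = int(symbol[pos:end])
        parts.append(symbol[end:end + length])
        pos = end + length
    if not parts:
        return symbol
    tag = symbol[pos:pos + 2]
    if tag in ("D0", "D1", "D2"):      # destructor variants
        parts.append("~" + parts[-1])
    elif tag in ("C1", "C2", "C3"):    # constructor variants
        parts.append(parts[-1])
    return "::".join(parts)


def parse_backtrace(text):
    """Return (object file name, readable symbol or '??', offset) per frame."""
    frames = []
    for match in FRAME_RE.finditer(text):
        obj = match.group("obj").rsplit("/", 1)[-1]
        sym = demangle_simple(match.group("sym")) or "??"
        frames.append((obj, sym, match.group("off")))
    return frames


if __name__ == "__main__":
    for obj, sym, off in parse_backtrace(SAMPLE):
        print(f"{obj:12} {sym} +{off}")
```

Run on the five frames above, the third frame resolves to the destructor phi::KernelFactory::~KernelFactory in libphi.so, called from __cxa_finalize. That suggests the abort happens while libphi.so's static objects are being torn down at process exit, rather than during code generation itself.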

Version & Environment Information

CPU version

WanwanLinLin commented 9 months ago

I am building the develop branch of Paddle, following the official tutorial.

risemeup1 commented 9 months ago

What about make -j10?

WanwanLinLin commented 9 months ago

What about make -j10?

Same result.

WanwanLinLin commented 9 months ago

If I set -DWITH_PYTHON=OFF, the build succeeds.

risemeup1 commented 8 months ago

The build succeeds on our CentOS machines. Are you building inside Docker?

TimeYWL commented 8 months ago

I ran into the same problem building on CentOS 7.6 + ROCm.
Environment: Python 3.8, gcc 11.2/gcc 7.3, dtk 23.10.
Analyzing eager_generator with valgrind gives the following:

==9562== Invalid read of size 4
==9562==    at 0x33AFE56D: ??? (in /usr/lib64/libstdc++.so.6.0.19)
==9562==    by 0x33B60E22: std::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string() (in /usr/lib64/libstdc++.so.6.0.19)
==9562==    by 0x34523059: __cxa_finalize (in /usr/lib64/libc-2.17.so)
==9562==    by 0x16B2BA26: ??? (in /workspace/Paddle-2.6.0/build/paddle/phi/libphi.so)
==9562==    by 0xAAB1089: _dl_fini (in /usr/lib64/ld-2.17.so)
==9562==    by 0x34522CE8: __run_exit_handlers (in /usr/lib64/libc-2.17.so)
==9562==    by 0x34522D36: exit (in /usr/lib64/libc-2.17.so)
==9562==    by 0x3450B55B: (below main) (in /usr/lib64/libc-2.17.so)
==9562==  Address 0x35da3030 is 16 bytes inside a block of size 34 free'd
==9562==    at 0xB6CC51D: operator delete(void*) (vg_replace_malloc.c:586)
==9562==    by 0x33B60E22: std::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string() (in /usr/lib64/libstdc++.so.6.0.19)
==9562==    by 0x34522CE8: __run_exit_handlers (in /usr/lib64/libc-2.17.so)
==9562==    by 0x34522D36: exit (in /usr/lib64/libc-2.17.so)
==9562==    by 0x3450B55B: (below main) (in /usr/lib64/libc-2.17.so)
==9562==  Block was alloc'd at
==9562==    at 0xB6CB593: operator new(unsigned long) (vg_replace_malloc.c:344)
==9562==    by 0x33B60CD8: std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&) (in /usr/lib64/libstdc++.so.6.0.19)
==9562==    by 0x114652B: char* std::string::_S_construct<char const*>(char const*, char const*, std::allocator<char> const&, std::forward_iterator_tag) (basic_string.tcc:610)
==9562==    by 0x113B053: char* std::string::_S_construct_aux<char const*>(char const*, char const*, std::allocator<char> const&, std::__false_type) (basic_string.h:5180)
==9562==    by 0x112CD50: char* std::string::_S_construct<char const*>(char const*, char const*, std::allocator<char> const&) (basic_string.h:5201)
==9562==    by 0x11213FB: std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string<std::allocator<char> >(char const*, std::allocator<char> const&) (basic_string.h:3663)
==9562==    by 0xE5179F: _GLOBAL__sub_I_logging.cc (in /workspace/Paddle-2.6.0/build/paddle/fluid/eager/auto_code_generator/eager_generator)
==9562==    by 0x5C1677C: __libc_csu_init (in /workspace/Paddle-2.6.0/build/paddle/fluid/eager/auto_code_generator/eager_generator)
==9562==    by 0x3450B4E4: (below main) (in /usr/lib64/libc-2.17.so)
==9562==
==9562== Invalid free() / delete / delete[] / realloc()
==9562==    at 0xB6CC51D: operator delete(void*) (vg_replace_malloc.c:586)
==9562==    by 0x33B60E22: std::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string() (in /usr/lib64/libstdc++.so.6.0.19)
==9562==    by 0x34523059: __cxa_finalize (in /usr/lib64/libc-2.17.so)
==9562==    by 0x16B2BA26: ??? (in /workspace/Paddle-2.6.0/build/paddle/phi/libphi.so)
==9562==    by 0xAAB1089: _dl_fini (in /usr/lib64/ld-2.17.so)
==9562==    by 0x34522CE8: __run_exit_handlers (in /usr/lib64/libc-2.17.so)
==9562==    by 0x34522D36: exit (in /usr/lib64/libc-2.17.so)
==9562==    by 0x3450B55B: (below main) (in /usr/lib64/libc-2.17.so)
==9562==  Address 0x35da3020 is 0 bytes inside a block of size 34 free'd
==9562==    at 0xB6CC51D: operator delete(void*) (vg_replace_malloc.c:586)
==9562==    by 0x33B60E22: std::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string() (in /usr/lib64/libstdc++.so.6.0.19)
==9562==    by 0x34522CE8: __run_exit_handlers (in /usr/lib64/libc-2.17.so)
==9562==    by 0x34522D36: exit (in /usr/lib64/libc-2.17.so)
==9562==    by 0x3450B55B: (below main) (in /usr/lib64/libc-2.17.so)
==9562==  Block was alloc'd at
==9562==    at 0xB6CB593: operator new(unsigned long) (vg_replace_malloc.c:344)
==9562==    by 0x33B60CD8: std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&) (in /usr/lib64/libstdc++.so.6.0.19)
==9562==    by 0x114652B: char* std::string::_S_construct<char const*>(char const*, char const*, std::allocator<char> const&, std::forward_iterator_tag) (basic_string.tcc:610)
==9562==    by 0x113B053: char* std::string::_S_construct_aux<char const*>(char const*, char const*, std::allocator<char> const&, std::__false_type) (basic_string.h:5180)
==9562==    by 0x112CD50: char* std::string::_S_construct<char const*>(char const*, char const*, std::allocator<char> const&) (basic_string.h:5201)
==9562==    by 0x11213FB: std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string<std::allocator<char> >(char const*, std::allocator<char> const&) (basic_string.h:3663)
==9562==    by 0xE5179F: _GLOBAL__sub_I_logging.cc (in /workspace/Paddle-2.6.0/build/paddle/fluid/eager/auto_code_generator/eager_generator)
==9562==    by 0x5C1677C: __libc_csu_init (in /workspace/Paddle-2.6.0/build/paddle/fluid/eager/auto_code_generator/eager_generator)
==9562==    by 0x3450B4E4: (below main) (in /usr/lib64/libc-2.17.so)
==9562==
==9562==
==9562== HEAP SUMMARY:
==9562==     in use at exit: 7,212,792 bytes in 112,575 blocks
==9562==   total heap usage: 746,270 allocs, 633,696 frees, 193,979,195 bytes allocated
==9562==
==9562== LEAK SUMMARY:
==9562==    definitely lost: 869,834 bytes in 8,866 blocks
==9562==    indirectly lost: 4,778,202 bytes in 98,973 blocks
==9562==      possibly lost: 8,575 bytes in 122 blocks
==9562==    still reachable: 1,556,181 bytes in 4,614 blocks
==9562==                       of which reachable via heuristic:
==9562==                         stdstring          : 65,309 bytes in 1,029 blocks
==9562==                         newarray           : 3,080 bytes in 1 blocks
==9562==         suppressed: 0 bytes in 0 blocks
==9562== Rerun with --leak-check=full to see details of leaked memory
==9562==
==9562== For lists of detected and suppressed errors, rerun with: -s
==9562== ERROR SUMMARY: 4 errors from 2 contexts (suppressed: 0 from 0)
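
To make the valgrind report above easier to scan, here is a minimal Python sketch, not something posted in the thread, that splits a Memcheck log into error contexts and records the function names in each of its three stacks: the faulting access, the earlier free, and the allocation. It assumes one frame per line, as valgrind prints it. `parse_errors` and `destroyed_twice_at_exit` are hypothetical names, and `SAMPLE` is a trimmed excerpt of the first error with the symbol names written out in full.

```python
import re

# A Memcheck frame line, after the "==PID==" prefix is removed and stripped:
#   at|by 0xADDR: function (in /path/to/object)   or   function (file.c:123)
FRAME_RE = re.compile(r"^(?:at|by) 0x[0-9A-Fa-f]+: (.+?) \([^()]*\)$")

SAMPLE = """\
==9562== Invalid read of size 4
==9562==    at 0x33AFE56D: ??? (in /usr/lib64/libstdc++.so.6.0.19)
==9562==    by 0x34523059: __cxa_finalize (in /usr/lib64/libc-2.17.so)
==9562==    by 0x16B2BA26: ??? (in /workspace/Paddle-2.6.0/build/paddle/phi/libphi.so)
==9562==    by 0xAAB1089: _dl_fini (in /usr/lib64/ld-2.17.so)
==9562==    by 0x34522CE8: __run_exit_handlers (in /usr/lib64/libc-2.17.so)
==9562==  Address 0x35da3030 is 16 bytes inside a block of size 34 free'd
==9562==    at 0xB6CC51D: operator delete(void*) (vg_replace_malloc.c:586)
==9562==    by 0x34522CE8: __run_exit_handlers (in /usr/lib64/libc-2.17.so)
==9562==    by 0x34522D36: exit (in /usr/lib64/libc-2.17.so)
==9562==  Block was alloc'd at
==9562==    at 0xB6CB593: operator new(unsigned long) (vg_replace_malloc.c:344)
==9562==    by 0xE5179F: _GLOBAL__sub_I_logging.cc (in /workspace/Paddle-2.6.0/build/paddle/fluid/eager/auto_code_generator/eager_generator)
"""


def parse_errors(log):
    """Group frames by error context and by stack (access, free, allocation)."""
    errors, current, section = [], None, "stack"
    for line in log.splitlines():
        body = re.sub(r"^==\d+== ?", "", line)
        if body.startswith("Invalid "):
            # Start of a new error context, e.g. "Invalid read of size 4".
            current = {"kind": body, "stack": [], "freed": [], "allocated": []}
            errors.append(current)
            section = "stack"
        elif current is None:
            continue
        elif "free'd" in body:
            section = "freed"
        elif "alloc'd" in body:
            section = "allocated"
        else:
            match = FRAME_RE.match(body.strip())
            if match:
                current[section].append(match.group(1))
    return errors


def destroyed_twice_at_exit(error):
    """True when the block was freed by ordinary exit handlers and is then
    touched again through __cxa_finalize (a shared library being finalized)."""
    return (
        "__cxa_finalize" in error["stack"]
        and "__run_exit_handlers" in error["freed"]
        and "__cxa_finalize" not in error["freed"]
    )


if __name__ == "__main__":
    for err in parse_errors(SAMPLE):
        print(err["kind"], "| destroyed twice at exit:", destroyed_twice_at_exit(err))
```

Read this way, the log seems to show one std::string, allocated by the static initializer for logging.cc inside eager_generator, being destroyed first by the executable's exit handlers and then a second time when libphi.so is finalized through __cxa_finalize. That reading is an interpretation of the trace, not a diagnosis confirmed in the thread.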

The build succeeds on our CentOS machines. Are you building inside Docker?

ronny1996 commented 8 months ago

Hello, does it build correctly inside a container?

TimeYWL commented 8 months ago

Hello, does it build correctly inside a container?

I am already using an image, pulled from 光源 (sourcefind.cn):
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-centos7.6-dtk23.10.1-py38

llseek commented 8 months ago

The build succeeds on our CentOS machines. Are you building inside Docker?

Could you share your build environment details, such as the Python version and dtk version?

WanwanLinLin commented 8 months ago

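The build succeeds on our CentOS machines. Are you building inside Docker?

No, I am building natively on CentOS 7. If I do not build the Python library, Paddle builds successfully; otherwise it fails with the error above. Also, I do not see a CentOS 7 image among the official images you provide.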