Closed eee4017 closed 1 year ago
The problem has been reproduced and is under investigation.
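Root cause identified: 1. The inference-related tests link two copies of brpc, one linked statically and one pulled in through the shared library (libpaddle_inference.so). 2. gcc 11 may use a more aggressive inlining strategy, so the constructor and the destructor end up bound to static variables from two different copies.
Possible fixes: 1. Sort out the link logic between the libraries to avoid duplicate linking. 2. Roll back the gcc version. 3. Use the -Wl,-Bsymbolic build option.
Option 3 has been tested and works: https://github.com/PaddlePaddle/Paddle/pull/53512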
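One way to check for the kind of duplicate brpc linkage described above is to compare the symbols defined in a test binary with those defined in the shared library. The following is a minimal sketch, not part of Paddle's tooling: it assumes GNU `nm` is available and that names are left mangled, and the sample listings and symbol names are hypothetical.

```python
import subprocess


def nm_defined(path, dynamic=False):
    """Run GNU nm on a binary and return the listing of defined symbols.

    Pass dynamic=True for a shared library to read its dynamic symbol table.
    """
    cmd = ["nm", "--defined-only"] + (["-D"] if dynamic else []) + [path]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def defined_symbols(nm_output):
    """Collect symbol names from lines of the form '<address> <type> <name>'."""
    symbols = set()
    for line in nm_output.splitlines():
        parts = line.split()
        # Undefined symbols have no address field, so they split into 2 parts.
        if len(parts) == 3:
            symbols.add(parts[2])
    return symbols


def duplicated_symbols(nm_a, nm_b, needle="brpc"):
    """Return symbols containing `needle` that are defined in both listings."""
    common = defined_symbols(nm_a) & defined_symbols(nm_b)
    return sorted(name for name in common if needle in name)


# Hypothetical nm listings, used here instead of real binaries.
SAMPLE_TEST_BINARY = """\
0000000000401000 T main
0000000000404040 B _ZN4brpc12hypotheticalE
                 U printf
"""
SAMPLE_SHARED_LIB = """\
0000000000001100 T paddle_hypothetical_entry
0000000000004020 B _ZN4brpc12hypotheticalE
"""

print(duplicated_symbols(SAMPLE_TEST_BINARY, SAMPLE_SHARED_LIB))
```

On a real build, the two listings would come from something like `nm_defined(test_binary_path)` and `nm_defined(shared_lib_path, dynamic=True)`; any brpc symbol reported in both is a candidate for the two-copies problem.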
We discovered numerous test failures in the multi-GPU system on Ubuntu 22.04, possibly related to collective communication. The root cause remains uncertain, but it could be attributed to either brpc or gloo.
Error summary: A segmentation fault was detected by the operating system, along with TCP receive and send errors related to connection reset by peer and broken pipe, respectively.
The following is an abridged version of the error message:
[...]
Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10)
==== backtrace (tid: 23859) ====
0 0x0000000000042520 __sigaction() ???:0
1 0x000000000013c159 PyBytes_FromStringAndSize() ???:0
2 0x000000000425d3f0 std::mersenne_twister_engine<unsigned long, 64ul, 312ul, 156ul, 31ul, 13043109905998158313ul, 29ul, 6148914691236517205ul, 17ul, 8202884508482404352ul, 37ul, 18444473444759240704ul, 43ul, 6364136223846793005ul>::seed<std::seed_seq>() ???:0
3 0x0000000003e622f8 std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::vector() ???:0
4 0x000000000015c99e PyObject_CallFunctionObjArgs() ???:0
[...]
FatalError: `Segmentation fault` is detected by the operating system.
[TimeInfo: *** Aborted at 1683483375 (unix time) try "date -d @1683483375" if you are using GNU date ***]
[SignalInfo: *** SIGSEGV (@0x159b) received by PID 5531 (TID 0x7fc38451f1c0) from PID 5531 ***]
E[2023-05-07 18:16:19,787] [ WARNING] fleet.py:296 - The dygraph parallel environment has been initialized.
E[2023-05-07 18:16:19,788] [ WARNING] fleet.py:296 - The dygraph parallel environment has been initialized.
E[2023-05-07 18:16:19,788] [ WARNING] fleet.py:296 - The dygraph parallel environment has been initialized.
E[2023-05-07 18:16:19,789] [ WARNING] fleet.py:296 - The dygraph parallel environment has been initialized.
E
======================================================================
ERROR: test_base_case (__main__.TestPyLayer)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/opt/paddle/paddle/build/python/paddle/fluid/tests/unittests/dygraph_recompute_hybrid.py", line 160, in setUp
fleet.init(is_collective=True, strategy=strategy)
File "/opt/paddle/paddle/build/python/paddle/distributed/fleet/fleet.py", line 317, in init
self._init_hybrid_parallel_env()
File "/opt/paddle/paddle/build/python/paddle/distributed/fleet/fleet.py", line 423, in _init_hybrid_parallel_env
self._hcg = tp.HybridCommunicateGroup(self._topology)
File "/opt/paddle/paddle/build/python/paddle/distributed/fleet/base/topology.py", line 171, in __init__
self._mp_group, self._mp_comm_group = self._set_comm_group("model")
File "/opt/paddle/paddle/build/python/paddle/distributed/fleet/base/topology.py", line 256, in _set_comm_group
comm_group = paddle.distributed.new_group(ranks=group)
File "/opt/paddle/paddle/build/python/paddle/distributed/collective.py", line 424, in new_group
paddle.distributed.barrier(group=group)
File "/opt/paddle/paddle/build/python/paddle/distributed/collective.py", line 280, in barrier
task = group.process_group.barrier()
ValueError: (InvalidArgument) TCP receive error. Details: Connection reset by peer.
[Hint: Expected byte_received > 0, but received byte_received:-1 <= 0:0.] (at /opt/paddle/paddle/paddle/fluid/distributed/store/tcp_utils.h:102)
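The hint in the traceback shows a check of the form byte_received > 0 failing on a receive call. A receive usually fails this way when the process on the other end has gone away, which would be consistent with the segmentation fault above, although the thread does not confirm that link. The sketch below is an illustration in Python using a local socket pair, not Paddle's tcp_utils.h code; the helper name recv_checked is hypothetical.

```python
import socket


def recv_checked(sock, nbytes):
    """Receive up to nbytes and raise if the peer is gone.

    Mirrors the idea of a 'byte_received > 0' check on a stream socket.
    """
    try:
        data = sock.recv(nbytes)
    except ConnectionResetError as exc:
        raise RuntimeError("receive error: connection reset by peer") from exc
    if len(data) <= 0:
        # An empty read on a stream socket means the peer closed its end.
        raise RuntimeError("receive error: peer closed the connection")
    return data


# Two connected sockets stand in for two ranks talking to each other.
rank_a, rank_b = socket.socketpair()

rank_a.sendall(b"ok")
first = recv_checked(rank_b, 2)  # peer is alive, data arrives normally

rank_a.close()  # simulate the peer process exiting
try:
    recv_checked(rank_b, 2)
    peer_gone = False
except RuntimeError:
    peer_gone = True

rank_b.close()
print(first, peer_gone)
```

In this reading, the TCP errors are a symptom seen by the surviving ranks rather than the root cause, so the crashing rank's log is the more useful one to inspect first.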
A total of 41 tests have failed, with details listed below:
- test_pipeline_parallel
- test_collective_split_embedding
- test_collective_allgather_object_api
- test_collective_alltoall_api
- test_collective_alltoall_single
- test_collective_alltoall_single_api
- test_collective_barrier_api
- test_collective_broadcast_api
- test_collective_global_gather
- test_collective_global_scatter
- test_collective_isend_irecv_api
- test_collective_process_group
- test_collective_reduce_api
- test_collective_reduce_scatter_api
- test_collective_scatter_api
- test_collective_sendrecv_api
- test_collective_split_col_linear
- test_collective_split_row_linear
- test_communication_stream_allgather_api
- test_communication_stream_allreduce_api
- test_communication_stream_alltoall_api
- test_communication_stream_alltoall_single_api
- test_communication_stream_broadcast_api
- test_communication_stream_reduce_api
- test_communication_stream_reduce_scatter_api
- test_communication_stream_scatter_api
- test_communication_stream_sendrecv_api
- test_eager_dist_api
- test_world_size_and_rank
- test_parallel_margin_cross_entropy
- test_parallel_dygraph_transformer
- test_parallel_dygraph_mp_layers
- test_tcp_store
- test_dygraph_sharding_stage3_for_eager
- test_parallel_dygraph_pipeline_parallel
- test_parallel_dygraph_pipeline_parallel_with_virtual_stage
- test_parallel_class_center_sample
- test_dygraph_sharding_stage2
- test_parallel_dygraph_control_flow
- test_parallel_dygraph_sharding_parallel
- test_parallel_dygraph_tensor_parallel
- test_dygraph_group_sharded_api_for_eager
- test_parallel_dygraph_unused_variables
- test_parallel_dygraph_qat
- test_parallel_dygraph_sparse_embedding
- test_parallel_dygraph_sparse_embedding_over_height
- test_dist_mnist_dgc_nccl
Does #53512 not resolve your problem?
Yes, the initial linking issue has been resolved with #53512. However, tests associated with the Paddle distributed module continue to fail for reasons that remain unclear.
Cause of the failures in tests such as test_collective_xxx: a logic error.
We ran some of the distributed tests and saw the following errors:
At the moment, the test errors are unrelated to BRPC libraries.
Hi @liuzhenhai93
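We found that using -Wl,-Bsymbolic solves the brpc problem but also introduces a new one: with -Wl,-Bsymbolic, a Singleton gets created multiple times across different files. For example, test_analysis_predictor fails because multiple ResourceManager instances are created. So the fundamental fix should be to sort out the dependency relationships between the libraries and avoid duplicate linking.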
A version script, which provides fine-grained control over symbol export, may fix this: use a version script to hide the brpc symbols from export.
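To make the version-script idea concrete: a linker version script lists glob patterns for symbols that stay exported and can hide everything else with a catch-all local rule. The sketch below only simulates that selection in Python so the effect on brpc symbols is visible. It is a simplification of GNU ld's matching (the real linker has extra rules, such as exact names taking precedence over wildcards), and the pattern and symbol names are hypothetical rather than taken from Paddle's build.

```python
from fnmatch import fnmatchcase

# A version script of roughly this shape is what the simulation models:
#
#   {
#     global: *paddle*;
#     local: *;
#   };


def exported(symbols, global_patterns):
    """Return the symbols that match at least one `global` pattern.

    Everything else is treated as hidden by the catch-all `local: *` rule.
    """
    return [s for s in symbols if any(fnmatchcase(s, p) for p in global_patterns)]


# Hypothetical mangled names: one from the library's own namespace, one from brpc.
symbols = ["_ZN6paddle12hypotheticalEv", "_ZN4brpc12hypotheticalE"]

visible = exported(symbols, ["*paddle*"])
hidden = [s for s in symbols if s not in visible]
print("exported:", visible)
print("hidden:", hidden)
```

With the brpc symbols hidden, a test binary that links its own static brpc no longer has its references mixed with the copy inside the shared library, which is the effect the comment above is after.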
Actually, I think the dependencies of libinference are a little messy. I notice that analysis_predictor is listed in both SRCS and DEPS. However, analysis_predictor has already been compiled as a static library, so why would we add it as one of the SRCS of lib_inference.so? @liuzhenhai93
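Update: test_analyzer_ner originally hit the brpc error. https://github.com/PaddlePaddle/Paddle/pull/53512 later fixed the brpc error for this unit test, and it now runs successfully. However, using -Wl,-Bsymbolic caused test_analysis_predictor to fail, so the PR https://github.com/PaddlePaddle/Paddle/pull/54229 removed the flag.
On the current develop branch, I tested test_analyzer_ner and test_analysis_predictor and both pass. Please test again with the latest commit on develop.
What actually solves the problem is the logic that skips --version-script when compiling the unit tests.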
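We plan to refactor the build code of the inference module, which will take time. Based on the previous reply, this does not seem to affect the inference unit-test problems discussed in this issue.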
Issue Description
We attempted to compile Paddle on Ubuntu 22.04, even though it was not initially supported. However, we managed to successfully compile it after making certain modifications. Specifically, we upgraded certain libraries and added some compiler flags to the cmake configuration to build Paddle on Ubuntu 22.04. Here are the steps we followed:
- In cmake/external/protobuf.cmake, we added the flag -DCMAKE_CXX_STANDARD=14 to build protobuf with C++14.
- In cmake/external/gloo.cmake, we added the compile flag -Wno-error=uninitialized by setting the variable GLOO_CXX_FLAGS and appending it to CMAKE_CXX_FLAGS.
- In cmake/external/brpc.cmake, we updated the brpc library to version 1.4.0 by setting the GIT_TAG to 1.4.0.
- In cmake/generic.cmake, we linked Paddle with OpenSSL by adding the following line at line 140: target_link_libraries(${TARGET_NAME} OpenSSL::SSL OpenSSL::Crypto)
- We added an include of <cstddef> in paddle/fluid/memory/allocation/memory_block.h.
However, we encountered runtime errors related to the third-party BRPC libraries, which caused subprocesses to abort in multiple tests. We tested several versions of the BRPC libraries, including 1.4.0, 1.3.0, and 1.2.0 (the last of which required additional CMake modifications). Unfortunately, the error message persisted across all versions.
We also built BRPC independently on Ubuntu 22.04 and ran its unit tests, which passed successfully. The error message that we encountered is shown below:
Version & Environment Information
Paddle version: N/A
Paddle With CUDA: N/A
OS: ubuntu 22.04
GCC version: (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Clang version: N/A
CMake version: version 3.25.1
Libc version: glibc 2.35
Python version: 3.10.6
CUDA version: 12.1.66 Build cuda_12.1.r12.1/compiler.32415258_0
cuDNN version: 8.8.1
Nvidia driver version: 525.85.12
Nvidia driver List: GPU 0: NVIDIA Graphics Device