PaddlePaddle / Paddle

PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the "PaddlePaddle" core framework: high-performance single-machine and distributed training and cross-platform deployment for deep learning & machine learning)
http://www.paddlepaddle.org/
Apache License 2.0

Compiling Paddle on Ubuntu 22.04 and Encountering Runtime Errors with BRPC Libraries #52842

Closed eee4017 closed 1 year ago

eee4017 commented 1 year ago

Issue Description

We attempted to compile Paddle on Ubuntu 22.04, which was not officially supported at the time, and succeeded after a few modifications: we upgraded certain third-party libraries and added some compiler flags to the cmake configuration. Here are the steps we followed:

  1. In cmake/external/protobuf.cmake, we added the flag -DCMAKE_CXX_STANDARD=14 to build protobuf with C++14.
  2. In cmake/external/gloo.cmake, we added the compile flag -Wno-error=uninitialized by setting the variable GLOO_CXX_FLAGS and appending it to CMAKE_CXX_FLAGS.
    set(GLOO_CXX_FLAGS "-Wno-error=uninitialized")
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${GLOO_CXX_FLAGS}")
  3. In cmake/external/brpc.cmake, we updated the brpc library to version 1.4.0 by setting the GIT_TAG to 1.4.0.
  4. In cmake/generic.cmake, we linked Paddle with OpenSSL by adding the following line to line 140: target_link_libraries(${TARGET_NAME} OpenSSL::SSL OpenSSL::Crypto)
  5. We included the header <cstddef> in paddle/fluid/memory/allocation/memory_block.h.

However, we encountered runtime errors related to the third-party BRPC libraries, which caused subprocesses to abort in multiple tests. We tried several BRPC versions, including 1.4.0, 1.3.0, and 1.2.0 (the last of which required additional CMake modifications). Unfortunately, the error persisted across all of them.

We also built BRPC independently on Ubuntu 22.04 and ran its unit tests, which passed successfully. The error message that we encountered is shown below:

E0321 12:00:13.890467 22940 variable.cpp:179] Already exposed `pid' whose value is `22940'
E0321 12:00:13.890648 22940 variable.cpp:179] Already exposed `ppid' whose value is `271'
E0321 12:00:13.890681 22940 variable.cpp:179] Already exposed `pgrp' whose value is `7'
E0321 12:00:13.890710 22940 variable.cpp:179] Already exposed `process_username' whose value is `unknown (No such device or address)'
E0321 12:00:13.890838 22940 variable.cpp:179] Already exposed `process_faults_minor_second' whose value is `0'
E0321 12:00:13.890882 22940 variable.cpp:179] Already exposed `process_priority' whose value is `20'
E0321 12:00:13.890914 22940 variable.cpp:179] Already exposed `process_nice' whose value is `0'
E0321 12:00:13.890945 22940 variable.cpp:179] Already exposed `process_thread_count' whose value is `3'
E0321 12:00:13.890974 22940 variable.cpp:179] Already exposed `process_fd_count' whose value is `4'
E0321 12:00:13.891005 22940 variable.cpp:179] Already exposed `process_memory_virtual' whose value is `1189826560'
E0321 12:00:13.891041 22940 variable.cpp:179] Already exposed `process_memory_resident' whose value is `177086464'
E0321 12:00:13.891070 22940 variable.cpp:179] Already exposed `process_memory_shared' whose value is `143224832'
E0321 12:00:13.891099 22940 variable.cpp:179] Already exposed `process_memory_text' whose value is `2158592'
E0321 12:00:13.891130 22940 variable.cpp:179] Already exposed `process_memory_data_and_stack' whose value is `187883520'
E0321 12:00:13.891160 22940 variable.cpp:179] Already exposed `system_loadavg_1m' whose value is `9.74'
E0321 12:00:13.891206 22940 variable.cpp:179] Already exposed `system_loadavg_5m' whose value is `10.51'
E0321 12:00:13.891244 22940 variable.cpp:179] Already exposed `system_loadavg_15m' whose value is `9.09'
E0321 12:00:13.891301 22940 variable.cpp:179] Already exposed `process_io_read_bytes_second' whose value is `0'
E0321 12:00:13.891332 22940 variable.cpp:179] Already exposed `process_io_write_bytes_second' whose value is `0'
E0321 12:00:13.891369 22940 variable.cpp:179] Already exposed `process_io_read_second' whose value is `0'
E0321 12:00:13.891400 22940 variable.cpp:179] Already exposed `process_io_write_second' whose value is `0'
E0321 12:00:13.891434 22940 variable.cpp:179] Already exposed `process_disk_read_bytes_second' whose value is `0'
E0321 12:00:13.891465 22940 variable.cpp:179] Already exposed `process_disk_write_bytes_second' whose value is `0'

Version & Environment Information


Paddle version: N/A
Paddle With CUDA: N/A

OS: Ubuntu 22.04
GCC version: (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Clang version: N/A
CMake version: 3.25.1
Libc version: glibc 2.35
Python version: 3.10.6

CUDA version: 12.1.66, Build cuda_12.1.r12.1/compiler.32415258_0
cuDNN version: 8.8.1
Nvidia driver version: 525.85.12
Nvidia driver list: GPU 0: NVIDIA Graphics Device


tianshuo78520a commented 1 year ago

The issue has been reproduced; we are investigating.

liuzhenhai93 commented 1 year ago

Root cause identified:
1. The inference-related tests link two copies of brpc: one statically, and one pulled in through the shared library (libpaddle_inference.so).
2. gcc 11 may use a more aggressive inlining strategy, so constructors and destructors end up referencing static variables from two different copies.

Possible fixes:
1. Sort out the linking logic between the libraries to avoid duplicate linking.
2. Roll back the gcc version.
3. Add the -Wl,-Bsymbolic link option.

Option 3 has been tested and works: https://github.com/PaddlePaddle/Paddle/pull/53512
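Option 3 corresponds to a link flag along these lines (a cmake sketch only; where Paddle actually injects the flag is decided in the PR above):

```cmake
# Option 3 (sketch): bind references inside the shared library to its own
# symbol definitions at link time, so the embedded brpc copy is used
# consistently instead of being interposed by the other copy
set(CMAKE_SHARED_LINKER_FLAGS "${CMAKE_SHARED_LINKER_FLAGS} -Wl,-Bsymbolic")
```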

eee4017 commented 1 year ago

We discovered numerous test failures on a multi-GPU system running Ubuntu 22.04, possibly related to collective communication. The root cause remains uncertain; it could lie in either brpc or gloo.

Error summary: the operating system detected a segmentation fault, along with TCP receive and send errors ("connection reset by peer" and "broken pipe", respectively).

The following is an abridged version of the error message:

[...]
Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10)
==== backtrace (tid:  23859) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x000000000013c159 PyBytes_FromStringAndSize()  ???:0
 2 0x000000000425d3f0 std::mersenne_twister_engine<unsigned long, 64ul, 312ul, 156ul, 31ul, 13043109905998158313ul, 29ul, 6148914691236517205ul, 17ul, 8202884508482404352ul, 37ul, 18444473444759240704ul, 43ul, 6364136223846793005ul>::seed<std::seed_seq>()  ???:0
 3 0x0000000003e622f8 std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::vector()  ???:0
 4 0x000000000015c99e PyObject_CallFunctionObjArgs()  ???:0
[...]
FatalError: `Segmentation fault` is detected by the operating system.
  [TimeInfo: *** Aborted at 1683483375 (unix time) try "date -d @1683483375" if you are using GNU date ***]
  [SignalInfo: *** SIGSEGV (@0x159b) received by PID 5531 (TID 0x7fc38451f1c0) from PID 5531 ***]

E[2023-05-07 18:16:19,787] [ WARNING] fleet.py:296 - The dygraph parallel environment has been initialized.
E[2023-05-07 18:16:19,788] [ WARNING] fleet.py:296 - The dygraph parallel environment has been initialized.
E[2023-05-07 18:16:19,788] [ WARNING] fleet.py:296 - The dygraph parallel environment has been initialized.
E[2023-05-07 18:16:19,789] [ WARNING] fleet.py:296 - The dygraph parallel environment has been initialized.
E
======================================================================
ERROR: test_base_case (__main__.TestPyLayer)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/paddle/paddle/build/python/paddle/fluid/tests/unittests/dygraph_recompute_hybrid.py", line 160, in setUp
    fleet.init(is_collective=True, strategy=strategy)
  File "/opt/paddle/paddle/build/python/paddle/distributed/fleet/fleet.py", line 317, in init
    self._init_hybrid_parallel_env()
  File "/opt/paddle/paddle/build/python/paddle/distributed/fleet/fleet.py", line 423, in _init_hybrid_parallel_env
    self._hcg = tp.HybridCommunicateGroup(self._topology)
  File "/opt/paddle/paddle/build/python/paddle/distributed/fleet/base/topology.py", line 171, in __init__
    self._mp_group, self._mp_comm_group = self._set_comm_group("model")
  File "/opt/paddle/paddle/build/python/paddle/distributed/fleet/base/topology.py", line 256, in _set_comm_group
    comm_group = paddle.distributed.new_group(ranks=group)
  File "/opt/paddle/paddle/build/python/paddle/distributed/collective.py", line 424, in new_group
    paddle.distributed.barrier(group=group)
  File "/opt/paddle/paddle/build/python/paddle/distributed/collective.py", line 280, in barrier
    task = group.process_group.barrier()
ValueError: (InvalidArgument) TCP receive error. Details: Connection reset by peer.
  [Hint: Expected byte_received > 0, but received byte_received:-1 <= 0:0.] (at /opt/paddle/paddle/paddle/fluid/distributed/store/tcp_utils.h:102)
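The "Connection reset by peer" above is the reader-side symptom of a peer aborting the connection, for example after crashing. A standalone Python sketch (unrelated to Paddle's tcp_utils) that reproduces the same errno: closing a socket with SO_LINGER timeout 0 sends a TCP RST instead of a graceful FIN, and the blocked reader then fails with ECONNRESET.

```python
import socket
import struct
import threading

# Server: accept one connection, then abort it with an RST.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
port = srv.getsockname()[1]

def serve():
    conn, _ = srv.accept()
    # linger on, timeout 0: close() aborts the connection with a TCP RST
    conn.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    conn.close()

t = threading.Thread(target=serve)
t.start()

cli = socket.socket()
cli.connect(("127.0.0.1", port))
t.join()                       # server side has closed (with RST) by now
try:
    cli.recv(1)                # the pending RST surfaces here
    result = "clean EOF"
except ConnectionResetError:
    result = "Connection reset by peer"
cli.close()
srv.close()
print(result)
```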

A total of 41 tests have failed, with details listed below:

- test_pipeline_parallel
- test_collective_split_embedding
- test_collective_allgather_object_api
- test_collective_alltoall_api
- test_collective_alltoall_single
- test_collective_alltoall_single_api
- test_collective_barrier_api
- test_collective_broadcast_api
- test_collective_global_gather
- test_collective_global_scatter
- test_collective_isend_irecv_api
- test_collective_process_group
- test_collective_reduce_api
- test_collective_reduce_scatter_api
- test_collective_scatter_api
- test_collective_sendrecv_api
- test_collective_split_col_linear
- test_collective_split_row_linear
- test_communication_stream_allgather_api
- test_communication_stream_allreduce_api
- test_communication_stream_alltoall_api
- test_communication_stream_alltoall_single_api
- test_communication_stream_broadcast_api
- test_communication_stream_reduce_api
- test_communication_stream_reduce_scatter_api
- test_communication_stream_scatter_api
- test_communication_stream_sendrecv_api
- test_eager_dist_api
- test_world_size_and_rank
- test_parallel_margin_cross_entropy
- test_parallel_dygraph_transformer
- test_parallel_dygraph_mp_layers
- test_tcp_store
- test_dygraph_sharding_stage3_for_eager
- test_parallel_dygraph_pipeline_parallel
- test_parallel_dygraph_pipeline_parallel_with_virtual_stage
- test_parallel_class_center_sample
- test_dygraph_sharding_stage2
- test_parallel_dygraph_control_flow
- test_parallel_dygraph_sharding_parallel
- test_parallel_dygraph_tensor_parallel
- test_dygraph_group_sharded_api_for_eager
- test_parallel_dygraph_unused_variables
- test_parallel_dygraph_qat
- test_parallel_dygraph_sparse_embedding
- test_parallel_dygraph_sparse_embedding_over_height
- test_dist_mnist_dgc_nccl
liuzhenhai93 commented 1 year ago

Does #53512 not resolve your problem?

eee4017 commented 1 year ago

Does #53512 not resolve your problem?

Yes, the initial linking issue has been resolved with #53512. However, tests associated with the Paddle distributed module continue to fail for reasons that remain unclear.

liuzhenhai93 commented 1 year ago

The cause of test failures like test_collective_xxx is a logic error.

danleifeng commented 1 year ago

We ran several distributed tests and encountered errors (screenshots omitted).

At the moment, the test errors are unrelated to BRPC libraries.

tianshuo78520a commented 1 year ago

https://github.com/PaddlePaddle/Paddle/pull/53795 resolves the problem.

zlsh80826 commented 1 year ago

Root cause identified:
1. The inference-related tests link two copies of brpc: one statically, and one pulled in through the shared library (libpaddle_inference.so).
2. gcc 11 may use a more aggressive inlining strategy, so constructors and destructors end up referencing static variables from two different copies.

Possible fixes:
1. Sort out the linking logic between the libraries to avoid duplicate linking.
2. Roll back the gcc version.
3. Add the -Wl,-Bsymbolic link option.

Option 3 has been tested and works: #53512

Hi @liuzhenhai93, we found that -Wl,-Bsymbolic fixes the brpc problem but introduces a new one: with -Wl,-Bsymbolic, a singleton can be instantiated multiple times across different files. For example, test_analysis_predictor fails because multiple ResourceManager instances are created. So the fundamental fix should be to sort out the linking relationships between the libraries and avoid duplicate linking.

liuzhenhai93 commented 1 year ago

A version script, which provides fine-grained control over symbol export, may fix this: use a version script to hide the brpc symbols from export.
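Such a version script might look like the following sketch (the file name and symbol patterns are illustrative, not Paddle's actual map file), passed to the linker with `-Wl,--version-script=hide_brpc.map`:

```
/* hide_brpc.map (hypothetical): export only the public API symbols,
   keep everything else -- including the embedded brpc -- local */
{
  global:
    *paddle*;
  local:
    *;
};
```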

zlsh80826 commented 1 year ago

Actually, I think the dependencies of libpaddle_inference are a little messy. I noticed that analysis_predictor is listed in both SRCS and DEPS. However, analysis_predictor has already been compiled as a static library, so why would we add it to the SRCS of libpaddle_inference.so? @liuzhenhai93

yuanlehome commented 1 year ago

An update: test_analyzer_ner originally hit the brpc error; https://github.com/PaddlePaddle/Paddle/pull/53512 fixed the brpc error for that unit test, which now runs successfully. However, -Wl,-Bsymbolic then caused test_analysis_predictor to fail, so https://github.com/PaddlePaddle/Paddle/pull/54229 removed the flag.

On the current develop branch, both test_analyzer_ner and test_analysis_predictor pass in my testing, so please test again with the latest develop commit.

What essentially solves the problem is the logic that skips --version-script when building the unit tests (screenshot omitted).

yuanlehome commented 1 year ago

Actually, I think the dependencies of libpaddle_inference are a little messy. I noticed that analysis_predictor is listed in both SRCS and DEPS. However, analysis_predictor has already been compiled as a static library, so why would we add it to the SRCS of libpaddle_inference.so? @liuzhenhai93

We plan to refactor the build code of the inference module, but that will take time. Based on the previous reply, this does not appear to affect the inference unit-test problems covered in this issue.