PaddlePaddle / Paddle

PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the PaddlePaddle "飞桨" core framework: high-performance single-machine and distributed training and cross-platform deployment for deep learning and machine learning)
http://www.paddlepaddle.org/
Apache License 2.0

PaddlePaddle 2.6.0 buglist, part 1 #60882

Closed · jeng1220 closed this issue 7 months ago

jeng1220 commented 9 months ago

Describe the Bug

Running the unit tests on Ampere or Hopper GPUs produces multiple failures. For now, 24 of them have been compiled in: PaddlePaddle 2.6.0 buglist - part 1.xlsx

Additional Supplementary Information

  • Paddle version: 2.6.0
  • Paddle With CUDA: True
  • OS: Ubuntu 22.04
  • GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
  • Clang version: N/A
  • CMake version: 3.25.1
  • Libc version: glibc 2.35
  • Python version: 3.10.12
  • CUDA version: 12.3.107 (Build cuda_12.3.r12.3/compiler.33567101_0)
  • cuDNN version: 8.9.7
  • Nvidia driver version: 535.129.03
  • Nvidia driver List:
      GPU 0: Tesla V100-SXM2-16GB
      GPU 1: Tesla V100-SXM2-16GB
      GPU 2: Tesla V100-SXM2-16GB
      GPU 3: Tesla V100-SXM2-16GB
      GPU 4: Tesla V100-SXM2-16GB
      GPU 5: Tesla V100-SXM2-16GB
      GPU 6: Tesla V100-SXM2-16GB
      GPU 7: Tesla V100-SXM2-16GB
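For reference, the version details above can be gathered from a shell in one pass; a minimal sketch, assuming the paddle.version.cuda() and paddle.version.cudnn() helpers are available in this build:

# Hedged sketch: print Paddle version, CUDA support, and build-time CUDA/cuDNN versions.
python -c "import paddle; print(paddle.__version__, paddle.is_compiled_with_cuda())"
python -c "import paddle; print(paddle.version.cuda(), paddle.version.cudnn())"
# Driver version and GPU list from the NVIDIA driver itself.
nvidia-smi --query-gpu=index,name,driver_version --format=csv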

jeng1220 commented 9 months ago

@onecatcn, after updating release/2.6 and fixing the nvrtc issue (#60943), there are still quite a few failures. The buglist in the first post has been updated.

jeng1220 commented 9 months ago

@onecatcn, I tried develop (4db394f5a530e9f1a324ca272fcbe4c442e5a747); there are still 12 serious failures. Reproduction script:

#!/bin/bash
set -x
cd paddle/build
ctest --output-on-failure -R test_cuda_graph_partial_graph_static_run
ctest --output-on-failure -R test_graph_reindex
ctest --output-on-failure -R test_cuda_graphed_layer
ctest --output-on-failure -R test_unique
ctest --output-on-failure -R test_weight_decay
ctest --output-on-failure -R test_unique_static_build
ctest --output-on-failure -R test_post_training_quantization_resnet50
ctest --output-on-failure -R test_communicator_half_async
ctest --output-on-failure -R test_trt_convert_scatter
ctest --output-on-failure -R test_trt_convert_assign
ctest --output-on-failure -R test_trt_convert_lookup_table
ctest --output-on-failure -R test_post_training_quantization_mobilenetv1
ctest --output-on-failure -R test_trt_convert_yolo_box # times out

log: unittest-dev.log
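If it helps, the same tests can be driven in a single ctest invocation by OR-ing the names into one regex, and ctest's --timeout flag can cap the test that times out; a minimal sketch (the 900-second limit is an assumption, not a value from this report):

#!/bin/bash
set -x
cd paddle/build
# Run the suspect tests in one pass; --timeout caps each test's wall time.
ctest --output-on-failure --timeout 900 -R \
  "test_cuda_graph_partial_graph_static_run|test_graph_reindex|test_cuda_graphed_layer|test_unique|test_weight_decay|test_unique_static_build|test_post_training_quantization_resnet50|test_communicator_half_async|test_trt_convert_scatter|test_trt_convert_assign|test_trt_convert_lookup_table|test_post_training_quantization_mobilenetv1|test_trt_convert_yolo_box"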

The following have already been fixed on the develop branch but not in release/2.6.0 and need to be cherry-picked:
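For completeness, a generic cherry-pick flow onto release/2.6 might look like the sketch below; the branch name and <fix-sha> are placeholders, not commits identified in this thread:

# Hypothetical flow for porting a develop fix to release/2.6.
git fetch origin develop release/2.6
git checkout -b cherry-pick-fix-2.6 origin/release/2.6
git cherry-pick <fix-sha>             # placeholder commit from develop
git push origin cherry-pick-fix-2.6   # then open a PR targeting release/2.6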

jeng1220 commented 9 months ago

Please prioritize the following unit tests:

  • test_layer_norm_op_static_build (Failed)
  • test_layer_norm_op (Failed)
  • test_llm_int8_linear (Failed)
  • test_post_training_quantization_resnet50 (Failed)
  • test_post_training_quantization_mobilenetv1 (Failed)

tianshuo78520a commented 9 months ago

Please prioritize the following unit tests:

  • test_layer_norm_op_static_build (Failed)
  • test_layer_norm_op (Failed)
  • test_llm_int8_linear (Failed)
  • test_post_training_quantization_resnet50 (Failed)
  • test_post_training_quantization_mobilenetv1 (Failed)

I have already assigned owners to investigate; this will be resolved as soon as possible.

tianshuo78520a commented 9 months ago

PR #61284 has been submitted to fix the following unit tests: test_post_training_quantization_resnet50 (Failed), test_post_training_quantization_mobilenetv1 (Failed).

jeng1220 commented 9 months ago

@leo0519 submitted #61377, which fixes this on develop.

XieYunshen commented 9 months ago

test_layer_norm_op_static_build (Failed) and test_layer_norm_op (Failed) have been fixed: https://github.com/PaddlePaddle/Paddle/pull/61631

onecatcn commented 8 months ago

Submitted https://github.com/PaddlePaddle/Paddle/pull/61591 to fix test_llm_int8_linear (Failed).

onecatcn commented 8 months ago

We are not able to reproduce the failures in the following 2 tests: test_sparse_fused_attention_op, trt_dynamic_shape_test.

zlsh80826 commented 8 months ago

@onecatcn We solved both test_sparse_fused_attention_op and trt_dynamic_shape_test. I will submit a PR to fix them.

tianshuo78520a commented 8 months ago

test_communicator_half_async fixed. PR: https://github.com/PaddlePaddle/Paddle/pull/62092

jeng1220 commented 8 months ago

@tianshuo78520a ,

test_communicator_half_async passed with V100 but still failed with Ampere GPU:

53/122 Test  #509: test_communicator_half_async .........................................***Failed    2.91 sec
[2024-02-27 13:01:33,775] [    INFO] distributed_strategy.py:214 - distributed strategy initialized
[2024-02-27 13:01:33,776] [    INFO] distributed_strategy.py:214 - distributed strategy initialized
I0227 13:01:33.808914 29935 program_interpreter.cc:212] New Executor is Running.
/opt/paddle/paddle/build/python/paddle/distributed/ps/the_one_ps.py:739: UserWarning: The PS mode must use MemorySparseTable.
  warnings.warn("The PS mode must use MemorySparseTable.")
/opt/paddle/paddle/build/python/paddle/distributed/ps/the_one_ps.py:750: UserWarning: The shard_num of sparse table is not set, use default value 1000 in cpups.
  warnings.warn(
/opt/paddle/paddle/build/python/paddle/distributed/ps/the_one_ps.py:772: UserWarning: The accessor of sparse table is not set, use default value.
  warnings.warn(
I0227 13:01:33.823201 29935 server.cpp:1107] Server[paddle::distributed::DownpourPsClientService] is serving on port=8500.
I0227 13:01:33.823215 29935 server.cpp:1110] Check out http://8dc94ca26cec:8500 in web browser.
I0227 13:01:33.823283 29935 brpc_ps_client.cc:131] BrpcPsClient Service addr: 192.168.128.5, 8500, 0
/opt/paddle/paddle/build/python/paddle/distributed/fleet/base/role_maker.py:329: UserWarning: gloo is not initialized, will not communicator with other nodes
  warnings.warn(self._err_init)
/opt/paddle/paddle/build/python/paddle/distributed/fleet/base/role_maker.py:373: UserWarning: gloo is not initialized, will not communicator with other nodes
  warnings.warn(self._err_init)
/opt/paddle/paddle/build/python/paddle/distributed/ps/the_one_ps.py:1249: UserWarning: gloo may not initialize correctly
  warnings.warn("gloo may not initialize correctly")
I0227 13:01:33.824164 29935 brpc_ps_client.cc:200] Client connect success:192.168.128.5:8500,
E0227 13:01:33.824612 30340 brpc_ps_client.cc:386] resquest cmd_id:11 failed, err:[E111]Fail to connect Socket{id=3 addr=127.0.0.1:52887} (0x0x558e724878d0): Connection refused [R1][E112]Not connected to 127.0.0.1:52887 yet, server_id=0 [R2][E112]Not connected to 127.0.0.1:52887 yet, server_id=0 [R3][E112]Not connected to 127.0.0.1:52887 yet, server_id=0
F0227 13:01:33.824656 29935 fleet.cc:445] Check failed: status == 0 push dense param failed, status[-1]
*** Check failure stack trace: ***
    @     0x7f6b458e3cd3  google::LogMessage::Fail()
    @     0x7f6b458e6254  google::LogMessage::SendToLog()
    @     0x7f6b458e3810  google::LogMessage::Flush()
    @     0x7f6b458e67cf  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f6b467c609a  paddle::distributed::FleetWrapper::PushDenseParamSync()
    @     0x7f6b45410064  (unknown)
    @     0x7f6b450edae3  (unknown)
    @     0x558e6b13810e  (unknown)
    @     0x558e6b12ea7b  _PyObject_MakeTpCall
    @     0x558e6b146acb  (unknown)
    @     0x558e6b126cfa  _PyEval_EvalFrameDefault
    @     0x558e6b1389fc  _PyFunction_Vectorcall
    @     0x558e6b12145c  _PyEval_EvalFrameDefault
    @     0x558e6b1389fc  _PyFunction_Vectorcall
    @     0x558e6b12145c  _PyEval_EvalFrameDefault
    @     0x558e6b1389fc  _PyFunction_Vectorcall
    @     0x558e6b1235d7  _PyEval_EvalFrameDefault
    @     0x558e6b1389fc  _PyFunction_Vectorcall
    @     0x558e6b1235d7  _PyEval_EvalFrameDefault
    @     0x558e6b1389fc  _PyFunction_Vectorcall
    @     0x558e6b1235d7  _PyEval_EvalFrameDefault
    @     0x558e6b1389fc  _PyFunction_Vectorcall
    @     0x558e6b1235d7  _PyEval_EvalFrameDefault
    @     0x558e6b1389fc  _PyFunction_Vectorcall
    @     0x558e6b1235d7  _PyEval_EvalFrameDefault
    @     0x558e6b1389fc  _PyFunction_Vectorcall
    @     0x558e6b1235d7  _PyEval_EvalFrameDefault
    @     0x558e6b1467f1  (unknown)
    @     0x558e6b126cfa  _PyEval_EvalFrameDefault
    @     0x558e6b1389fc  _PyFunction_Vectorcall
    @     0x558e6b12145c  _PyEval_EvalFrameDefault
    @     0x558e6b1389fc  _PyFunction_Vectorcall
Subprocess aborted
tianshuo78520a commented 8 months ago

@tianshuo78520a, test_communicator_half_async passed with V100 but still failed with Ampere GPU (full log in the comment above).

We attempted to reproduce it in an A100 environment but could not. Could you please confirm whether any repair code has been merged?

GhostScreaming commented 8 months ago

Re-tested test_semi_auto_parallel_hybrid_strategy locally; on the release/2.6 branch a timeout issue may occur. The Docker container needs a sufficiently large shared_memory setting, otherwise NCCL communication may fail. Fix PR: 62278
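To illustrate the shared-memory point, a container could be started with an enlarged /dev/shm roughly as below; the image name, mount path, and 32g size are placeholders, not values from this thread:

# Hedged sketch: give the container enough shared memory for NCCL's SHM transport.
docker run --gpus all --shm-size=32g \
    -v /path/to/paddle:/opt/paddle \
    -it <paddle-dev-image> /bin/bash
# Alternatively, --ipc=host shares the host's /dev/shm with the container.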

zlsh80826 commented 8 months ago

@onecatcn, https://github.com/PaddlePaddle/Paddle/pull/62477 fixes test_sparse_fused_attention_op and trt_dynamic_shape_test.

jeng1220 commented 8 months ago

@tianshuo78520a, test_communicator_half_async passed with V100 but still failed with Ampere GPU:

53/122 Test  #509: test_communicator_half_async .........................................***Failed    2.91 sec
...
*** Check failure stack trace: ***
    @     0x7f6b458e3cd3  google::LogMessage::Fail()
    @     0x7f6b458e6254  google::LogMessage::SendToLog()
    @     0x7f6b458e3810  google::LogMessage::Flush()
    @     0x7f6b458e67cf  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f6b467c609a  paddle::distributed::FleetWrapper::PushDenseParamSync()
    @     0x7f6b45410064  (unknown)
...
    @     0x558e6b1389fc  _PyFunction_Vectorcall
Subprocess aborted

We attempted to reproduce it in an A100 environment but could not. Could you please confirm whether any repair code has been merged?

After discussion: test_communicator_half_async only affects CPU computation, so we will disable it on our side.
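One hedged way to skip it in a local run while keeping the rest of the suite (ctest's -E flag excludes tests whose names match the regex):

# Exclude the flaky test from a full ctest run.
ctest --output-on-failure -E "test_communicator_half_async"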