@onecatcn, after updating to release/2.6 and resolving the nvrtc issue (#60943), there are still quite a few errors. The bug list in the first post has been updated.
@onecatcn, I tried develop (4db394f5a530e9f1a324ca272fcbe4c442e5a747) and there are still 12 critical errors. Reproduction script:
#!/bin/bash
set -x
cd paddle/build
ctest --output-on-failure -R test_cuda_graph_partial_graph_static_run
ctest --output-on-failure -R test_graph_reindex
ctest --output-on-failure -R test_cuda_graphed_layer
ctest --output-on-failure -R test_unique
ctest --output-on-failure -R test_weight_decay
ctest --output-on-failure -R test_unique_static_build
ctest --output-on-failure -R test_post_training_quantization_resnet50
ctest --output-on-failure -R test_communicator_half_async
ctest --output-on-failure -R test_trt_convert_scatter
ctest --output-on-failure -R test_trt_convert_assign
ctest --output-on-failure -R test_trt_convert_lookup_table
ctest --output-on-failure -R test_post_training_quantization_mobilenetv1
ctest --output-on-failure -R test_trt_convert_yolo_box # timeout
log: unittest-dev.log
The following were fixed on the develop branch but are not in release/2.6.0 and need to be cherry-picked (a sketch of the cherry-pick step follows the list below).
Please prioritize the following unit tests:
- test_layer_norm_op_static_build (Failed)
- test_layer_norm_op (Failed)
- test_llm_int8_linear (Failed)
- test_post_training_quantization_resnet50 (Failed)
- test_post_training_quantization_mobilenetv1 (Failed)
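For reference, a minimal sketch of how such fixes could be cherry-picked from develop onto the release branch; the commit hash is a placeholder, not an actual fix commit:

```bash
# Hypothetical cherry-pick of a develop fix onto release/2.6 (placeholder hash).
git fetch origin develop release/2.6
git checkout release/2.6
git cherry-pick <fix-commit-from-develop>   # resolve any conflicts, then open a PR against release/2.6
```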
I have assigned owners to investigate these and we will fix them as soon as possible.
PR 61284 has been submitted to fix the following unit tests: test_post_training_quantization_resnet50 (Failed) and test_post_training_quantization_mobilenetv1 (Failed).
@leo0519 submitted #61377, which fixes it on develop.
The test_layer_norm_op_static_build (Failed) and test_layer_norm_op (Failed) unit tests have been fixed: https://github.com/PaddlePaddle/Paddle/pull/61631
Submitted https://github.com/PaddlePaddle/Paddle/pull/61591 to fix test_llm_int8_linear (Failed).
We are not able to reproduce the failures in the following two tests: test_sparse_fused_attention_op and trt_dynamic_shape_test.
@onecatcn We solved both test_sparse_fused_attention_op and trt_dynamic_shape_test. I will submit a PR to fix them.
test_communicator_half_async fix PR: https://github.com/PaddlePaddle/Paddle/pull/62092
@tianshuo78520a, test_communicator_half_async passed with V100 but still failed with an Ampere GPU:
53/122 Test #509: test_communicator_half_async .........................................***Failed 2.91 sec
[2024-02-27 13:01:33,775] [ INFO] distributed_strategy.py:214 - distributed strategy initialized
[2024-02-27 13:01:33,776] [ INFO] distributed_strategy.py:214 - distributed strategy initialized
I0227 13:01:33.808914 29935 program_interpreter.cc:212] New Executor is Running.
/opt/paddle/paddle/build/python/paddle/distributed/ps/the_one_ps.py:739: UserWarning: The PS mode must use MemorySparseTable.
warnings.warn("The PS mode must use MemorySparseTable.")
/opt/paddle/paddle/build/python/paddle/distributed/ps/the_one_ps.py:750: UserWarning: The shard_num of sparse table is not set, use default value 1000 in cpups.
warnings.warn(
/opt/paddle/paddle/build/python/paddle/distributed/ps/the_one_ps.py:772: UserWarning: The accessor of sparse table is not set, use default value.
warnings.warn(
I0227 13:01:33.823201 29935 server.cpp:1107] Server[paddle::distributed::DownpourPsClientService] is serving on port=8500.
I0227 13:01:33.823215 29935 server.cpp:1110] Check out http://8dc94ca26cec:8500 in web browser.
I0227 13:01:33.823283 29935 brpc_ps_client.cc:131] BrpcPsClient Service addr: 192.168.128.5, 8500, 0
/opt/paddle/paddle/build/python/paddle/distributed/fleet/base/role_maker.py:329: UserWarning: gloo is not initialized, will not communicator with other nodes
warnings.warn(self._err_init)
/opt/paddle/paddle/build/python/paddle/distributed/fleet/base/role_maker.py:373: UserWarning: gloo is not initialized, will not communicator with other nodes
warnings.warn(self._err_init)
/opt/paddle/paddle/build/python/paddle/distributed/ps/the_one_ps.py:1249: UserWarning: gloo may not initialize correctly
warnings.warn("gloo may not initialize correctly")
I0227 13:01:33.824164 29935 brpc_ps_client.cc:200] Client connect success:192.168.128.5:8500,
E0227 13:01:33.824612 30340 brpc_ps_client.cc:386] resquest cmd_id:11 failed, err:[E111]Fail to connect Socket{id=3 addr=127.0.0.1:52887} (0x0x558e724878d0): Connection refused [R1][E112]Not connected to 127.0.0.1:52887 yet, server_id=0 [R2][E112]Not connected to 127.0.0.1:52887 yet, server_id=0 [R3][E112]Not connected to 127.0.0.1:52887 yet, server_id=0
F0227 13:01:33.824656 29935 fleet.cc:445] Check failed: status == 0 push dense param failed, status[-1]
*** Check failure stack trace: ***
@ 0x7f6b458e3cd3 google::LogMessage::Fail()
@ 0x7f6b458e6254 google::LogMessage::SendToLog()
@ 0x7f6b458e3810 google::LogMessage::Flush()
@ 0x7f6b458e67cf google::LogMessageFatal::~LogMessageFatal()
@ 0x7f6b467c609a paddle::distributed::FleetWrapper::PushDenseParamSync()
@ 0x7f6b45410064 (unknown)
@ 0x7f6b450edae3 (unknown)
@ 0x558e6b13810e (unknown)
@ 0x558e6b12ea7b _PyObject_MakeTpCall
@ 0x558e6b146acb (unknown)
@ 0x558e6b126cfa _PyEval_EvalFrameDefault
@ 0x558e6b1389fc _PyFunction_Vectorcall
@ 0x558e6b12145c _PyEval_EvalFrameDefault
@ 0x558e6b1389fc _PyFunction_Vectorcall
@ 0x558e6b12145c _PyEval_EvalFrameDefault
@ 0x558e6b1389fc _PyFunction_Vectorcall
@ 0x558e6b1235d7 _PyEval_EvalFrameDefault
@ 0x558e6b1389fc _PyFunction_Vectorcall
@ 0x558e6b1235d7 _PyEval_EvalFrameDefault
@ 0x558e6b1389fc _PyFunction_Vectorcall
@ 0x558e6b1235d7 _PyEval_EvalFrameDefault
@ 0x558e6b1389fc _PyFunction_Vectorcall
@ 0x558e6b1235d7 _PyEval_EvalFrameDefault
@ 0x558e6b1389fc _PyFunction_Vectorcall
@ 0x558e6b1235d7 _PyEval_EvalFrameDefault
@ 0x558e6b1389fc _PyFunction_Vectorcall
@ 0x558e6b1235d7 _PyEval_EvalFrameDefault
@ 0x558e6b1467f1 (unknown)
@ 0x558e6b126cfa _PyEval_EvalFrameDefault
@ 0x558e6b1389fc _PyFunction_Vectorcall
@ 0x558e6b12145c _PyEval_EvalFrameDefault
@ 0x558e6b1389fc _PyFunction_Vectorcall
Subprocess aborted
We attempted to replicate it in an A100 environment but were not successful. Could you please confirm whether any fix has already been merged?
Re-tested test_semi_auto_parallel_hybrid_strategy locally; on the release/2.6 branch it may time out. The Docker container needs a sufficiently large shared memory, otherwise NCCL communication may fail. Fix PR: 62278. A sketch of the container setting is shown below.
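A minimal sketch of launching the test container with a larger shared-memory size so NCCL communication does not fail; the image name and the 32g value are placeholders, not settings taken from the original report:

```bash
# Hypothetical container launch; --shm-size enlarges /dev/shm, which NCCL uses
# for intra-node communication. Adjust the size and image name to your setup.
docker run --gpus all --shm-size=32g -it --rm <paddle-dev-image> /bin/bash
```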
@onecatcn https://github.com/PaddlePaddle/Paddle/pull/62477 is the PR that fixes test_sparse_fused_attention_op and trt_dynamic_shape_test.
After discussion, test_communicator_half_async only affects CPU computing, so we will disable it on our side.
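A sketch of what disabling it could look like at the test-runner level, assuming the test is simply excluded from the ctest run rather than removed from the build:

```bash
# Run the suite while skipping test_communicator_half_async; -E excludes tests
# whose names match the given regex.
ctest --output-on-failure -E test_communicator_half_async
```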
Describe the Bug
Running the unit tests on an Ampere GPU or a Hopper GPU produces multiple errors. 24 errors have been compiled so far: PaddlePaddle 2.6.0 buglist - part 1.xlsx
Additional Supplementary Information
- Paddle version: 2.6.0
- Paddle With CUDA: True
- OS: Ubuntu 22.04
- GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
- Clang version: N/A
- CMake version: 3.25.1
- Libc version: glibc 2.35
- Python version: 3.10.12
- CUDA version: 12.3.107 (Build cuda_12.3.r12.3/compiler.33567101_0)
- cuDNN version: 8.9.7
- NVIDIA driver version: 535.129.03
- NVIDIA driver list: GPU 0-7: Tesla V100-SXM2-16GB (8 GPUs)