PaddlePaddle / PaddleDetection

Object Detection toolkit based on PaddlePaddle. It supports object detection, instance segmentation, multiple object tracking and real-time multi-person keypoint detection.
Apache License 2.0

Training with rtdetr_focalnet_L_384_3x_coco raises an error #8605

Closed vscv closed 1 year ago

vscv commented 1 year ago

Search before asking

Bug Component

Training

Describe the Bug

AssertionError: Variable dtype not match, Variable [ conv2d_0.w_0 ] need tensor with dtype paddle.float32 but load tensor with dtype paddle.float16

Training with rtdetr_focalnet_L_384_3x_coco fails with the error above.

Training with rtdetr_r50vd_6x_coco, however, works fine:

python -m paddle.distributed.launch --gpus 0,1,2,3,4,5,6,7 tools/train.py -c configs/rtdetr/rtdetr_r50vd_6x_coco.yml --fleet --eval

Environment

paddlepaddle-gpu==2.5.1.post120

$nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

$python3 -c "import platform;print(platform.architecture()[0]);print(platform.machine())"

64bit
x86_64

$python -c "import paddle; print(paddle.__version__)"

2.5.1

$python -c "import paddle; paddle.utils.run_check()"


Running verify PaddlePaddle program ... 
I0906 10:01:48.552414  1010 interpretercore.cc:237] New Executor is Running.
W0906 10:01:48.553376  1010 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 12.1, Runtime API Version: 12.0
W0906 10:01:48.554502  1010 gpu_resources.cc:149] device: 0, cuDNN Version: 8.9.
I0906 10:01:48.825856  1010 interpreter_util.cc:518] Standalone Executor is Used.
PaddlePaddle works well on 1 GPU.
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_selected_gpus', current_value='4', default_value='')
=======================================================================
I0906 10:01:50.359056  1065 tcp_utils.cc:107] Retry to connect to 127.0.0.1:57383 while the server is not yet listening.
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_selected_gpus', current_value='0', default_value='')
=======================================================================
I0906 10:01:50.404660  1061 tcp_utils.cc:181] The server starts to listen on IP_ANY:57383
I0906 10:01:50.404807  1061 tcp_utils.cc:130] Successfully connected to 127.0.0.1:57383
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_selected_gpus', current_value='2', default_value='')
=======================================================================
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_selected_gpus', current_value='3', default_value='')
=======================================================================
I0906 10:01:50.426640  1063 tcp_utils.cc:130] Successfully connected to 127.0.0.1:57383
I0906 10:01:50.426784  1064 tcp_utils.cc:130] Successfully connected to 127.0.0.1:57383
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_selected_gpus', current_value='5', default_value='')
=======================================================================
I0906 10:01:50.462639  1066 tcp_utils.cc:130] Successfully connected to 127.0.0.1:57383
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_selected_gpus', current_value='6', default_value='')
=======================================================================
I0906 10:01:50.467439  1067 tcp_utils.cc:130] Successfully connected to 127.0.0.1:57383
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_selected_gpus', current_value='1', default_value='')
=======================================================================
I0906 10:01:50.480422  1062 tcp_utils.cc:130] Successfully connected to 127.0.0.1:57383
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_selected_gpus', current_value='7', default_value='')
=======================================================================
I0906 10:01:50.483410  1068 tcp_utils.cc:130] Successfully connected to 127.0.0.1:57383
I0906 10:01:53.359290  1065 tcp_utils.cc:130] Successfully connected to 127.0.0.1:57383
W0906 10:01:53.597221  1061 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 12.1, Runtime API Version: 12.0
W0906 10:01:53.598129  1061 gpu_resources.cc:149] device: 0, cuDNN Version: 8.9.
W0906 10:01:53.644883  1063 gpu_resources.cc:119] Please NOTE: device: 2, GPU Compute Capability: 7.0, Driver API Version: 12.1, Runtime API Version: 12.0
W0906 10:01:53.645078  1064 gpu_resources.cc:119] Please NOTE: device: 3, GPU Compute Capability: 7.0, Driver API Version: 12.1, Runtime API Version: 12.0
W0906 10:01:53.645150  1067 gpu_resources.cc:119] Please NOTE: device: 6, GPU Compute Capability: 7.0, Driver API Version: 12.1, Runtime API Version: 12.0
W0906 10:01:53.645287  1062 gpu_resources.cc:119] Please NOTE: device: 1, GPU Compute Capability: 7.0, Driver API Version: 12.1, Runtime API Version: 12.0
W0906 10:01:53.645364  1068 gpu_resources.cc:119] Please NOTE: device: 7, GPU Compute Capability: 7.0, Driver API Version: 12.1, Runtime API Version: 12.0
W0906 10:01:53.645452  1066 gpu_resources.cc:119] Please NOTE: device: 5, GPU Compute Capability: 7.0, Driver API Version: 12.1, Runtime API Version: 12.0
W0906 10:01:53.645581  1065 gpu_resources.cc:119] Please NOTE: device: 4, GPU Compute Capability: 7.0, Driver API Version: 12.1, Runtime API Version: 12.0
W0906 10:01:53.646347  1063 gpu_resources.cc:149] device: 2, cuDNN Version: 8.9.
W0906 10:01:53.646521  1067 gpu_resources.cc:149] device: 6, cuDNN Version: 8.9.
W0906 10:01:53.646524  1064 gpu_resources.cc:149] device: 3, cuDNN Version: 8.9.
W0906 10:01:53.646550  1062 gpu_resources.cc:149] device: 1, cuDNN Version: 8.9.
W0906 10:01:53.646638  1068 gpu_resources.cc:149] device: 7, cuDNN Version: 8.9.
W0906 10:01:53.646798  1065 gpu_resources.cc:149] device: 4, cuDNN Version: 8.9.
W0906 10:01:53.646922  1066 gpu_resources.cc:149] device: 5, cuDNN Version: 8.9.
I0906 10:02:02.828073  1205 tcp_store.cc:273] receive shutdown event and so quit from MasterDaemon run loop
PaddlePaddle works well on 8 GPUs.
PaddlePaddle is installed successfully! Let's start deep learning with PaddlePaddle now.

$./build/all_reduce_perf -b 8 -e 256M -f 2 -g 8

# nThread 1 nGpus 8 minBytes 8 maxBytes 268435456 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   1240 on o4xu9khost480shared240-465n4 device  0 [0x1b] Tesla V100-SXM2-32GB
#  Rank  1 Group  0 Pid   1240 on o4xu9khost480shared240-465n4 device  1 [0x1c] Tesla V100-SXM2-32GB
#  Rank  2 Group  0 Pid   1240 on o4xu9khost480shared240-465n4 device  2 [0x3d] Tesla V100-SXM2-32GB
#  Rank  3 Group  0 Pid   1240 on o4xu9khost480shared240-465n4 device  3 [0x3e] Tesla V100-SXM2-32GB
#  Rank  4 Group  0 Pid   1240 on o4xu9khost480shared240-465n4 device  4 [0xb1] Tesla V100-SXM2-32GB
#  Rank  5 Group  0 Pid   1240 on o4xu9khost480shared240-465n4 device  5 [0xb2] Tesla V100-SXM2-32GB
#  Rank  6 Group  0 Pid   1240 on o4xu9khost480shared240-465n4 device  6 [0xdb] Tesla V100-SXM2-32GB
#  Rank  7 Group  0 Pid   1240 on o4xu9khost480shared240-465n4 device  7 [0xdc] Tesla V100-SXM2-32GB
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           8             2     float     sum      -1    34.65    0.00    0.00      0    34.24    0.00    0.00      0
          16             4     float     sum      -1    33.73    0.00    0.00      0    33.44    0.00    0.00      0
          32             8     float     sum      -1    33.80    0.00    0.00      0    33.66    0.00    0.00      0
          64            16     float     sum      -1    33.56    0.00    0.00      0    33.44    0.00    0.00      0
         128            32     float     sum      -1    34.20    0.00    0.01      0    33.80    0.00    0.01      0
         256            64     float     sum      -1    33.84    0.01    0.01      0    33.46    0.01    0.01      0
         512           128     float     sum      -1    33.44    0.02    0.03      0    33.32    0.02    0.03      0
        1024           256     float     sum      -1    33.31    0.03    0.05      0    33.40    0.03    0.05      0
        2048           512     float     sum      -1    33.61    0.06    0.11      0    33.50    0.06    0.11      0
        4096          1024     float     sum      -1    34.19    0.12    0.21      0    33.68    0.12    0.21      0
        8192          2048     float     sum      -1    33.95    0.24    0.42      0    33.95    0.24    0.42      0
       16384          4096     float     sum      -1    34.31    0.48    0.84      0    33.97    0.48    0.84      0
       32768          8192     float     sum      -1    34.40    0.95    1.67      0    34.33    0.95    1.67      0
       65536         16384     float     sum      -1    36.17    1.81    3.17      0    36.56    1.79    3.14      0
      131072         32768     float     sum      -1    40.79    3.21    5.62      0    40.83    3.21    5.62      0
      262144         65536     float     sum      -1    49.29    5.32    9.31      0    49.11    5.34    9.34      0
      524288        131072     float     sum      -1    57.44    9.13   15.97      0    57.54    9.11   15.94      0
     1048576        262144     float     sum      -1    79.43   13.20   23.10      0    78.56   13.35   23.36      0
     2097152        524288     float     sum      -1    105.4   19.90   34.83      0    105.3   19.91   34.85      0
     4194304       1048576     float     sum      -1    152.8   27.45   48.03      0    151.7   27.64   48.37      0
     8388608       2097152     float     sum      -1    253.5   33.09   57.91      0    250.8   33.44   58.52      0
    16777216       4194304     float     sum      -1    293.4   57.19  100.08      0    293.0   57.25  100.20      0
    33554432       8388608     float     sum      -1    503.5   66.64  116.63      0    503.0   66.71  116.74      0
    67108864      16777216     float     sum      -1    935.5   71.73  125.53      0    938.3   71.52  125.17      0
   134217728      33554432     float     sum      -1   1804.3   74.39  130.18      0   1807.3   74.27  129.97      0
   268435456      67108864     float     sum      -1   3574.2   75.10  131.43      0   3563.4   75.33  131.83      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 30.9912 
#

$python ppdet/modeling/tests/test_architectures.py
W0906 10:04:16.151448  1277 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 12.1, Runtime API Version: 12.0
W0906 10:04:16.152552  1277 gpu_resources.cc:149] device: 0, cuDNN Version: 8.9.
.......
----------------------------------------------------------------------
Ran 7 tests in 1.223s

OK

[Command that reproduces the error] $python -m paddle.distributed.launch --gpus 0,1,2,3,4,5,6,7 tools/train.py -c configs/rtdetr/rtdetr_focalnet_L_384_3x_coco.yml --fleet --eval

W0906 10:07:32.140421  1380 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 12.1, Runtime API Version: 12.0
W0906 10:07:32.141347  1380 gpu_resources.cc:149] device: 0, cuDNN Version: 8.9.
[2023-09-06 10:07:47,273] [    INFO] topology.py:266 - Total 1 data comm group(s) create successfully!
[2023-09-06 10:07:47,274] [    INFO] topology.py:266 - Total 8 model comm group(s) create successfully!
[2023-09-06 10:07:47,274] [    INFO] topology.py:266 - Total 8 pipe comm group(s) create successfully!
[2023-09-06 10:07:47,275] [    INFO] topology.py:266 - Total 8 sharding comm group(s) create successfully!
[2023-09-06 10:07:47,277] [    INFO] topology.py:217 - HybridParallelInfo: rank_id: 0, mp_degree: 1, sharding_degree: 1, pp_degree: 1, dp_degree: 8, mp_group: [0],  sharding_group: [0], pp_group: [0], dp_group: [0, 1, 2, 3, 4, 5, 6, 7], check/clip group: [0]
loading annotations into memory...
Done (t=12.55s)
creating index...
index created!
[09/06 10:08:21] ppdet.data.source.coco WARNING: Found an invalid bbox in annotations: im_id: 200365, area: 0.0 x1: 296.65, y1: 388.33, x2: 297.67999999999995, y2: 388.33.
[09/06 10:08:58] ppdet.data.source.coco WARNING: Found an invalid bbox in annotations: im_id: 550395, area: 0.0 x1: 9.98, y1: 188.56, x2: 15.52, y2: 188.56.
[09/06 10:09:02] ppdet.data.source.coco INFO: Load [117266 samples valid, 1021 samples invalid] in file dataset/coco/annotations/instances_train2017.json.

 in set_value
    assert (
AssertionError: Variable dtype not match, Variable [ conv2d_0.w_0 ] need tensor with dtype paddle.float32  but load tensor with dtype paddle.float16
I0906 10:09:07.606163  1532 tcp_store.cc:273] receive shutdown event and so quit from MasterDaemon run loop
LAUNCH INFO 2023-09-06 10:09:08,999 Pod failed
[2023-09-06 10:09:08,999] [    INFO] controller.py:115 - Pod failed
LAUNCH ERROR 2023-09-06 10:09:08,999 Container failed !!!
Container rank 2 status failed cmd ['/home/u3148947/2023-07-31_PaddleDetection/ppdet/bin/python', '-u', 'tools/train.py', '-c', 'configs/rtdetr/rtdetr_focalnet_L_384_3x_coco.yml', '--fleet', '--eval'] code 1 log log/workerlog.2 

[2023-09-06 10:09:09,001] [    INFO] controller.py:117 - ------------------------- ERROR LOG DETAIL -------------------------
23-09-06 10:07:47,275] [    INFO] topology.py:266 - Total 8 sharding comm group(s) create successfully!
[2023-09-06 10:07:47,277] [    INFO] topology.py:217 - HybridParallelInfo: rank_id: 2, mp_degree: 1, sharding_degree: 1, pp_degree: 1, dp_degree: 8, mp_group: [2],  sharding_group: [2], pp_group: [2], dp_group: [0, 1, 2, 3, 4, 5, 6, 7], check/clip group: [2]
loading annotations into memory...
Done (t=12.99s)
creating index...
index created!
Found an invalid bbox in annotations: im_id: 200365, area: 0.0 x1: 296.65, y1: 388.33, x2: 297.67999999999995, y2: 388.33.
Found an invalid bbox in annotations: im_id: 550395, area: 0.0 x1: 9.98, y1: 188.56, x2: 15.52, y2: 188.56.

 in set_value
    assert (
AssertionError: Variable dtype not match, Variable [ conv2d_0.w_0 ] need tensor with dtype paddle.float32  but load tensor with dtype paddle.float16
LAUNCH INFO 2023-09-06 10:09:09,805 Exit code 1
[2023-09-06 10:09:09,805] [    INFO] controller.py:149 - Exit code 1

Bug description confirmation

Are you willing to submit a PR?

MINGtoMING commented 1 year ago

@vscv It looks like this is because you did not enable AMP (mixed-precision) mode.

vscv commented 1 year ago

@MINGtoMING Thanks for the tip. I just added the --amp flag, but the error stays the same.

MINGtoMING commented 1 year ago

@vscv I tried again and found the cause: when training rtdetr_focalnet_L_384_3x_coco, the FocalNet (focalnet_L_384_22k_fl4) pretrained weights are loaded first. That checkpoint is stored in float16, and Paddle does not support loading float16 weights directly, which is what triggers the error. Only after that step would it load the rtdetr_focalnet_L_384_3x weights pretrained on Objects365, and those load normally.

Workaround: do not load the FocalNet (focalnet_L_384_22k_fl4) pretrained weights. Note that setting FocalNet::pretrained=None does not help, because the float16 weights are still downloaded from the default URL and loaded automatically. Instead, point FocalNet::pretrained directly at the URL of the rtdetr_focalnet_L_384_3x weights pretrained on Objects365, as in the attached screenshot.
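In config terms, the workaround described above amounts to overriding the backbone's pretrained path. A minimal sketch, assuming PaddleDetection's usual YAML config layout; the URL below is a placeholder for the actual Objects365 weights link, which is not reproduced in this thread:

```yaml
# Override sketch for configs/rtdetr/rtdetr_focalnet_L_384_3x_coco.yml.
# Point `pretrained` at a float32 checkpoint instead of letting the
# default float16 focalnet_L_384_22k_fl4 weights be downloaded.
FocalNet:
  pretrained: https://example.com/rtdetr_focalnet_L_384_3x_obj365.pdparams  # placeholder URL
```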

MINGtoMING commented 1 year ago

I spot-checked the pretrained weights of several other FocalNet scales and they are all float32; focalnet_L_384_22k_fl4 may be the only one stored in float16.
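An alternative offline workaround is to cast the float16 checkpoint to float32 before training. The sketch below uses plain numpy arrays to stand in for a loaded state dict (in practice you would load the `.pdparams` file with `paddle.load`, cast it, and save it back with `paddle.save`); the parameter names are illustrative:

```python
import numpy as np

def cast_fp16_to_fp32(state_dict):
    """Return a copy of a checkpoint dict with float16 arrays cast to float32."""
    return {name: (arr.astype(np.float32) if arr.dtype == np.float16 else arr)
            for name, arr in state_dict.items()}

# Toy stand-in for the focalnet_L_384_22k_fl4 checkpoint.
ckpt = {
    "conv2d_0.w_0": np.zeros((64, 3, 7, 7), dtype=np.float16),
    "norm.bias": np.zeros((64,), dtype=np.float32),
}
fixed = cast_fp16_to_fp32(ckpt)
assert fixed["conv2d_0.w_0"].dtype == np.float32  # float16 weight was cast
assert fixed["norm.bias"].dtype == np.float32     # float32 weight untouched
```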

vscv commented 1 year ago

@MINGtoMING Thanks a lot for clearing this up.

JiFfeng-Yu commented 1 year ago

I hit this problem after making that change, and I cannot find a fix anywhere: TypeError: __init__() got an unexpected keyword argument 'query_pos_head_inv_sig'