PaddlePaddle / Paddle

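PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (core framework of 『飞桨』 PaddlePaddle: high-performance single-machine and distributed training and cross-platform deployment for deep learning & machine learning)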
http://www.paddlepaddle.org/
Apache License 2.0
21.66k stars 5.44k forks

Request: please upgrade the CUDA version on AI Studio; paddle's concat seems to have a problem there #41861

Closed Wei-JL closed 1 year ago

Wei-JL commented 2 years ago

For details, see this issue: https://github.com/PaddlePaddle/Paddle/issues/41855

paddle-bot-old[bot] commented 2 years ago

Hi! We've received your issue; please be patient while waiting for a response. We will arrange for technicians to answer your questions as soon as possible. Please check again that you have provided a clear problem description, reproduction code, environment and version details, and error messages. You may also check the official API docs, the FAQ, past GitHub Issues, and the AI community for an answer. Have a nice day!

Wei-JL commented 2 years ago
Epoch:36/100
Total Loss: 9.121 || Val Loss: 5.917 
Start Train
Epoch 37/100:  14%|▉      | 4/29 [00:07<00:48,  1.96s/it, loss=6.42, lr=0.00094]
Failing tensor: Tensor(shape=[1, 4], dtype=float32, place=CUDAPlace(0), stop_gradient=True,
       [[0.        , 0.        , 0.21279770, 0.        ]]) ==========tensor shape: [1, 4]

Traceback (most recent call last):
  File "train.py", line 252, in <module>
    epoch_step, epoch_step_val, gen, gen_val, end_epoch, Cuda)
  File "/home/aistudio/yolox_paddle/utils/utils_fit.py", line 39, in fit_one_epoch
    loss_value = yolo_loss(outputs, targets)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 917, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 907, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/home/aistudio/yolox_paddle/nets/yolo_training.py", line 146, in forward
    return self.get_losses(x_shifts, y_shifts, expanded_strides, labels, paddle.concat(outputs, 1))
  File "/home/aistudio/yolox_paddle/nets/yolo_training.py", line 252, in get_losses
    reg_targets, reg_targets_bool = concat_axis(reg_targets)
  File "/home/aistudio/yolox_paddle/nets/yolo_training.py", line 55, in concat_axis
    ten_tem = paddle.concat([ten_tem, ten], axis=0)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/tensor/manipulation.py", line 345, in concat
    return paddle.fluid.layers.concat(input=x, axis=axis, name=name)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/layers/tensor.py", line 327, in concat
    return _C_ops.concat(input, 'axis', axis)
ValueError: (InvalidArgument) The shape of input[0] and input[1] is expected to be equal.But received input[0]'s shape = [112, 4], input[1]'s shape = [4].
  [Hint: Expected inputs_dims[i].size() == out_dims.size(), but received inputs_dims[i].size():1 != out_dims.size():2.] (at /paddle/paddle/fluid/operators/concat_op.h:40)
  [operator < concat > error]

The third line of the log shows shape=[1, 4]. After paddle.concat is called, it automatically becomes shape = [4], which makes the whole program fail with: ValueError: (InvalidArgument) The shape of input[0] and input[1] is expected to be equal.But received input[0]'s shape = [112, 4], input[1]'s shape = [4].

zhangting2020 commented 2 years ago

I've contacted the relevant colleagues; please wait a moment.

zhangting2020 commented 2 years ago

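First check the paddle version in your local environment, and also check which version you selected when creating the project on AI Studio, to see whether the two match. If they don't, specify the same paddle version as your local environment when creating the AI Studio project, then rerun and see.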

Wei-JL commented 2 years ago

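> First check the paddle version in your local environment, and also check which version you selected when creating the project on AI Studio, to see whether the two match. If they don't, specify the same paddle version as your local environment when creating the AI Studio project, then rerun and see.

My local environment is CUDA 11.3, cuDNN 8.2.1, paddle-gpu 2.2.2, on Windows 11. The AI Studio environment is CUDA 10.1, cuDNN ?, paddle-gpu 2.2.2.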

zhangting2020 commented 2 years ago

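From your description the versions match, and this error has nothing to do with the CUDA version. Are you sure the two copies of the code are exactly the same? The same version of Paddle should not produce different results.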

Wei-JL commented 2 years ago

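> From your description the versions match, and this error has nothing to do with the CUDA version. Are you sure the two copies of the code are exactly the same? The same version of Paddle should not produce different results.

Yes, I copied it up directly. I'll upload another copy and try again.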

Wei-JL commented 2 years ago

The problem seems to be that after concat is called, shape=[1, 4] is automatically turned into shape=[4], and then the error is raised.

zhangting2020 commented 2 years ago
import paddle
paddle.__version__ 
paddle.utils.run_check()

Check the version information this way.
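
A bare `paddle.__version__` expression is only echoed by an interactive session when it is the last statement, so in a notebook cell followed by another call it may show nothing. A minimal sketch that prints the version explicitly; the helper name `report_paddle_version` is hypothetical, and the `ImportError` fallback is only there so the snippet also runs where Paddle is not installed:

```python
def report_paddle_version():
    """Return the installed Paddle version string, or None if Paddle is missing."""
    try:
        import paddle
    except ImportError:
        return None
    return paddle.__version__


version = report_paddle_version()
# print() makes the value visible in scripts and in any notebook cell position.
print("paddle version:", version)
```

`paddle.utils.run_check()` from the comment above can still be called afterwards as a separate statement.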

zhangting2020 commented 2 years ago

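The tensor passed to concat has shape=[1,4], right? On AI Studio, you could try writing a test script that only does a concat, directly concatenating tensors of shape [1,4] and [N, 4], and see whether it raises the same error.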
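
A standalone test of that kind could look roughly like the sketch below. It assumes N = 112 (taken from the error message) and zero-filled tensors; the function name `concat_result_shape` is made up for illustration, and the import guard only lets the snippet run where Paddle is absent.

```python
try:
    import paddle
except ImportError:  # lets the snippet run where Paddle is not installed
    paddle = None


def concat_result_shape(n=112):
    """Concatenate an [n, 4] tensor with a [1, 4] tensor along axis 0."""
    if paddle is None:
        return None
    a = paddle.zeros([n, 4], dtype="float32")
    b = paddle.zeros([1, 4], dtype="float32")
    out = paddle.concat([a, b], axis=0)
    return list(out.shape)


# With Paddle installed, two rank-2 inputs should concatenate without error.
print(concat_result_shape())
```

If this runs cleanly on AI Studio, the rank-1 `[4]` input reported in the traceback is more likely produced somewhere before the `paddle.concat` call.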

Wei-JL commented 2 years ago

image

> import paddle
> paddle.__version__
> paddle.utils.run_check()
>
> Check the version information this way.

Wei-JL commented 2 years ago

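> The tensor passed to concat has shape=[1,4], right? On AI Studio, you could try writing a test script that only does a concat, directly concatenating tensors of shape [1,4] and [N, 4], and see whether it raises the same error.

I've tried that and it doesn't raise an error; it's in my image above.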

image

zhangting2020 commented 2 years ago

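Then it doesn't look like a problem with concat itself. In your model code, print the shapes of all the concat inputs. For example, for concat(x=[a,b,c]), print the shapes of a, b, and c before the concat and take a look.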
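
One way to do that check is sketched below. The helper `find_rank_mismatches` is hypothetical and works on plain shape lists, so it does not depend on Paddle; it flags every input whose number of dimensions differs from the first one, which is the condition the concat error message complains about.

```python
def find_rank_mismatches(shapes):
    """Return (index, shape) pairs whose number of dimensions differs from shapes[0]."""
    if not shapes:
        return []
    expected_rank = len(shapes[0])
    return [(i, list(s)) for i, s in enumerate(shapes) if len(s) != expected_rank]


# Shapes taken from the error message: a [112, 4] input followed by a [4] input.
shapes = [[112, 4], [4]]
for index, shape in find_rank_mismatches(shapes):
    print(f"input[{index}] has shape {shape}, expected rank {len(shapes[0])}")
```

In the model code this would be called right before the concat, for example with `[list(t.shape) for t in tensors]`, assuming each element of `tensors` exposes a `.shape` attribute.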

Wei-JL commented 2 years ago

https://github.com/PaddlePaddle/Paddle/issues/41855 The first comment there contains all the logs. The problem is at the 11th input, and the 11th one happens to be [1, 4].

Wei-JL commented 2 years ago

ValueError: (InvalidArgument) The shape of input[0] and input[10] is expected to be equal.But received input[0]'s shape = [171, 4], input[10]'s shape = [4]. This is the error message. input[10], counting from 0, is the 11th input. You can Ctrl+F for it: the 11th one is [1, 4]. image

zhangting2020 commented 2 years ago

Try saving all of the concat inputs to a file and uploading it, so we can check whether calling concat on its own with these inputs fails.
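
A minimal sketch of one way to save the inputs so that their nesting, and therefore their shapes, survive a round trip. It assumes the tensors have already been converted to nested Python lists (for a Paddle tensor, `t.numpy().tolist()` is one option); the file name and the helper names `dump_inputs`, `load_inputs`, and `infer_shape` are made up for illustration.

```python
import json


def infer_shape(nested):
    """Infer the shape of a regular nested list, e.g. [[0.0, 1.0]] -> [1, 2]."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        if not nested:
            break
        nested = nested[0]
    return shape


def dump_inputs(inputs, path):
    """Write a list of nested-list tensors to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(inputs, f)


def load_inputs(path):
    """Read the nested-list tensors back; the nesting is preserved as written."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

After loading, `[infer_shape(x) for x in load_inputs("concat_inputs.json")]` lists the shape of every saved input, so a `[4]` entry among `[N, 4]` entries is easy to spot.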

Wei-JL commented 2 years ago

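> Try saving all of the concat inputs to a file and uploading it, so we can check whether calling concat on its own with these inputs fails.

Done. Each tensor is written to the txt file as [[ ... ]]. Please give it a try: res.txt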

Wei-JL commented 2 years ago

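> > Try saving all of the concat inputs to a file and uploading it, so we can check whether calling concat on its own with these inputs fails.
>
> Done. Each tensor is written to the txt file as [[ ... ]]. Please give it a try: res.txt

Hello, I've uploaded it. Does the problem show up on your side?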

zhangting2020 commented 2 years ago

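In #41855, none of the tensor shapes you pasted are [112, 4] and [4], yet the error message you gave later is for [112,4] and [4]. I suspect that your local run simply never reached this batch of data. concat does not change the shape of its inputs in any way, so you need to run again locally and confirm whether any of your data has a shape of [4].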

ValueError: (InvalidArgument) The shape of input[0] and input[1] is expected to be equal.But received input[0]'s shape = [112, 4], input[1]'s shape = [4].
  [Hint: Expected inputs_dims[i].size() == out_dims.size(), but received inputs_dims[i].size():1 != out_dims.size():2.] (at /paddle/paddle/fluid/operators/concat_op.h:40)
  [operator < concat > error]
Wei-JL commented 2 years ago

112

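The "error message is for [112,4] and [4]" that you mention is no longer the same error as the big pile of logs above. Maybe it has been too long and my description got unclear. I've open-sourced this project; could you open it and see what the problem is?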

zhangting2020 commented 2 years ago

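My suggestion is to first train for at least 1 epoch locally, to make sure you have run through all the data in the dataset. Judging from your error message, the shape of a concat input really is wrong. I'm not sure whether your successful local run had already gone through 1 full epoch.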

Wei-JL commented 2 years ago

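> My suggestion is to first train for at least 1 epoch locally, to make sure you have run through all the data in the dataset. Judging from your error message, the shape of a concat input really is wrong. I'm not sure whether your successful local run had already gone through 1 full epoch.

I ran 15 epochs locally and there was indeed no problem. 15 epochs take about 20 minutes locally, with no errors.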

zhangting2020 commented 2 years ago

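Is the data on the two sides exactly the same? Running on AI Studio should essentially be no different from running locally, especially when the versions match, so this error is rather strange.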

Wei-JL commented 2 years ago

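> Is the data on the two sides exactly the same? Running on AI Studio should essentially be no different from running locally, especially when the versions match, so this error is rather strange.

Yes, I uploaded it from my local machine; it is the public Wuhan University dataset. I find it strange too. I spent a long time on this and tried all sorts of ways of writing the concat, even a for loop that concatenates the list one element at a time... That's why I uploaded it.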

Wei-JL commented 2 years ago

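> > Is the data on the two sides exactly the same? Running on AI Studio should essentially be no different from running locally, especially when the versions match, so this error is rather strange.
>
> Yes, I uploaded it from my local machine; it is the public Wuhan University dataset. I find it strange too. I spent a long time on this and tried all sorts of ways of writing the concat, even a for loop that concatenates the list one element at a time... That's why I uploaded it.

That's why I came to file an issue.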

zhangting2020 commented 2 years ago

On AI Studio, could you first run one epoch, find the data that corresponds to this [4], and remove it for now?

Wei-JL commented 2 years ago

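> On AI Studio, could you first run one epoch, find the data that corresponds to this [4], and remove it for now?

I printed them. None of the tensors in the list has shape=[4], but as soon as a shape=[1, 4] tensor appears it fails, because concat seems to automatically convert shape=[1, 4] into shape=[4].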

zhangting2020 commented 2 years ago

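Could you try installing the latest version on AI Studio? Follow this documentation and install the nightly build: https://www.paddlepaddle.org.cn/

Also, the uploaded res is just a string once it is read, which makes reproduction troublesome. Have you tried it yourself? You could parse these concat inputs first and then run concat on its own to see whether the error reproduces.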

Wei-JL commented 2 years ago

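> Could you try installing the latest version on AI Studio? Follow this documentation and install the nightly build: https://www.paddlepaddle.org.cn/
>
> Also, the uploaded res is just a string once it is read, which makes reproduction troublesome. Have you tried it yourself? You could parse these concat inputs first and then run concat on its own to see whether the error reproduces.

2.2.2 should be the latest version, right? I'll re-upload the code and try.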

zhangting2020 commented 2 years ago

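> > Could you try installing the latest version on AI Studio? Follow this documentation and install the nightly build: https://www.paddlepaddle.org.cn/ Also, the uploaded res is just a string once it is read, which makes reproduction troublesome. Have you tried it yourself? You could parse these concat inputs first and then run concat on its own to see whether the error reproduces.
>
> 2.2.2 should be the latest version, right? I'll re-upload the code and try.

The latest is the nightly build, which is the package built from the develop branch.

How did it go after re-uploading the code?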

Wei-JL commented 2 years ago

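> > > Could you try installing the latest version on AI Studio? Follow this documentation and install the nightly build: https://www.paddlepaddle.org.cn/ Also, the uploaded res is just a string once it is read, which makes reproduction troublesome. Have you tried it yourself? You could parse these concat inputs first and then run concat on its own to see whether the error reproduces.
> >
> > 2.2.2 should be the latest version, right? I'll re-upload the code and try.
>
> The latest is the nightly build, which is the package built from the develop branch.
>
> How did it go after re-uploading the code?

image

This is my local run with paddle. It runs for many epochs; I ran nearly 40 and then stopped it manually. But after copying it to BML again it fails, and this time the error is different.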

File "/home/aistudio/yolox_paddle/nets/yolo_training.py", line 282, in get_assignments
    y_shifts, total_num_anchors, num_gt)
  File "/home/aistudio/yolox_paddle/nets/yolo_training.py", line 529, in get_in_boxes_info
    is_in_boxes.astype("int").gather(idx, axis=1).astype("bool").logical_and(
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/math_op_patch.py", line 106, in astype
    return _C_ops.cast(self, 'in_dtype', self.dtype, 'out_dtype', dtype)
ValueError: (InvalidArgument) element count should be greater than 0, but received value is: 0.
  [Hint: Expected element_count > 0, but received element_count:0 <= 0:0.] (at /paddle/paddle/fluid/platform/gpu_launch_config.h:68)
  [operator < cast > error]
zhangting2020 commented 2 years ago

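What version does this print on AI Studio? In the earlier comment, the screenshot didn't seem to show any version information. Perhaps the command I gave at first didn't display properly; please run this command again and confirm the version. It just doesn't feel like the result of running the same version.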

import paddle
paddle.__version__ 
Wei-JL commented 2 years ago

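> What version does this print on AI Studio? In the earlier comment, the screenshot didn't seem to show any version information. Perhaps the command I gave at first didn't display properly; please run this command again and confirm the version. It just doesn't feel like the result of running the same version.
>
> import paddle
> paddle.__version__

As follows: image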

Wei-JL commented 2 years ago
W0420 16:36:03.499199  7090 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.1, Runtime API Version: 10.1
W0420 16:36:03.503062  7090 device_context.cc:465] device: 0, cuDNN Version: 7.6.
initialize network with normal astype
Load weights pretrain_models/pre.pdparams.
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/parallel.py:596: UserWarning: The program will return to single-card operation. Please check 1, whether you use spawn or fleetrun to start the program. 2, Whether it is a multi-card program. 3, Is the current environment multi-card.
  warnings.warn("The program will return to single-card operation. "
Start Train
Epoch 36/100:   0%|                       | 0/715 [00:00<?, ?it/s<class 'dict'>]/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/nn/layer/norm.py:653: UserWarning: When training, we now always track global mean and variance.
  "When training, we now always track global mean and variance.")
Epoch 36/100:  55%|██▋  | 392/715 [04:41<03:51,  1.39it/s, loss=8.11, lr=0.0001]
Traceback (most recent call last):
  File "train.py", line 248, in <module>
    epoch_step, epoch_step_val, gen, gen_val, end_epoch, Cuda)
  File "/home/aistudio/yolox_paddle/utils/utils_fit.py", line 39, in fit_one_epoch
    loss_value = yolo_loss(outputs, targets)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 917, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 907, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/home/aistudio/yolox_paddle/nets/yolo_training.py", line 145, in forward
    return self.get_losses(x_shifts, y_shifts, expanded_strides, labels, paddle.concat(outputs, 1))
  File "/home/aistudio/yolox_paddle/nets/yolo_training.py", line 222, in get_losses
    expanded_strides, x_shifts, y_shifts,
  File "<decorator-gen-287>", line 2, in get_assignments
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/base.py", line 351, in _decorate_function
    return func(*args, **kwargs)
  File "/home/aistudio/yolox_paddle/nets/yolo_training.py", line 282, in get_assignments
    y_shifts, total_num_anchors, num_gt)
  File "/home/aistudio/yolox_paddle/nets/yolo_training.py", line 529, in get_in_boxes_info
    is_in_boxes.astype("int").gather(idx, axis=1).astype("bool").logical_and(
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/math_op_patch.py", line 106, in astype
    return _C_ops.cast(self, 'in_dtype', self.dtype, 'out_dtype', dtype)
ValueError: (InvalidArgument) element count should be greater than 0, but received value is: 0.
  [Hint: Expected element_count > 0, but received element_count:0 <= 0:0.] (at /paddle/paddle/fluid/platform/gpu_launch_config.h:68)
  [operator < cast > error]

This is the complete error after re-uploading.

Wei-JL commented 2 years ago

https://aistudio.baidu.com/studio/project/partial/verify/3801194/3e187b1358cf478eb6eab382d0a932ac This is the project fork link. Go in, restart, and run everything, and you will see the error I posted. Thanks.

Wei-JL commented 2 years ago

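> What version does this print on AI Studio? In the earlier comment, the screenshot didn't seem to show any version information. Perhaps the command I gave at first didn't display properly; please run this command again and confirm the version. It just doesn't feel like the result of running the same version.
>
> import paddle
> paddle.__version__

Hello, I've been stuck on this problem for too long. Is there any way to solve it? It really does run locally, so why doesn't it work on BML? Most of the errors occur because some tensor ends up with a first dimension of 0 or 1.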
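
The `cast` error in the traceback reports an element count of 0, which matches the "first dimension of 0" case described here. A small sketch of a guard that could be placed before the failing call; `element_count` and `is_empty` are hypothetical helpers that only inspect a shape list, and whether skipping an empty tensor is the right response depends on the loss code, which this thread does not show.

```python
from functools import reduce
from operator import mul


def element_count(shape):
    """Number of elements implied by a shape, e.g. [0, 4] -> 0 and [1, 4] -> 4."""
    return reduce(mul, shape, 1)


def is_empty(shape):
    """True when a tensor with this shape holds no elements at all."""
    return element_count(shape) == 0


# A tensor with a zero-sized first dimension has no elements to cast or gather.
for shape in ([0, 4], [1, 4], [112, 4]):
    print(shape, "empty" if is_empty(shape) else "ok")
```

Logging the batch index whenever `is_empty(list(t.shape))` is true would also show which samples trigger the failure on the platform.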

zhangting2020 commented 2 years ago

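I asked colleagues on the AI Studio team, and they believe the platform environment should be fine. You mentioned above that after re-uploading the code the error changed. I still think the data or code uploaded to BML may differ from the local copy. Do you have a way to check that they are consistent?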
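
One possible consistency check is to hash every file on both sides and compare the results. The sketch below uses only the Python standard library; the helper names `file_sha256`, `build_manifest`, and `diff_manifests` are made up for illustration, and relative paths are normalized to `/` so that a manifest built on Windows can be compared with one built on Linux.

```python
import hashlib
import os


def file_sha256(path, chunk_size=1 << 20):
    """Hash one file in chunks so large dataset files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's relative path (with '/' separators) to its SHA-256 digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            manifest[rel] = file_sha256(full)
    return manifest


def diff_manifests(local, remote):
    """Return sorted relative paths that are missing on one side or differ in content."""
    return sorted(p for p in set(local) | set(remote) if local.get(p) != remote.get(p))
```

Each side would run `build_manifest` on its project directory and save the result, for example with `json.dump`; `diff_manifests` then lists the files to inspect. Text files may legitimately differ if line endings were converted during upload, so a reported difference is a starting point rather than proof of a problem.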

Wei-JL commented 2 years ago

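> I asked colleagues on the AI Studio team, and they believe the platform environment should be fine. You mentioned above that after re-uploading the code the error changed. I still think the data or code uploaded to BML may differ from the local copy. Do you have a way to check that they are consistent?

Hello, I suspected that too yesterday, so I also uploaded my local data yesterday, and the error is still the same, as follows: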

 File "/home/aistudio/yolox_paddle/nets/yolo_training.py", line 529, in get_in_boxes_info
    is_in_boxes.astype("int").gather(idx, axis=1).astype("bool").logical_and(
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/math_op_patch.py", line 106, in astype
    return _C_ops.cast(self, 'in_dtype', self.dtype, 'out_dtype', dtype)
ValueError: (InvalidArgument) element count should be greater than 0, but received value is: 0.
  [Hint: Expected element_count > 0, but received element_count:0 <= 0:0.] (at /paddle/paddle/fluid/platform/gpu_launch_config.h:68)
  [operator < cast > error]

The earlier concat error came from my belief at the time that concat internally turns shape=[1, 4] into [4]. This time I did not rewrite concat; I used paddle.concat directly.

paddle-bot[bot] commented 1 year ago

Since you haven't replied for more than a year, we have closed this issue/PR. If the problem is not solved or there is a follow-up, please reopen it at any time and we will continue to follow up.