【Hackathon 5th No.34】为 Paddle 新增 bitwise_right_shift / bitwise_right_shift_ / bitwise_left_shift / bitwise_left_shift_ API -part

PaddlePaddle / Paddle

PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice （『飞桨』核心框架，深度学习&机器学习高性能单机、分布式训练和跨平台部署）

http://www.paddlepaddle.org/

Apache License 2.0

21.66k stars 5.44k forks source link

【Hackathon 5th No.34】为 Paddle 新增 bitwise_right_shift / bitwise_right_shift_ / bitwise_left_shift / bitwise_left_shift_ API -part #58092

Closed cocoshe closed 4 months ago

cocoshe commented 7 months ago

PR types

New features

PR changes

APIs

Description

RFC：

https://github.com/PaddlePaddle/community/pull/665

更新RFC：

https://github.com/PaddlePaddle/community/pull/788

中文文档：

https://github.com/PaddlePaddle/docs/pull/6399

cocoshe commented 6 months ago

@xuxinyi389 可以抽空帮忙看看嘛？

为啥本地test可以过，但是在CI-PR3的环境确过不了呢？我好像复现不了这个bug，不知道从哪里可以入手本地： CI：https://xly.bce.baidu.com/paddlepaddle/paddle/newipipe/detail/9540790/job/24456843 感谢~

xuxinyi389 commented 6 months ago

@xuxinyi389 可以抽空帮忙看看嘛？

为啥本地test可以过，但是在CI-PR3的环境确过不了呢？我好像复现不了这个bug，不知道从哪里可以入手本地： CI：https://xly.bce.baidu.com/paddlepaddle/paddle/newipipe/detail/9540790/job/24456843 感谢~

你写的test在目前过的流水线好像都没运行过，应该不是环境的问题。你本地的测试通过，可以提交一份本地的测试报告。

cocoshe commented 6 months ago

@xuxinyi389 可以抽空帮忙看看嘛？为啥本地test可以过，但是在CI-PR3的环境确过不了呢？我好像复现不了这个bug，不知道从哪里可以入手本地： CI：https://xly.bce.baidu.com/paddlepaddle/paddle/newipipe/detail/9540790/job/24456843 感谢~

你写的test在目前过的流水线好像都没运行过，应该不是环境的问题。你本地的测试通过，可以提交一份本地的测试报告。

我看到在PR-CI-Windows-Inference里面是没运行到，但是在PR-CI-Py3里面应该是运行过但是失败了，我在上个commit中加入了一些print用来打印：两个输入、一个输出、一个reference，不太清楚为什么这个ci里面的输出错的离谱的感觉：比如这个样例，在我本地（编译后安装）：

CYU(NO~WH%4GEJL@ERC6CQ 看上去是正常的，但是下面是ci中计算的输出out是这样的这可能是什么原因呢？

xuxinyi389 commented 6 months ago

我之前遇到过静态图变量泄漏，因为没有用好静态图的上下文，存在同名变量导致fetch错。但是我看你的动态图测试也没通过。你的测试没有在任何一个CI上通过，应该和环境没关系，你自己再定位下错误。

cocoshe commented 6 months ago

我之前遇到过静态图变量泄漏，因为没有用好静态图的上下文，存在同名变量导致fetch错。但是我看你的动态图测试也没通过。你的测试没有在任何一个CI上通过，应该和环境没关系，你自己再定位下错误。

试了下是CPUPlace的时候，遇到需要broadcast的时候结果会出现问题（之前本地我都是开的GPU，broadcast正常，我上面将broadcast的case注释掉了，然后就可以过ci）我看了相关实现应该是在funcs::ElementwiseCompute

我看这里的ElementwiseCompute的描述

// It is a common CPU implementation to compute binary calculation with the // support of broadcast. Note: // 1. CPU implementation cannot support the case when x needs broadcast, thus // this function need to be called with XxxFunctor and XxxInverseFunctor, // like AddFunctor and InverseAddFunctor. // 2. The corresponding GPU implementation supports all the broadcast cases, // thus there is no need to define and call with XxxInverseFunctor. // TODO(liuyiqun): optimize the CPU implementation to support all broadcast // cases and avoid the need of XxxInverseFunctor.

bitwise的broadcast应该是在CPU和GPU下有不同的实现？理论上所有bitwise算子应该都进行了自动的broadcast，既然在GPU下的broadcast样例没问题而在CPU下的broadcast样例有问题，那么我感觉我可能是这个地方出了问题。为这次和以后可能遇到的类似问题，我想请教一下这种python和cpp调用混合在一起的情况下，一般是如何debug的呢？（我一直都是纯python的时候用pdb，然后纯cpp的时候用gdb。但是python中调用_c_ops的时候不知道如何能进去看代码执行情况，所以想问问这种有没有什么推荐的debug方法呢？）

xuxinyi389 commented 6 months ago

我之前遇到过静态图变量泄漏，因为没有用好静态图的上下文，存在同名变量导致fetch错。但是我看你的动态图测试也没通过。你的测试没有在任何一个CI上通过，应该和环境没关系，你自己再定位下错误。

试了下是CPUPlace的时候，遇到需要broadcast的时候结果会出现问题（之前本地我都是开的GPU，broadcast正常，我上面将broadcast的case注释掉了，然后就可以过ci）我看了相关实现应该是在funcs::ElementwiseCompute

我看这里的ElementwiseCompute的描述

// It is a common CPU implementation to compute binary calculation with the // support of broadcast. Note: // 1. CPU implementation cannot support the case when x needs broadcast, thus // this function need to be called with XxxFunctor and XxxInverseFunctor, // like AddFunctor and InverseAddFunctor. // 2. The corresponding GPU implementation supports all the broadcast cases, // thus there is no need to define and call with XxxInverseFunctor. // TODO(liuyiqun): optimize the CPU implementation to support all broadcast // cases and avoid the need of XxxInverseFunctor.

bitwise的broadcast应该是在CPU和GPU下有不同的实现？理论上所有bitwise算子应该都进行了自动的broadcast，既然在GPU下的broadcast样例没问题而在CPU下的broadcast样例有问题，那么我感觉我可能是这个地方出了问题。为这次和以后可能遇到的类似问题，我想请教一下这种python和cpp调用混合在一起的情况下，一般是如何debug的呢？（我一直都是纯python的时候用pdb，然后纯cpp的时候用gdb。但是python中调用_c_ops的时候不知道如何能进去看代码执行情况，所以想问问这种有没有什么推荐的debug方法呢？）

在调用逻辑较为复杂时，为了定位出bug的代码行数，C++的调试可以在源码内添加一些日志输出，然后设置GLOG_v的级别，可以更直观的定位。

luotao1 commented 6 months ago

请通过下覆盖率测试

cocoshe commented 5 months ago

@xuxinyi389 查明白之前的问题了，是在CPU下x和y在broadcast的时候，会将x和y中len(shape)更大的作为x，更小的作为y，然后再做broadcast，所以当len(x.shape)比len(y.shape)小的时候，原本的x>>y就变成了y>>x，需要补充InverseXXX来解决（看上去是个历史遗留todo，未来应该会适应）

然后现在遇到一个问题，就是支持的类型是uint8, int8, int16, int32, int64，但是之前调研的JAX中是组合cast和算术位移来实现逻辑位移，然而它的基础类型是numpy，支持uint16, uint32, uint64这种，但是paddle目前cast api并不支持uint16, uint32, uint64与int16, int32, int64的转换；我也想过用paddle.where来给有符号负数的位置强行转换为无符号数，也就是符号位先置零（主要是要解决有符号数逻辑右移的情况），然后shift后手动放置符号位，但是发现paddle.where不支持int16，所以感觉在python层实现可能还需要更强的类型支持

然后我尝试在cpp层，想要在底层bitwise计算的时候加入一个bool判断算术位移还是逻辑位移，但是发现funcs::ElementwiseCompute写的比较特定，不太方便拓展出一个bool参数来计算

想问一下有什么建议嘛？emmm感觉有点头大（思考算术位移的必要性emmmm

luotao1 commented 5 months ago

但是paddle目前cast api并不支持uint16, uint32, uint64与int16, int32, int64的转换；我也想过用paddle.where来给有符号负数的位置强行转换为无符号数，也就是符号位先置零（主要是要解决有符号数逻辑右移的情况），然后shift后手动放置符号位，但是发现paddle.where不支持int16，所以感觉在python层实现可能还需要更强的类型支持

是否可以给cast api和paddle.where 补充下类型支持？

cocoshe commented 5 months ago

但是paddle目前cast api并不支持uint16, uint32, uint64与int16, int32, int64的转换；我也想过用paddle.where来给有符号负数的位置强行转换为无符号数，也就是符号位先置零（主要是要解决有符号数逻辑右移的情况），然后shift后手动放置符号位，但是发现paddle.where不支持int16，所以感觉在python层实现可能还需要更强的类型支持

是否可以给cast api和paddle.where 补充下类型支持？

ok我尝试一下~

cocoshe commented 5 months ago

但是paddle目前cast api并不支持uint16, uint32, uint64与int16, int32, int64的转换；我也想过用paddle.where来给有符号负数的位置强行转换为无符号数，也就是符号位先置零（主要是要解决有符号数逻辑右移的情况），然后shift后手动放置符号位，但是发现paddle.where不支持int16，所以感觉在python层实现可能还需要更强的类型支持

是否可以给cast api和paddle.where 补充下类型支持？

看了一下paddle中似乎不支持uint16, uint32, uint64的dtype吗？之前一些类型拓展我看到普遍都是添加对一些复数和bfloat这种类型的支持，其他同学补充的unsigned支持也是唯一支持的unsigned类型(uint8)。想问下想要支持其他unsigned类型有什么可以参考的例子吗？

>>> paddle.to_tensor([1,2,3], dtype=paddle.int16)
Tensor(shape=[3], dtype=int16, place=Place(cpu), stop_gradient=True,
       [1, 2, 3])
>>> paddle.to_tensor([1,2,3], dtype=paddle.uint16)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'paddle' has no attribute 'uint16'. Did you mean: 'int16'?
>>> paddle.to_tensor([1,2,3], dtype=paddle.uint32)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'paddle' has no attribute 'uint32'. Did you mean: 'int32'?
>>> paddle.to_tensor([1,2,3], dtype=paddle.uint64)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'paddle' has no attribute 'uint64'. Did you mean: 'int64'?

xuxinyi389 commented 5 months ago

@xuxinyi389 查明白之前的问题了，是在CPU下x和y在broadcast的时候，会将x和y中len(shape)更大的作为x，更小的作为y，然后再做broadcast，所以当len(x.shape)比len(y.shape)小的时候，原本的x>>y就变成了y>>x，需要补充InverseXXX来解决（看上去是个历史遗留todo，未来应该会适应）

然后现在遇到一个问题，就是支持的类型是uint8, int8, int16, int32, int64，但是之前调研的JAX中是组合cast和算术位移来实现逻辑位移，然而它的基础类型是numpy，支持uint16, uint32, uint64这种，但是paddle目前cast api并不支持uint16, uint32, uint64与int16, int32, int64的转换；我也想过用paddle.where来给有符号负数的位置强行转换为无符号数，也就是符号位先置零（主要是要解决有符号数逻辑右移的情况），然后shift后手动放置符号位，但是发现paddle.where不支持int16，所以感觉在python层实现可能还需要更强的类型支持

然后我尝试在cpp层，想要在底层bitwise计算的时候加入一个bool判断算术位移还是逻辑位移，但是发现funcs::ElementwiseCompute写的比较特定，不太方便拓展出一个bool参数来计算

想问一下有什么建议嘛？emmm感觉有点头大（思考算术位移的必要性emmmm

paddle cast如果支持 uint 向 int转换，在超出int表示范围时会损失精度，这样的转换是不合理的。

cocoshe commented 5 months ago

@xuxinyi389 查明白之前的问题了，是在CPU下x和y在broadcast的时候，会将x和y中len(shape)更大的作为x，更小的作为y，然后再做broadcast，所以当len(x.shape)比len(y.shape)小的时候，原本的x>>y就变成了y>>x，需要补充InverseXXX来解决（看上去是个历史遗留todo，未来应该会适应）然后现在遇到一个问题，就是支持的类型是uint8, int8, int16, int32, int64，但是之前调研的JAX中是组合cast和算术位移来实现逻辑位移，然而它的基础类型是numpy，支持uint16, uint32, uint64这种，但是paddle目前cast api并不支持uint16, uint32, uint64与int16, int32, int64的转换；我也想过用paddle.where来给有符号负数的位置强行转换为无符号数，也就是符号位先置零（主要是要解决有符号数逻辑右移的情况），然后shift后手动放置符号位，但是发现paddle.where不支持int16，所以感觉在python层实现可能还需要更强的类型支持然后我尝试在cpp层，想要在底层bitwise计算的时候加入一个bool判断算术位移还是逻辑位移，但是发现funcs::ElementwiseCompute写的比较特定，不太方便拓展出一个bool参数来计算想问一下有什么建议嘛？emmm感觉有点头大（思考算术位移的必要性emmmm

paddle cast如果支持 uint 向 int转换，在超出int表示范围时会损失精度，这样的转换是不合理的。

辛苦review~，嗯嗯有道理的，那这样看来似乎竞品都没有可以参考的实现方案。我感觉想要复用大部分bitwise那一套代码的话，是不是可以干脆把算术算数位移和逻辑位移拆开（因为额外的bool参数似乎不太方便融合到bitwise那一套体系中）。比如原来的bitwise_right_shift(x, y, is_arithmetic)拆成bitwise_right_arithmetic_shift(x, y)和bitwise_right_logic_shift(x, y)，然后最后在

https://github.com/PaddlePaddle/Paddle/pull/58092/files#diff-9852da3163fbef8a980df35d58e40d115076ed3eb511b020a76963c03d68a2c6R60-R63

template <typename T>
struct BitwiseRightShiftFunctor {
  HOSTDEVICE T operator()(const T a, const T b) const { return a >> b; }
};

这个地方，分别对uint8和其他int类型进行处理，主要针对有符号数且为负数，且进行算术右移的时候，存一下符号位然后符号位置零，shift之后放置符号位（大致这样）

或者有什么其他比较好的实现方法嘛？希望大佬指点指点~

xuxinyi389 commented 5 months ago

@xuxinyi389 查明白之前的问题了，是在CPU下x和y在broadcast的时候，会将x和y中len(shape)更大的作为x，更小的作为y，然后再做broadcast，所以当len(x.shape)比len(y.shape)小的时候，原本的x>>y就变成了y>>x，需要补充InverseXXX来解决（看上去是个历史遗留todo，未来应该会适应）然后现在遇到一个问题，就是支持的类型是uint8, int8, int16, int32, int64，但是之前调研的JAX中是组合cast和算术位移来实现逻辑位移，然而它的基础类型是numpy，支持uint16, uint32, uint64这种，但是paddle目前cast api并不支持uint16, uint32, uint64与int16, int32, int64的转换；我也想过用paddle.where来给有符号负数的位置强行转换为无符号数，也就是符号位先置零（主要是要解决有符号数逻辑右移的情况），然后shift后手动放置符号位，但是发现paddle.where不支持int16，所以感觉在python层实现可能还需要更强的类型支持然后我尝试在cpp层，想要在底层bitwise计算的时候加入一个bool判断算术位移还是逻辑位移，但是发现funcs::ElementwiseCompute写的比较特定，不太方便拓展出一个bool参数来计算想问一下有什么建议嘛？emmm感觉有点头大（思考算术位移的必要性emmmm

paddle cast如果支持 uint 向 int转换，在超出int表示范围时会损失精度，这样的转换是不合理的。

辛苦review~，嗯嗯有道理的，那这样看来似乎竞品都没有可以参考的实现方案。我感觉想要复用大部分bitwise那一套代码的话，是不是可以干脆把算术算数位移和逻辑位移拆开（因为额外的bool参数似乎不太方便融合到bitwise那一套体系中）。比如原来的bitwise_right_shift(x, y, is_arithmetic)拆成bitwise_right_arithmetic_shift(x, y)和bitwise_right_logic_shift(x, y)，然后最后在

https://github.com/PaddlePaddle/Paddle/pull/58092/files#diff-9852da3163fbef8a980df35d58e40d115076ed3eb511b020a76963c03d68a2c6R60-R63
template <typename T>
struct BitwiseRightShiftFunctor {
  HOSTDEVICE T operator()(const T a, const T b) const { return a >> b; }
};
这个地方，分别对uint8和其他int类型进行处理，主要针对有符号数且为负数，且进行算术右移的时候，存一下符号位然后符号位置零，shift之后放置符号位（大致这样）

或者有什么其他比较好的实现方法嘛？希望大佬指点指点~

可以分开实现的。你可以尝试下在python层抽象一层，对用户提供例如bitwise_right_shift(x, y, is_arithmetic)这样的接口，而在底层拆开实现

cocoshe commented 5 months ago

@xuxinyi389 查明白之前的问题了，是在CPU下x和y在broadcast的时候，会将x和y中len(shape)更大的作为x，更小的作为y，然后再做broadcast，所以当len(x.shape)比len(y.shape)小的时候，原本的x>>y就变成了y>>x，需要补充InverseXXX来解决（看上去是个历史遗留todo，未来应该会适应）然后现在遇到一个问题，就是支持的类型是uint8, int8, int16, int32, int64，但是之前调研的JAX中是组合cast和算术位移来实现逻辑位移，然而它的基础类型是numpy，支持uint16, uint32, uint64这种，但是paddle目前cast api并不支持uint16, uint32, uint64与int16, int32, int64的转换；我也想过用paddle.where来给有符号负数的位置强行转换为无符号数，也就是符号位先置零（主要是要解决有符号数逻辑右移的情况），然后shift后手动放置符号位，但是发现paddle.where不支持int16，所以感觉在python层实现可能还需要更强的类型支持然后我尝试在cpp层，想要在底层bitwise计算的时候加入一个bool判断算术位移还是逻辑位移，但是发现funcs::ElementwiseCompute写的比较特定，不太方便拓展出一个bool参数来计算想问一下有什么建议嘛？emmm感觉有点头大（思考算术位移的必要性emmmm

paddle cast如果支持 uint 向 int转换，在超出int表示范围时会损失精度，这样的转换是不合理的。

辛苦review~，嗯嗯有道理的，那这样看来似乎竞品都没有可以参考的实现方案。我感觉想要复用大部分bitwise那一套代码的话，是不是可以干脆把算术算数位移和逻辑位移拆开（因为额外的bool参数似乎不太方便融合到bitwise那一套体系中）。比如原来的bitwise_right_shift(x, y, is_arithmetic)拆成bitwise_right_arithmetic_shift(x, y)和bitwise_right_logic_shift(x, y)，然后最后在 https://github.com/PaddlePaddle/Paddle/pull/58092/files#diff-9852da3163fbef8a980df35d58e40d115076ed3eb511b020a76963c03d68a2c6R60-R63
template <typename T>
struct BitwiseRightShiftFunctor {
  HOSTDEVICE T operator()(const T a, const T b) const { return a >> b; }
};
这个地方，分别对uint8和其他int类型进行处理，主要针对有符号数且为负数，且进行算术右移的时候，存一下符号位然后符号位置零，shift之后放置符号位（大致这样） 或者有什么其他比较好的实现方法嘛？希望大佬指点指点~
可以分开实现的。你可以尝试下在python层抽象一层，对用户提供例如bitwise_right_shift(x, y, is_arithmetic)这样的接口，而在底层拆开实现

ok明白了

cocoshe commented 5 months ago

@xuxinyi389 我注意到coverage的ci中cpu下的kernel的四个DEFINE_BITWISE_KERNEL_WITH_INVERSE宏没覆盖到

https://xly.bce.baidu.com/paddlepaddle/paddle/newipipe/detail/9736567/job/24767260

然后我尝试在这个宏加了一个log：

#define DEFINE_BITWISE_KERNEL_WITH_INVERSE(op_type)                         \
  template <typename T, typename Context>                                   \
  void Bitwise##op_type##Kernel(const Context& dev_ctx,                     \
                                const DenseTensor& x,                       \
                                const DenseTensor& y,                       \
                                DenseTensor* out) {                         \
    LOG(INFO) << "I am in the .cc kernel!!!!!!";                            \    ######### 打了这一行log ###########
    funcs::Bitwise##op_type##Functor<T> func;                               \
    funcs::InverseBitwise##op_type##Functor<T> inv_func;                    \
    auto x_dims = x.dims();                                                 \
    auto y_dims = y.dims();                                                 \
    if (x_dims.size() >= y_dims.size()) {                                   \
      funcs::ElementwiseCompute<funcs::Bitwise##op_type##Functor<T>, T>(    \
          dev_ctx, x, y, func, out);                                        \
    } else {                                                                \
      funcs::ElementwiseCompute<funcs::InverseBitwise##op_type##Functor<T>, \
                                T>(dev_ctx, x, y, inv_func, out);           \
    }                                                                       \
  }

然后把基类unittest的setup中的place写死成cpu

class TestBitwiseLeftShiftAPI(unittest.TestCase):
    def setUp(self):
        self.init_input()
        self.place = (
            # paddle.CUDAPlace(0)
            # if paddle.is_compiled_with_cuda()
            # else paddle.CPUPlace()
            paddle.CPUPlace()
        )

最后能过test，并且走的是cpu版本的kernel，下面是log：

λ xxxy /home/pd/Paddle/build ctest -R test_bitwise_shift -V
UpdateCTestConfiguration  from :/home/pd/Paddle/build/DartConfiguration.tcl
UpdateCTestConfiguration  from :/home/pd/Paddle/build/DartConfiguration.tcl
Test project /home/pd/Paddle/build
Constructing a list of tests
Done constructing a list of tests
Updating test list for fixtures
Added 0 tests to meet fixture requirements
Checking test dependency graph...
Checking test dependency graph end
test 742
    Start 742: test_bitwise_shift_op

742: Test command: /home/cmake-3.18.0-Linux-x86_64/bin/cmake "-E" "env" "PYTHONPATH=/home/pd/Paddle/build/python" "/usr/bin/python" "/home/pd/Paddle/tools/test_runner.py" "test_bitwise_shift_op"
742: Test timeout computed to be: 10000000
742: grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
742: W1217 04:08:01.149598 94391 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 6.1, Driver API Version: 12.3, Runtime API Version: 12.0
742: W1217 04:08:01.150851 94391 gpu_resources.cc:164] device: 0, cuDNN Version: 8.8.
742: I1217 04:08:01.243474 94391 program_interpreter.cc:214] New Executor is Running.
742: I1217 04:08:01.243816 94391 bitwise_kernel.cc:65] I am in the .cc kernel!!!!!!
742: I1217 04:08:01.250216 94391 bitwise_kernel.cc:66] I am in the .cc kernel!!!!!!
742: I1217 04:08:01.258257 94391 bitwise_kernel.cc:65] I am in the .cc kernel!!!!!!
742: I1217 04:08:01.265053 94391 bitwise_kernel.cc:66] I am in the .cc kernel!!!!!!
742: I1217 04:08:01.272917 94391 bitwise_kernel.cc:65] I am in the .cc kernel!!!!!!
742: I1217 04:08:01.278879 94391 bitwise_kernel.cc:66] I am in the .cc kernel!!!!!!
742: I1217 04:08:01.286705 94391 bitwise_kernel.cc:65] I am in the .cc kernel!!!!!!
742: I1217 04:08:01.292797 94391 bitwise_kernel.cc:66] I am in the .cc kernel!!!!!!
742: I1217 04:08:01.301602 94391 bitwise_kernel.cc:65] I am in the .cc kernel!!!!!!
742: I1217 04:08:01.307641 94391 bitwise_kernel.cc:66] I am in the .cc kernel!!!!!!
742: I1217 04:08:01.315480 94391 bitwise_kernel.cc:65] I am in the .cc kernel!!!!!!
742: I1217 04:08:01.321446 94391 bitwise_kernel.cc:66] I am in the .cc kernel!!!!!!
742: I1217 04:08:01.329121 94391 bitwise_kernel.cc:65] I am in the .cc kernel!!!!!!
742: I1217 04:08:01.335136 94391 bitwise_kernel.cc:66] I am in the .cc kernel!!!!!!
742: I1217 04:08:01.343986 94391 bitwise_kernel.cc:65] I am in the .cc kernel!!!!!!
742: I1217 04:08:01.350009 94391 bitwise_kernel.cc:66] I am in the .cc kernel!!!!!!
742: I1217 04:08:01.357797 94391 bitwise_kernel.cc:65] I am in the .cc kernel!!!!!!
742: I1217 04:08:01.363785 94391 bitwise_kernel.cc:66] I am in the .cc kernel!!!!!!
742: I1217 04:08:01.371469 94391 bitwise_kernel.cc:65] I am in the .cc kernel!!!!!!
742: I1217 04:08:01.380086 94391 bitwise_kernel.cc:66] I am in the .cc kernel!!!!!!
742: I1217 04:08:01.390872 94391 bitwise_kernel.cc:65] I am in the .cc kernel!!!!!!
742: I1217 04:08:01.396921 94391 bitwise_kernel.cc:66] I am in the .cc kernel!!!!!!
742: I1217 04:08:01.404939 94391 bitwise_kernel.cc:65] I am in the .cc kernel!!!!!!
742: I1217 04:08:01.410903 94391 bitwise_kernel.cc:66] I am in the .cc kernel!!!!!!
742: I1217 04:08:01.418599 94391 bitwise_kernel.cc:65] I am in the .cc kernel!!!!!!
742: I1217 04:08:01.424595 94391 bitwise_kernel.cc:66] I am in the .cc kernel!!!!!!
742: I1217 04:08:01.433780 94391 bitwise_kernel.cc:65] I am in the .cc kernel!!!!!!
742: I1217 04:08:01.439796 94391 bitwise_kernel.cc:66] I am in the .cc kernel!!!!!!
742: I1217 04:08:01.447381 94391 bitwise_kernel.cc:65] I am in the .cc kernel!!!!!!
742: I1217 04:08:01.453291 94391 bitwise_kernel.cc:66] I am in the .cc kernel!!!!!!
742: I1217 04:08:01.460729 94391 bitwise_kernel.cc:65] I am in the .cc kernel!!!!!!
742: I1217 04:08:01.466603 94391 bitwise_kernel.cc:66] I am in the .cc kernel!!!!!!
742: I1217 04:08:01.475275 94391 bitwise_kernel.cc:65] I am in the .cc kernel!!!!!!
742: I1217 04:08:01.481164 94391 bitwise_kernel.cc:66] I am in the .cc kernel!!!!!!
1/1 Test #742: test_bitwise_shift_op ............   Passed    5.06 sec

The following tests passed:
        test_bitwise_shift_op

100% tests passed, 0 tests failed out of 1

Total Test time (real) =   5.24 sec

此外，我也安装了whl包验证了一下：

λ xxxy /home/pd/Paddle/build python
Python 3.9.18 (main, Aug 25 2023, 13:20:04) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import paddle
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
>>> x = paddle.to_tensor([], dtype=paddle.int64)
>>> x = paddle.to_tensor([10,20,30], dtype=paddle.int64, place=paddle.CPUPlace())
>>> y = paddle.to_tensor([1,2,3], dtype=paddle.int64, place=paddle.CPUPlace())
>>> paddle.bitwise_left_shift(x, y, is_arithmetic=True)
I1217 04:15:22.898773 94440 bitwise_kernel.cc:65] I am in the .cc kernel!!!!!!
Tensor(shape=[3], dtype=int64, place=Place(cpu), stop_gradient=True,
       [20 , 80 , 240])
>>> paddle.bitwise_left_shift(x, y, is_arithmetic=False)
I1217 04:15:30.973914 94440 bitwise_kernel.cc:66] I am in the .cc kernel!!!!!!
Tensor(shape=[3], dtype=int64, place=Place(cpu), stop_gradient=True,
       [20 , 80 , 240])
>>> paddle.bitwise_right_shift(x, y, is_arithmetic=True)
I1217 04:16:38.313091 94440 bitwise_kernel.cc:67] I am in the .cc kernel!!!!!!
Tensor(shape=[3], dtype=int64, place=Place(cpu), stop_gradient=True,
       [5, 5, 3])
>>> paddle.bitwise_right_shift(x, y, is_arithmetic=False)
I1217 04:16:45.778152 94440 bitwise_kernel.cc:68] I am in the .cc kernel!!!!!!
Tensor(shape=[3], dtype=int64, place=Place(cpu), stop_gradient=True,
       [5, 5, 3])

也是有glog信息的，不知道为啥coverage的ci说没覆盖到这四个宏。。。

另附符号数逻辑右移例子，以int8为例，-20补码1110,1100，逻辑右移两位0011,1011为59的补码

>> x = paddle.to_tensor([10,-20,30], dtype=paddle.int8, place=paddle.CPUPlace())
>> y = paddle.to_tensor([1,2,3], dtype=paddle.int8, place=paddle.CPUPlace())
>> paddle.bitwise_right_shift(x, y, is_arithmetic=False)
I1217 04:29:36.087157 94440 bitwise_kernel.cc:68] I am in the .cc kernel!!!!!!
Tensor(shape=[3], dtype=int8, place=Place(cpu), stop_gradient=True,
[5 , 59, 3 ])

cocoshe commented 5 months ago

@xuxinyi389 对另外funcs/bitwise_functors.h下的coverage检测没覆盖的问题，我也尝试了一下，在一系列Functor中加了LOG，如下:

template <typename T>
struct BitwiseLeftShiftArithmeticFunctor {
  HOSTDEVICE T operator()(const T a, const T b) const { 
    LOG(INFO) << "This is BitwiseLeftShiftArithmeticFunctor";               // line: 53
    return a << b; }
};

template <typename T>
struct InverseBitwiseLeftShiftArithmeticFunctor {
  inline HOSTDEVICE T operator()(const T a, const T b) const {     
    LOG(INFO) << "This is BitwiseLeftShiftArithmeticFunctor";
return b << a; }
};

template <typename T>
struct BitwiseLeftShiftLogicFunctor {
  HOSTDEVICE T operator()(const T a, const T b) const {     
    LOG(INFO) << "This is BitwiseLeftShiftArithmeticFunctor";
return a << b; }
};

然后安装whl包测试：

>>> x
Tensor(shape=[2, 3], dtype=uint8, place=Place(cpu), stop_gradient=True,
       [[8 , 12, 8 ],
        [2 , 12, 3 ]])
>>> y
Tensor(shape=[2, 3], dtype=uint8, place=Place(cpu), stop_gradient=True,
       [[1, 1, 4],
        [3, 3, 1]])
>>> paddle.bitwise_right_shift(x,y)
Tensor(shape=[2, 3], dtype=uint8, place=Place(cpu), stop_gradient=True,
       [[4, 6, 0],
        [0, 1, 1]])
>>> paddle.bitwise_left_shift(x,y)
I1218 01:42:49.461642 181565 bitwise_functors.h:53] This is BitwiseLeftShiftArithmeticFunctor
I1218 01:42:49.461676 181565 bitwise_functors.h:53] This is BitwiseLeftShiftArithmeticFunctor
I1218 01:42:49.461685 181565 bitwise_functors.h:53] This is BitwiseLeftShiftArithmeticFunctor
I1218 01:42:49.461694 181565 bitwise_functors.h:53] This is BitwiseLeftShiftArithmeticFunctor
I1218 01:42:49.461701 181565 bitwise_functors.h:53] This is BitwiseLeftShiftArithmeticFunctor
I1218 01:42:49.461710 181565 bitwise_functors.h:53] This is BitwiseLeftShiftArithmeticFunctor
Tensor(shape=[2, 3], dtype=uint8, place=Place(cpu), stop_gradient=True,
       [[16 , 24 , 128],
        [16 , 96 , 6  ]])
>>>

这个53行，也就是这个Functor的内部，是走到了的，不明白为什么coverage没检测到。

辛苦师傅帮忙review一下，hackathon 22号前ddl，想赶一赶尽量merge~

cocoshe commented 5 months ago

（欸好像coverage的CI突然×变√了，辛苦review~

xuxinyi389 commented 5 months ago

（欸好像coverage的CI突然×变√了，辛苦review~ 可能是现有coverage工具对于预处理阶段的处理逻辑代码存在不足，已经豁免

xuxinyi389 commented 5 months ago

所有的low，high记得同步修改，不只我评论的那几处

cocoshe commented 5 months ago

看CI过了，之前是发现如果直接位移的话，cpu和gpu下的位移结果有几率不一致，本地gpu测的可以和np的接口对齐，但是试了下换成cpu在左移的时候，如果移动位数很大，发现会偶尔自动发生取模优化（例如uint8左移20bit的时候，实际会左移(20%8=4)位），而gpu下的结果一直都是0，让它溢出。

另外如果移动的数y为负数时，在cpp中是一种"未定义的行为"，实际情况可能会被强转成unsigned的类型？比如a << (unsigned int)-5表示a << 0xFFFFFFFB（不过同样在cpu上的问题，无法对齐；在gpu上遇到y为负数时，左移时置零，右移时取决x的符号位，可以对齐）

The integer promotions are performed on each of the operands. The type of the result is that of the promoted left operand. If the value of the right operand is negative or is greater than or equal to the width of the promoted left operand, the behavior is undefined.

所以为了和np的接口对齐和cpu、gpu结果一致，后来加的代码对这两方面在Functor中额外进行了一些约束。辛苦review~ @xuxinyi389

xuxinyi389 commented 5 months ago

看CI过了，之前是发现如果直接位移的话，cpu和gpu下的位移结果有几率不一致，本地gpu测的可以和np的接口对齐，但是试了下换成cpu在左移的时候，如果移动位数很大，发现会偶尔自动发生取模优化（例如uint8左移20bit的时候，实际会左移(20%8=4)位），而gpu下的结果一直都是0，让它溢出。

另外如果移动的数y为负数时，在cpp中是一种"未定义的行为"，实际情况可能会被强转成unsigned的类型？比如a << (unsigned int)-5表示a << 0xFFFFFFFB（不过同样在cpu上的问题，无法对齐；在gpu上遇到y为负数时，左移时置零，右移时取决x的符号位，可以对齐）

The integer promotions are performed on each of the operands. The type of the result is that of the promoted left operand. If the value of the right operand is negative or is greater than or equal to the width of the promoted left operand, the behavior is undefined.

所以为了和np的接口对齐和cpu、gpu结果一致，后来加的代码对这两方面在Functor中额外进行了一些约束。辛苦review~ @xuxinyi389

了解，对于y为负数的情形。我们相应的设计逻辑和竞品的处理逻辑在RFC中并没有写（后续补充下，附上具体数值例子更好）。还有麻烦确认下对于y为负数时，竞品(pytorch)行为是如何处理的，是否和numpy行为一致。如果有不一致的地方，那个更为合理，说明理由，然后确认我们paddle的处理逻辑。如果赶不上截止日期，后续我帮忙沟通。辛苦了！！

cocoshe commented 5 months ago

了解，对于y为负数的情形。我们相应的设计逻辑和竞品的处理逻辑在RFC中并没有写（后续补充下，附上具体数值例子更好）。还有麻烦确认下对于y为负数时，竞品(pytorch)行为是如何处

我感觉负数情况下不确定的状态可能跟编译环境有关吗？我对负数情况和CPU与GPU结果不一致的情况也感觉比较奇怪，我看竞品似乎没有做特殊处理欸

cocoshe commented 5 months ago

看CI过了，之前是发现如果直接位移的话，cpu和gpu下的位移结果有几率不一致，本地gpu测的可以和np的接口对齐，但是试了下换成cpu在左移的时候，如果移动位数很大，发现会偶尔自动发生取模优化（例如uint8左移20bit的时候，实际会左移(20%8=4)位），而gpu下的结果一直都是0，让它溢出。另外如果移动的数y为负数时，在cpp中是一种"未定义的行为"，实际情况可能会被强转成unsigned的类型？比如a << (unsigned int)-5表示a << 0xFFFFFFFB（不过同样在cpu上的问题，无法对齐；在gpu上遇到y为负数时，左移时置零，右移时取决x的符号位，可以对齐）

The integer promotions are performed on each of the operands. The type of the result is that of the promoted left operand. If the value of the right operand is negative or is greater than or equal to the width of the promoted left operand, the behavior is undefined.

所以为了和np的接口对齐和cpu、gpu结果一致，后来加的代码对这两方面在Functor中额外进行了一些约束。辛苦review~ @xuxinyi389

了解，对于y为负数的情形。我们相应的设计逻辑和竞品的处理逻辑在RFC中并没有写（后续补充下，附上具体数值例子更好）。还有麻烦确认下对于y为负数时，竞品(pytorch)行为是如何处理的，是否和numpy行为一致。如果有不一致的地方，那个更为合理，说明理由，然后确认我们paddle的处理逻辑。如果赶不上截止日期，后续我帮忙沟通。辛苦了！！

噢不好意思刚刚仔细看了下竞品的调度，它kernel的处理是这样的：

https://github.com/pytorch/pytorch/blob/3747aca49a39479c2c5e223b91369db5bd339cdf/aten/src/ATen/native/cpu/BinaryOpsKernel.cpp#L423-L437

REGISTER_DISPATCH(lshift_stub, &lshift_kernel);
......
void lshift_kernel(TensorIteratorBase& iter) {
  AT_DISPATCH_INTEGRAL_TYPES(iter.dtype(), "lshift_cpu", [&]() {
    cpu_kernel_vec(
        iter,
        [](scalar_t a, scalar_t b) -> scalar_t {
          constexpr scalar_t max_shift = sizeof(scalar_t) * CHAR_BIT;
          if ((static_cast<std::make_signed_t<scalar_t>>(b) < 0) ||
              (b >= max_shift)) {
            return 0;
          }
          return static_cast<std::make_unsigned_t<scalar_t>>(a) << b;
        },
        [](Vectorized<scalar_t> a, Vectorized<scalar_t> b) { return a << b; });
  });
}

我今天再详细一下RFC吧~

cocoshe commented 5 months ago

@xuxinyi389 辛苦review看看还有什么需要补充的嘛？昨天在rfc中做了更新~，对竞品的处理逻辑进行了分析（算术左右位移、逻辑左右位移、无符号数、有符号数），列了一下所有情况的预期行为，应该和这个pr中的处理逻辑对齐了~

https://github.com/PaddlePaddle/community/pull/788#issuecomment-1864299531

具体行为定义：（n_bits表示数据类型存储位数，例如int8的n_bits为8，uint16的n_bits为16；当y小于0时为“未定义行为”，等效于位移超过最大位数溢出）

算术位移
- 算术左移：当y小于0，或者y大于等于n_bits时候溢出，返回0；否则正常位移；
- 算术右移：
- 有符号数时：当y小于0，或者y大于等于n_bits时候溢出，返回符号位(a>>(n_bits-1)&1)；否则正常位移；
- 无符号数时：当y小于0，或者y大于等于n_bits时候溢出，返回0；否则正常位移；
逻辑位移
- 逻辑左移：当y小于0，或者y大于等于n_bits时候溢出，返回0；否则正常位移；
- 逻辑右移：
- 有符号数时：当y小于0，或者y大于等于n_bits时候溢出，返回0；否则特殊位移：
```
template <typename T>
HOSTDEVICE T logic_shift_func(const T a, const T b) {
if (b < static_cast<T>(0) || b >= static_cast<T>(sizeof(T) * 8))
  return static_cast<T>(0);
T t = static_cast<T>(sizeof(T) * 8 - 1);
T mask = (((a >> t) << t) >> b) << 1;
return (a >> b) ^ mask;
}
```
  在T mask = (((a >> t) << t) >> b) << 1;中，先(a >> t)取符号位，然后<< t回到原位，再右移b后左移一位，最后与a>>b的结果做亦或，下面举两个例子：
```
example1:
a = 1001,1010  b = 3, 有t=7
((a>>t)<<t) = 1000,0000
mask=(((a>>t)<<t)>>b)<<1 = 1110,0000
a>>b = 1111,0011
所以 (a>>b) ^ mask = 0001,0011

example2:
a = 0001,1010 b = 3, 有t=7
((a>>t)<<t) = 0000,0000
mask=(((a>>t)<<t)>>b)<<1 = 0000,0000
a>>b = 0000,0011
所以 (a>>b) ^ mask = 0000，0011
```
- 无符号数时：当y小于0，或者y大于等于n_bits时候溢出，返回0；否则正常位移；
以上行为中，算术位移与numpy、pytorch的实现对齐；由于numpy和pytorch不支持逻辑位移，所以逻辑位移参考jax的实现思路，用numpy来进行间接实现和验证。

xuxinyi389 commented 5 months ago

由于numpy和pytorch不支持逻辑位移，所以逻辑位移参考jax的实现思路，用numpy来进行间接实现和验证。

代码看起来应该没问题，但是注意到测试数据的shape都比较小，可以增加个大shape的case，还有给RFC文档提了点意见，修改下文档，并给中文API文档也更新下。

cocoshe commented 5 months ago

由于numpy和pytorch不支持逻辑位移，所以逻辑位移参考jax的实现思路，用numpy来进行间接实现和验证。

代码看起来应该没问题，但是注意到测试数据的shape都比较小，可以增加个大shape的case，还有给RFC文档提了点意见，修改下文档，并给中文API文档也更新下。

嗯嗯ok，RFC文档和中文api文档的修改有什么建议呢？好像没有看到comment欸

xuxinyi389 commented 4 months ago

LGTM+通过CI，完善API文档，

cocoshe commented 4 months ago

好奇怪，之前的这个CI是正常过的，然后把shape调大一些，突然过不了，然后我现在甚至直接reset到那个commit，还是过不了（rerun了好多次了）。看了下log发现和之前成功过log有点不同：

没过的CI：https://xly.bce.baidu.com/paddlepaddle/paddle/newipipe/detail/9803458/job/24871692 不知道是啥原因emmm

xuxinyi389 commented 4 months ago

好奇怪，之前的这个CI是正常过的，然后把shape调大一些，突然过不了，然后我现在甚至直接reset到那个commit，还是过不了（rerun了好多次了）。看了下log发现和之前成功过log有点不同：

没过的CI：https://xly.bce.baidu.com/paddlepaddle/paddle/newipipe/detail/9803458/job/24871692 不知道是啥原因emmm

看日志报错类型是这个 fatal error C1202: recursive type or function dependency context too complex 这个流水线是windwos的cpu流水线，环境使用的是VS2017。可以尝试复现

cocoshe commented 4 months ago

看日志报错类型是这个 fatal error C1202: recursive type or function dependency context too complex 这个流水线是windwos的cpu流水线，环境使用的是VS2017。可以尝试复现

我手上没有Windows的机器，前两天就在这个pr的ci上测了下，..\paddle\fluid\pir\dialect\operator\ir\op_dialect.cc(57): fatal error C1202: recursive type or function dependency context too complex这个op_dialect.cc(57)看上去是算子注册相关的问题：

/* op_dialect.cc (57)*/

  // NOTE(zhangbo9674): GET_OP_LIST is defined in pd_op.h which is
  // generated by op_gen.py, see details in
  // paddle/fluid/pir/dialect/CMakeLists.txt.
  // NOTE(Ruting)GET_MANUAL_OP_LIST is define in manual_op.h"
  // use RegisterOps when list has more than two ops.
  RegisterOps<
#define GET_OP_LIST
#include "paddle/fluid/pir/dialect/operator/ir/pd_op_info.cc"  // NOLINT
      >();

测了下感觉问题应该是在算子定义的ops.yaml，在用脚本将其转换为pd_op.h的时候可能出现了一些问题。现在是配置了四个kernel，分别是算术左移、逻辑左移、算术右移、逻辑右移：

- op : bitwise_left_shift_arithmetic
  args : (Tensor x, Tensor y)
  output : Tensor(out)
  infer_meta :
    func : ElementwiseInferMeta
  kernel :
    func : bitwise_left_shift_arithmetic
    backend : x
  inplace: (x -> out)

- op : bitwise_left_shift_logic
  args : (Tensor x, Tensor y)
  output : Tensor(out)
  infer_meta :
    func : ElementwiseInferMeta
  kernel :
    func : bitwise_left_shift_logic
    backend : x
  inplace: (x -> out)

- op : bitwise_right_shift_arithmetic
  args : (Tensor x, Tensor y)
  output : Tensor(out)
  infer_meta :
    func : ElementwiseInferMeta
  kernel :
    func : bitwise_right_shift_arithmetic
    backend : x
  inplace: (x -> out)

- op : bitwise_right_shift_logic
  args : (Tensor x, Tensor y)
  output : Tensor(out)
  infer_meta :
    func : ElementwiseInferMeta
  kernel :
    func : bitwise_right_shift_logic
    backend : x
  inplace: (x -> out)

我先把这个pr中所有其他的部分（math，__init__，test_inplace，test_bitwise_shift，包括.cc和.cu里面添加的代码，.h里面的Functor等）全部注释掉，只留下ops.yaml的修改。似乎当我注释掉两个配置（比如注释掉bitwise_right_shift_arithmetic的配置和bitwise_right_shift_logic的配置）就没有发生那个fatal error C1202: recursive type or function dependency context too complex的错误了，但是如果我只注释掉其中一个kernel的配置（例如注释掉bitwise_right_shift_logic的yaml配置），他还是会出现那个错误。我也尝试直接把kernel名改成了aaaaa,bbbbb,ccccc（怀疑有命名冲突之类的情况），第四个kernel依旧注释掉：

- op : aaaaa
  args : (Tensor x, Tensor y)
  output : Tensor(out)
  infer_meta :
    func : ElementwiseInferMeta
  kernel :
    func : aaaaa
    backend : x
  inplace: (x -> out)

- op : bbbbb
  args : (Tensor x, Tensor y)
  output : Tensor(out)
  infer_meta :
    func : ElementwiseInferMeta
  kernel :
    func : bbbbb
    backend : x
  inplace: (x -> out)

- op : ccccc
  args : (Tensor x, Tensor y)
  output : Tensor(out)
  infer_meta :
    func : ElementwiseInferMeta
  kernel :
    func : ccccc
    backend : x
  inplace: (x -> out)

像这样结果还是报那个错误，所以感觉也不像是函数模板冲突导致C1202无限递归之类的情况，按理来说单纯把这三个当成独立的三个kernel来看，不应该有什么问题吧。。难道是算子的signature之类的发生了什么冲突吗，这个CI目前还没有什么头绪是什么原因。。。而且之前这个CI是正常过的，reset回去却过不了，难道是有什么缓存影响吗，但是ci每次应该是从头pull的，感觉也不应该，这个我感觉非常疑惑emmmm

xuxinyi389 commented 4 months ago

猜测是编译器的问题，对于op的数量长度有限制

xuxinyi389 commented 4 months ago

可能的解决方案：生成的文件pd_op_info.cc里的GET_OP_LIST长度过长导致，如下图所示。可以尝试在代码生成时将pd_op_info.cc的GET_OP_LIST拆分为长度较短的GET_OP_LIST1和GET_OP_LIST2，然后在op_dialect.cc调用中，分别调用GET_OP_LIST1和GET_OP_LIST2。@cocoshe

xuxinyi389 commented 4 months ago

分开注册可以只在windwos平台进行，避免引入其他问题。@cocoshe

cocoshe commented 4 months ago

感谢！看上去那几个ci过了，没想到MSVC还有这种限制。。。长见识了

cocoshe commented 4 months ago

@xuxinyi389 辛苦review，还有什么需要完善的嘛？

xuxinyi389 commented 4 months ago

LGTM

cocoshe commented 4 months ago

@jeff41404 Thanks for your review and instruction, I refactored the code, and put the if judgement to a more deeper cpp space instead of python api.

Because of the limitation of funcs::ElementwiseCompute, we can't add more args(for example, is_arithmetic here) to Functor if we want to reuse the code. I think the paddle/phi/kernels/funcs/bitwise_functors.h can't be unified.

jeff41404 commented 4 months ago

@jeff41404 Thanks for your review and instruction, I refactored the code, and put the if judgement to a more deeper cpp space instead of python api.

Because of the limitation of funcs::ElementwiseCompute, we can't add more args(for example, is_arithmetic here) to Functor if we wan't to reuse the code. I think the paddle/phi/kernels/funcs/bitwise_functors.h can't be unified.

OK, thanks