[TODO] Collect framework items to optimize for CustomDevice

PaddlePaddle / PaddleCustomDevice

PaddlePaddle custom device implementation. (Implementation of custom hardware integration for PaddlePaddle)

Apache License 2.0


[TODO] Collect framework items to optimize for CustomDevice #500

Closed: ronny1996 closed this issue 4 months ago

ronny1996 commented 1 year ago

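When Paddle integrated CustomDevice, the main concern was generality across different hardware, so some implementations perform poorly. This issue collects the code that hardware vendors would like to see optimized.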

ShawnNew commented 1 year ago

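We would like Paddle to support the plugin's pinned (page-locked) memory calls in set_constant_with_place, so that clear_grad performance during network training can be optimized.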

ronny1996 commented 1 year ago

https://github.com/PaddlePaddle/Paddle/pull/52872 adds MP support; the c_* operators are implemented by composing small operators, so performance may be poor.

ronny1996 commented 1 year ago

Concat & Split make a large number of memcpy_d2d calls.

YanhuiDua commented 1 year ago

We could develop a feature similar to set_option in PyTorch to enable compile options for certain operators. Reference link: https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/63RC2alpha002/ptmoddevg/ptmigr/ptmigr_0094.html

KimBioInfoStudio commented 1 year ago

The deallocate_data used in the main framework's flat_hash_map (https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/utils/flat_hash_map.h#L743) is illegal in LLVM.

ronny1996 commented 1 year ago

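> We would like Paddle to support the plugin's pinned (page-locked) memory calls in set_constant_with_place, so that clear_grad performance during network training can be optimized.

https://github.com/PaddlePaddle/Paddle/pull/55089 uses the full kernel instead of memcpy.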

ShawnNew commented 1 year ago

Running bloom-560MB from https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/bloom#%E6%A8%A1%E5%9E%8B-finetune runs out of device memory.

YanhuiDua commented 1 year ago

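> Running bloom-560MB from https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/bloom#%E6%A8%A1%E5%9E%8B-finetune runs out of device memory.

Hi, for this I suggest opening a separate issue and posting your device, version, and configuration.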

ShawnNew commented 1 year ago

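> > Running bloom-560MB from https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/bloom#%E6%A8%A1%E5%9E%8B-finetune runs out of device memory.

> Hi, for this I suggest opening a separate issue and posting your device, version, and configuration.

New issue created: https://github.com/PaddlePaddle/PaddleCustomDevice/issues/736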

ronny1996 commented 1 year ago

The add kernel is required to support inplace: https://github.com/PaddlePaddle/Paddle/pull/56205

jinyouzhi commented 1 year ago

I suggest that when init_devices() fails, Paddle fall back to the CPU version of PaddlePaddle and show a prominent warning, rather than crashing outright.
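A minimal sketch of the fallback behavior requested here, in plain Python. The names `select_device` and the `init_devices` callable passed into it are hypothetical stand-ins for illustration, not Paddle APIs; the sketch only shows the "warn and fall back" control flow.

```python
import warnings


def select_device(init_devices, device_type):
    """Try to initialize a custom device; fall back to CPU on failure.

    `init_devices` is a hypothetical callable standing in for the plugin's
    initialization routine. It is assumed to raise an exception on failure.
    """
    try:
        init_devices(device_type)
        return device_type
    except Exception as exc:
        # Emit a prominent warning instead of crashing the process.
        warnings.warn(
            f"Initializing device '{device_type}' failed ({exc}); "
            "falling back to CPU.",
            RuntimeWarning,
            stacklevel=2,
        )
        return "cpu"
```

A caller would then place tensors on whatever device name `select_device` returns, so a broken plugin degrades to CPU execution rather than aborting.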

tiandou-tangdou commented 1 year ago

https://github.com/PaddlePaddle/Paddle/issues/56416

tiandou-tangdou commented 1 year ago

Would you consider supporting custom content in Context (https://github.com/PaddlePaddle/Paddle/issues/54709)?

ronny1996 commented 1 year ago

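> Would you consider supporting custom content in Context (PaddlePaddle/Paddle#54709)?

Hi, adding this interface is somewhat difficult at the moment. The StatRegistry it depends on is hard-coded to support only 16 devices, which conflicts with the plugin-based design of custom device. We will add the interface when it becomes necessary. May I ask which scenarios currently need it?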

tiandou-tangdou commented 1 year ago

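> > Would you consider supporting custom content in Context (PaddlePaddle/Paddle#54709)?

> Hi, adding this interface is somewhat difficult at the moment. The StatRegistry it depends on is hard-coded to support only 16 devices, which conflicts with the plugin-based design of custom device. We will add the interface when it becomes necessary. May I ask which scenarios currently need it?

Thanks for the reply. The 16-device limit does not meet our needs either: a single machine of ours can have up to 32 devices, so for now we have extended StatRegistry to 32. For the generator IncrementOffset in Context, we added PADDLE_WITH_CUSTOM_DEVICE, which is enough for the time being. As for the allocator, my understanding is that it is a design question of whether this use case should be opened up to CustomDevice, which is why I am asking.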
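To illustrate why a hard-coded device limit clashes with a plugin design, here is a rough Python sketch. Both classes are hypothetical and are not Paddle's actual StatRegistry; the first mimics a fixed-capacity table, the second keys statistics by (device type, device id) so the device count is not baked in.

```python
class FixedStatRegistry:
    """Hypothetical registry with a hard-coded device limit."""

    MAX_DEVICES = 16  # assumed limit, mirroring the 16 mentioned above

    def __init__(self):
        self._stats = [0] * self.MAX_DEVICES

    def update(self, dev_id, delta):
        if not 0 <= dev_id < self.MAX_DEVICES:
            raise IndexError(
                f"device id {dev_id} exceeds limit {self.MAX_DEVICES}"
            )
        self._stats[dev_id] += delta
        return self._stats[dev_id]


class DynamicStatRegistry:
    """Hypothetical registry keyed by (device type, device id)."""

    def __init__(self):
        self._stats = {}

    def update(self, dev_type, dev_id, delta):
        # Entries are created on first use, so any device count works.
        key = (dev_type, dev_id)
        self._stats[key] = self._stats.get(key, 0) + delta
        return self._stats[key]
```

With the fixed table, device 31 on a 32-device machine is rejected; the dictionary-based variant accepts it without recompiling a larger constant.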

engineer1109 commented 1 year ago

We need a simple Pass for fusing operators.

qili93 commented 9 months ago

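> We need a simple Pass for fusing operators.

For this, I suggest referring to the unit test file https://github.com/PaddlePaddle/PaddleCustomDevice/blob/develop/backends/npu/tests/unittests/test_custom_pass_npu.py#L25. You can define a custom Pass through the Python interface and register it with the framework.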
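For readers unfamiliar with what a fusion pass does, here is a toy sketch in plain Python. It does not use Paddle's pass API (see the linked unit test for that); the op dictionaries and the `fuse_add_chain` function are hypothetical. It rewrites `t = add(x, y); out = add(t, z)` into a single `sum(x, y, z)` when `t` has no other consumer.

```python
def count_users(ops, name):
    """Count how many ops read the variable `name`."""
    return sum(name in op["inputs"] for op in ops)


def fuse_add_chain(ops):
    """Fuse two chained adds into one sum op (toy pattern/replace pass)."""
    fused = []
    i = 0
    while i < len(ops):
        op = ops[i]
        nxt = ops[i + 1] if i + 1 < len(ops) else None
        matches = (
            nxt is not None
            and op["type"] == "add"
            and nxt["type"] == "add"
            and nxt["inputs"][0] == op["output"]
            # The intermediate must not be read again anywhere else.
            and op["output"] not in nxt["inputs"][1:]
            and count_users(ops, op["output"]) == 1
        )
        if matches:
            fused.append({
                "type": "sum",
                "inputs": op["inputs"] + nxt["inputs"][1:],
                "output": nxt["output"],
            })
            i += 2  # both matched ops are consumed
        else:
            fused.append(op)
            i += 1
    return fused
```

The same pattern-then-replace idea is what a registered custom Pass expresses, only over the framework's graph rather than a list of dictionaries.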

engineer1109 commented 9 months ago

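> > We need a simple Pass for fusing operators.

> For this, I suggest referring to the unit test file https://github.com/PaddlePaddle/PaddleCustomDevice/blob/develop/backends/npu/tests/unittests/test_custom_pass_npu.py#L25. You can define a custom Pass through the Python interface and register it with the framework.

The problem is that I don't use Python.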

engineer1109 commented 9 months ago

Custom Device also lacks a pass for INT8 quantization support.

engineer1109 commented 9 months ago

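> > > We need a simple Pass for fusing operators.

> > For this, I suggest referring to the unit test file https://github.com/PaddlePaddle/PaddleCustomDevice/blob/develop/backends/npu/tests/unittests/test_custom_pass_npu.py#L25. You can define a custom Pass through the Python interface and register it with the framework.

> The problem is that I don't use Python.

The C++ CUDA passes now work properly with Custom Device as well.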