Oneflow-Inc / oneflow

OneFlow is a deep learning framework designed to be user-friendly, scalable and efficient.
http://www.oneflow.org
Apache License 2.0
5.78k stars 658 forks

add_npu_support for group_norm #10496

Closed woaixiaoxiao closed 1 month ago

woaixiaoxiao commented 2 months ago

The original implementation only considered CUDA as the underlying backend and did not account for other hardware such as NPU and MLU.

CLAassistant commented 2 months ago

CLA assistant check
All committers have signed the CLA.

github-actions[bot] commented 1 month ago

View latest API docs preview at: https://oneflow-staging.oss-cn-beijing.aliyuncs.com/docs/Oneflow-Inc/oneflow/pr/10496/

github-actions[bot] commented 1 month ago

Speed stats:
```
GPU Name: NVIDIA GeForce RTX 3080 Ti

❌ OneFlow resnet50 time: 43.7ms (= 4373.8ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 57.7ms (= 5772.1ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.32 (= 57.7ms / 43.7ms)

OneFlow resnet50 time: 26.1ms (= 2612.8ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 38.1ms (= 3805.6ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.46 (= 38.1ms / 26.1ms)

OneFlow resnet50 time: 18.6ms (= 3724.1ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 35.6ms (= 7111.7ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.91 (= 35.6ms / 18.6ms)

OneFlow resnet50 time: 17.2ms (= 3438.0ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 30.9ms (= 6178.6ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.80 (= 30.9ms / 17.2ms)

OneFlow resnet50 time: 17.3ms (= 3469.3ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 27.9ms (= 5589.4ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.61 (= 27.9ms / 17.3ms)

OneFlow swin dataloader time: 0.201s (= 40.146s / 200, num_workers=1)
PyTorch swin dataloader time: 0.127s (= 25.471s / 200, num_workers=1)
Relative speed: 0.634 (= 0.127s / 0.201s)

OneFlow swin dataloader time: 0.057s (= 11.462s / 200, num_workers=4)
PyTorch swin dataloader time: 0.032s (= 6.445s / 200, num_workers=4)
Relative speed: 0.562 (= 0.032s / 0.057s)

OneFlow swin dataloader time: 0.032s (= 6.326s / 200, num_workers=8)
PyTorch swin dataloader time: 0.017s (= 3.340s / 200, num_workers=8)
Relative speed: 0.528 (= 0.017s / 0.032s)

❌ OneFlow resnet50 time: 49.2ms (= 4918.7ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 66.6ms (= 6660.1ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.35 (= 66.6ms / 49.2ms)

OneFlow resnet50 time: 36.5ms (= 3647.7ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 45.8ms (= 4578.7ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.26 (= 45.8ms / 36.5ms)

OneFlow resnet50 time: 27.8ms (= 5556.5ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 38.4ms (= 7671.1ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.38 (= 38.4ms / 27.8ms)

OneFlow resnet50 time: 25.3ms (= 5065.0ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 38.5ms (= 7697.2ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.52 (= 38.5ms / 25.3ms)

OneFlow resnet50 time: 24.5ms (= 4898.0ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 35.9ms (= 7175.3ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.46 (= 35.9ms / 24.5ms)
```
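For readers checking the numbers, the arithmetic behind each entry in the speed stats can be sketched as follows. This is a minimal illustration, not the CI script itself: the helper names `per_iteration` and `relative_speed` are hypothetical, and the only inputs are totals copied from the report. Each time is the total divided by the iteration count, and the relative speed is PyTorch time over OneFlow time, so values above 1 mean OneFlow was faster. The dataloader ratio of 0.634 appears to be computed from the unrounded totals, since the rounded 0.127s / 0.201s gives roughly 0.632.

```python
def per_iteration(total: float, iterations: int) -> float:
    """Average time per iteration (same unit as `total`)."""
    return total / iterations


def relative_speed(pytorch_time: float, oneflow_time: float) -> float:
    """PyTorch time over OneFlow time; > 1 means OneFlow was faster."""
    return pytorch_time / oneflow_time


# resnet50, input_shape=[16, 3, 224, 224], single GPU (totals from the report)
oneflow_ms = per_iteration(4373.8, 100)
pytorch_ms = per_iteration(5772.1, 100)
resnet_ratio = relative_speed(pytorch_ms, oneflow_ms)

# swin dataloader, num_workers=1 (totals from the report, in seconds)
oneflow_s = per_iteration(40.146, 200)
pytorch_s = per_iteration(25.471, 200)
dataloader_ratio = relative_speed(pytorch_s, oneflow_s)

print(f"resnet50: {oneflow_ms:.1f}ms vs {pytorch_ms:.1f}ms, ratio {resnet_ratio:.2f}")
print(f"dataloader: {oneflow_s:.3f}s vs {pytorch_s:.3f}s, ratio {dataloader_ratio:.3f}")
```

Read this way, the resnet50 rows show OneFlow ahead of PyTorch on this GPU (ratios between 1.26 and 1.91), while the swin dataloader rows show it behind (ratios between 0.528 and 0.634).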