Oneflow-Inc / oneflow

OneFlow is a deep learning framework designed to be user-friendly, scalable and efficient.
http://www.oneflow.org
Apache License 2.0
5.78k stars 658 forks

fix fuse-bn-add-relu bug. #10533

Closed: cccddd77 closed this pull request 1 month ago

cccddd77 commented 1 month ago

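When training with the --Graph --use-fp16 --fuse-bn-relu --fuse-bn-add-relu ... arguments, the eval stage fails at the FusedNormalizationAddRelu operator. The cause is that the cuDNN interface used by the cudnn_fused_normalization_add_relu operator supports the CUDNN_BATCHNORM_OPS_BN_ACTIVATION / CUDNN_BATCHNORM_OPS_BN_ADD_ACTIVATION operations only during training. The related code logic was also written for the training stage and does not fit inference, so this Pass needs to do the operator replacement according to whether the graph is in the inference stage.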
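
To illustrate the idea, here is a minimal Python sketch of a fusion pass that is gated on the training flag. It is not the code from this PR: the function `fuse_bn_patterns`, the op labels (`"bn"`, `"add"`, `"relu"`) and the fused names are all hypothetical, and it assumes the simplest reading of the fix, namely that the fused ops are only emitted for training graphs while eval graphs keep the unfused ops.

```python
from typing import List

# Hypothetical labels for illustration only; these are not OneFlow op type names.
FUSED_BN_ADD_RELU = "fused_bn_add_relu"
FUSED_BN_RELU = "fused_bn_relu"


def fuse_bn_patterns(ops: List[str], is_training: bool) -> List[str]:
    """Toy rewrite pass over a linear list of op labels.

    Assumption: the fused kernels are only valid in training, so an
    inference graph is returned unchanged.
    """
    if not is_training:
        # Eval/inference: keep the plain bn / add / relu ops.
        return list(ops)

    fused: List[str] = []
    i = 0
    while i < len(ops):
        if ops[i:i + 3] == ["bn", "add", "relu"]:
            # bn -> add -> relu collapses into one fused op.
            fused.append(FUSED_BN_ADD_RELU)
            i += 3
        elif ops[i:i + 2] == ["bn", "relu"]:
            # bn -> relu collapses into one fused op.
            fused.append(FUSED_BN_RELU)
            i += 2
        else:
            fused.append(ops[i])
            i += 1
    return fused


train_graph = fuse_bn_patterns(["conv", "bn", "add", "relu"], is_training=True)
eval_graph = fuse_bn_patterns(["conv", "bn", "add", "relu"], is_training=False)
print(train_graph)
print(eval_graph)
```

The same op sequence is rewritten for the training graph and left untouched for the eval graph, which is the behaviour the description asks the Pass to distinguish. A real pass works on the job graph rather than on a flat list.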

github-actions[bot] commented 1 month ago

View latest API docs preview at: https://oneflow-staging.oss-cn-beijing.aliyuncs.com/docs/Oneflow-Inc/oneflow/pr/10533/

github-actions[bot] commented 1 month ago

Speed stats:
```
GPU Name: NVIDIA GeForce RTX 3080 Ti

❌ OneFlow resnet50 time: 43.7ms (= 4371.0ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 57.8ms (= 5779.1ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.32 (= 57.8ms / 43.7ms)

OneFlow resnet50 time: 26.1ms (= 2607.0ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 36.8ms (= 3681.6ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.41 (= 36.8ms / 26.1ms)

OneFlow resnet50 time: 18.4ms (= 3681.1ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 35.7ms (= 7137.5ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.94 (= 35.7ms / 18.4ms)

OneFlow resnet50 time: 16.9ms (= 3383.7ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 32.5ms (= 6505.9ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.92 (= 32.5ms / 16.9ms)

OneFlow resnet50 time: 17.3ms (= 3458.9ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 28.5ms (= 5695.3ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.65 (= 28.5ms / 17.3ms)

OneFlow swin dataloader time: 0.200s (= 40.099s / 200, num_workers=1)
PyTorch swin dataloader time: 0.128s (= 25.633s / 200, num_workers=1)
Relative speed: 0.639 (= 0.128s / 0.200s)

OneFlow swin dataloader time: 0.054s (= 10.876s / 200, num_workers=4)
PyTorch swin dataloader time: 0.033s (= 6.554s / 200, num_workers=4)
Relative speed: 0.603 (= 0.033s / 0.054s)

OneFlow swin dataloader time: 0.031s (= 6.162s / 200, num_workers=8)
PyTorch swin dataloader time: 0.017s (= 3.316s / 200, num_workers=8)
Relative speed: 0.538 (= 0.017s / 0.031s)

❌ OneFlow resnet50 time: 49.3ms (= 4925.2ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 64.7ms (= 6472.1ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.31 (= 64.7ms / 49.3ms)

OneFlow resnet50 time: 37.4ms (= 3739.8ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 47.4ms (= 4741.9ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.27 (= 47.4ms / 37.4ms)

OneFlow resnet50 time: 27.9ms (= 5580.1ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 38.8ms (= 7768.3ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.39 (= 38.8ms / 27.9ms)

OneFlow resnet50 time: 25.0ms (= 5007.9ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 38.3ms (= 7665.6ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.53 (= 38.3ms / 25.0ms)

OneFlow resnet50 time: 24.7ms (= 4944.3ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 35.8ms (= 7161.8ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.45 (= 35.8ms / 24.7ms)
```

github-actions[bot] commented 1 month ago

Speed stats:
```
GPU Name: NVIDIA GeForce RTX 3080 Ti

❌ OneFlow resnet50 time: 43.8ms (= 4375.8ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 56.6ms (= 5656.7ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.29 (= 56.6ms / 43.8ms)

OneFlow resnet50 time: 26.2ms (= 2621.3ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 37.4ms (= 3736.6ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.43 (= 37.4ms / 26.2ms)

OneFlow resnet50 time: 18.9ms (= 3786.0ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 35.5ms (= 7101.2ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.88 (= 35.5ms / 18.9ms)

OneFlow resnet50 time: 17.3ms (= 3450.5ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 31.8ms (= 6358.1ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.84 (= 31.8ms / 17.3ms)

OneFlow resnet50 time: 17.0ms (= 3397.8ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 28.8ms (= 5764.4ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.70 (= 28.8ms / 17.0ms)

OneFlow swin dataloader time: 0.200s (= 40.026s / 200, num_workers=1)
PyTorch swin dataloader time: 0.129s (= 25.706s / 200, num_workers=1)
Relative speed: 0.642 (= 0.129s / 0.200s)

OneFlow swin dataloader time: 0.054s (= 10.768s / 200, num_workers=4)
PyTorch swin dataloader time: 0.033s (= 6.578s / 200, num_workers=4)
Relative speed: 0.611 (= 0.033s / 0.054s)

OneFlow swin dataloader time: 0.031s (= 6.177s / 200, num_workers=8)
PyTorch swin dataloader time: 0.017s (= 3.349s / 200, num_workers=8)
Relative speed: 0.542 (= 0.017s / 0.031s)

❌ OneFlow resnet50 time: 49.4ms (= 4942.6ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 65.1ms (= 6513.2ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.32 (= 65.1ms / 49.4ms)

OneFlow resnet50 time: 36.7ms (= 3665.5ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 46.5ms (= 4653.2ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.27 (= 46.5ms / 36.7ms)

OneFlow resnet50 time: 27.7ms (= 5536.1ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 39.6ms (= 7927.6ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.43 (= 39.6ms / 27.7ms)

OneFlow resnet50 time: 25.2ms (= 5046.0ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 38.5ms (= 7700.7ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.53 (= 38.5ms / 25.2ms)

OneFlow resnet50 time: 24.9ms (= 4989.6ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 36.9ms (= 7381.0ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.48 (= 36.9ms / 24.9ms)
```