Oneflow-Inc / oneflow

OneFlow is a deep learning framework designed to be user-friendly, scalable and efficient.
http://www.oneflow.org
Apache License 2.0
5.79k stars 658 forks

modify clip_grad with no to_global #10443

Open hanwen-sun opened 3 months ago

hanwen-sun commented 3 months ago

Remove the first to_global in the clip_grad norm computation, to avoid an unnecessary all-gather in the tensor-parallel case.
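To illustrate why the all-gather is avoidable, here is a minimal sketch in plain Python (not OneFlow code, and not the actual diff of this PR). It assumes the gradient is split into disjoint shards, one per rank, and shows that the L2 norm of the full gradient can be recovered from per-shard partial sums of squares, so only one scalar per rank needs to be communicated. The function names and the `eps` value in `clip_coefficient` are hypothetical choices made for this illustration.

```python
import math


def norm_via_gather(shards):
    # Baseline: rebuild the full gradient first (what an all-gather would do),
    # then compute its L2 norm.
    full = [x for shard in shards for x in shard]
    return math.sqrt(sum(x * x for x in full))


def norm_via_partial_sums(shards):
    # Each rank reduces its own shard to a single scalar (sum of squares).
    # Only these scalars would need to be summed across ranks.
    partial = [sum(x * x for x in shard) for shard in shards]
    return math.sqrt(sum(partial))


def clip_coefficient(total_norm, max_norm, eps=1e-6):
    # Scale factor applied to every gradient; never larger than 1.
    # eps is an assumed small constant that guards against division by zero.
    return min(1.0, max_norm / (total_norm + eps))


# Two "ranks", each holding one shard of the same gradient.
shards = [[3.0, 0.0], [0.0, 4.0]]
gathered = norm_via_gather(shards)
reduced = norm_via_partial_sums(shards)
coef = clip_coefficient(reduced, max_norm=1.0)
```

Both functions return the same norm for disjoint shards; the second one only moves one number per rank instead of the whole tensor. This identity holds for the L2 norm as written; other norm types need a different per-shard reduction.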

github-actions[bot] commented 3 months ago

Code got formatted by CI. Please request CI again if you still want to have this PR merged. If the PR is from a forked repo, please download the patch files from the GitHub Actions web page and apply them locally.

github-actions[bot] commented 3 months ago

View latest API docs preview at: https://oneflow-staging.oss-cn-beijing.aliyuncs.com/docs/Oneflow-Inc/oneflow/pr/10443/

github-actions[bot] commented 3 months ago

Speed stats:
```
GPU Name: NVIDIA GeForce RTX 3080 Ti
❌ OneFlow resnet50 time: 44.0ms (= 4398.0ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 58.1ms (= 5810.7ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.32 (= 58.1ms / 44.0ms)
OneFlow resnet50 time: 26.1ms (= 2606.6ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 38.5ms (= 3845.4ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.48 (= 38.5ms / 26.1ms)
OneFlow resnet50 time: 19.1ms (= 3815.7ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 35.9ms (= 7176.3ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.88 (= 35.9ms / 19.1ms)
OneFlow resnet50 time: 16.9ms (= 3383.3ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 31.7ms (= 6337.6ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.87 (= 31.7ms / 16.9ms)
OneFlow resnet50 time: 22.3ms (= 4460.8ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 29.5ms (= 5908.5ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.32 (= 29.5ms / 22.3ms)
OneFlow swin dataloader time: 0.202s (= 40.326s / 200, num_workers=1)
PyTorch swin dataloader time: 0.128s (= 25.677s / 200, num_workers=1)
Relative speed: 0.637 (= 0.128s / 0.202s)
OneFlow swin dataloader time: 0.054s (= 10.740s / 200, num_workers=4)
PyTorch swin dataloader time: 0.040s (= 8.059s / 200, num_workers=4)
Relative speed: 0.750 (= 0.040s / 0.054s)
OneFlow swin dataloader time: 0.031s (= 6.112s / 200, num_workers=8)
PyTorch swin dataloader time: 0.016s (= 3.300s / 200, num_workers=8)
Relative speed: 0.540 (= 0.016s / 0.031s)
❌ OneFlow resnet50 time: 49.1ms (= 4909.5ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 69.7ms (= 6972.0ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.42 (= 69.7ms / 49.1ms)
OneFlow resnet50 time: 35.7ms (= 3566.0ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 48.3ms (= 4829.3ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.35 (= 48.3ms / 35.7ms)
OneFlow resnet50 time: 29.1ms (= 5824.8ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 44.3ms (= 8851.0ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.52 (= 44.3ms / 29.1ms)
OneFlow resnet50 time: 25.9ms (= 5183.8ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 41.1ms (= 8216.1ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.58 (= 41.1ms / 25.9ms)
OneFlow resnet50 time: 24.0ms (= 4791.2ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 36.3ms (= 7258.4ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.51 (= 36.3ms / 24.0ms)
```

levi131 commented 3 months ago

The clip_grad-related unit tests did not pass in CI; this needs more debugging.

github-actions[bot] commented 3 months ago

CI failed when running job: cuda-misc. PR label automerge has been removed

github-actions[bot] commented 3 months ago

CI failed when running job: cuda-module. PR label automerge has been removed

github-actions[bot] commented 3 months ago

CI failed when running job: cuda-module. PR label automerge has been removed

github-actions[bot] commented 3 months ago

CI failed when running job: cuda-module. PR label automerge has been removed

github-actions[bot] commented 3 months ago

View latest API docs preview at: https://oneflow-staging.oss-cn-beijing.aliyuncs.com/docs/Oneflow-Inc/oneflow/pr/10443/

github-actions[bot] commented 3 months ago

Code got formatted by CI. Please request CI again if you still want to have this PR merged. If the PR is from a forked repo, please download the patch files from the GitHub Actions web page and apply them locally.

github-actions[bot] commented 3 months ago

View latest API docs preview at: https://oneflow-staging.oss-cn-beijing.aliyuncs.com/docs/Oneflow-Inc/oneflow/pr/10443/

github-actions[bot] commented 2 months ago

View latest API docs preview at: https://oneflow-staging.oss-cn-beijing.aliyuncs.com/docs/Oneflow-Inc/oneflow/pr/10443/

github-actions[bot] commented 2 months ago

Code got formatted by CI. Please request CI again if you still want to have this PR merged. If the PR is from a forked repo, please download the patch files from the GitHub Actions web page and apply them locally.

github-actions[bot] commented 2 months ago

View latest API docs preview at: https://oneflow-staging.oss-cn-beijing.aliyuncs.com/docs/Oneflow-Inc/oneflow/pr/10443/

github-actions[bot] commented 2 months ago

Speed stats:
```
GPU Name: NVIDIA GeForce RTX 3080 Ti
❌ OneFlow resnet50 time: 43.7ms (= 4372.3ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 57.8ms (= 5775.6ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.32 (= 57.8ms / 43.7ms)
OneFlow resnet50 time: 26.2ms (= 2622.2ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 38.0ms (= 3801.6ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.45 (= 38.0ms / 26.2ms)
OneFlow resnet50 time: 19.1ms (= 3814.3ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 35.7ms (= 7133.9ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.87 (= 35.7ms / 19.1ms)
OneFlow resnet50 time: 16.4ms (= 3286.1ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 34.2ms (= 6833.8ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 2.08 (= 34.2ms / 16.4ms)
OneFlow resnet50 time: 17.3ms (= 3460.1ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 29.5ms (= 5908.9ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.71 (= 29.5ms / 17.3ms)
OneFlow swin dataloader time: 0.199s (= 39.800s / 200, num_workers=1)
PyTorch swin dataloader time: 0.130s (= 25.972s / 200, num_workers=1)
Relative speed: 0.653 (= 0.130s / 0.199s)
OneFlow swin dataloader time: 0.056s (= 11.289s / 200, num_workers=4)
PyTorch swin dataloader time: 0.033s (= 6.521s / 200, num_workers=4)
Relative speed: 0.578 (= 0.033s / 0.056s)
OneFlow swin dataloader time: 0.032s (= 6.384s / 200, num_workers=8)
PyTorch swin dataloader time: 0.018s (= 3.696s / 200, num_workers=8)
Relative speed: 0.579 (= 0.018s / 0.032s)
❌ OneFlow resnet50 time: 49.2ms (= 4920.8ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 65.5ms (= 6548.0ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.33 (= 65.5ms / 49.2ms)
OneFlow resnet50 time: 36.3ms (= 3626.9ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 44.9ms (= 4489.6ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.24 (= 44.9ms / 36.3ms)
OneFlow resnet50 time: 27.6ms (= 5529.4ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 38.6ms (= 7729.5ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.40 (= 38.6ms / 27.6ms)
OneFlow resnet50 time: 25.0ms (= 5006.5ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 38.6ms (= 7716.3ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.54 (= 38.6ms / 25.0ms)
OneFlow resnet50 time: 24.8ms (= 4953.0ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 36.1ms (= 7218.2ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.46 (= 36.1ms / 24.8ms)
```

github-actions[bot] commented 2 months ago

View latest API docs preview at: https://oneflow-staging.oss-cn-beijing.aliyuncs.com/docs/Oneflow-Inc/oneflow/pr/10443/

github-actions[bot] commented 2 months ago

Speed stats:
```
GPU Name: NVIDIA GeForce RTX 3080 Ti
❌ OneFlow resnet50 time: 43.7ms (= 4370.5ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 57.9ms (= 5785.8ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.32 (= 57.9ms / 43.7ms)
OneFlow resnet50 time: 26.1ms (= 2607.5ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 38.0ms (= 3796.4ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.46 (= 38.0ms / 26.1ms)
OneFlow resnet50 time: 18.3ms (= 3666.2ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 34.3ms (= 6856.0ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.87 (= 34.3ms / 18.3ms)
OneFlow resnet50 time: 17.2ms (= 3444.0ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 31.2ms (= 6241.2ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.81 (= 31.2ms / 17.2ms)
OneFlow resnet50 time: 16.7ms (= 3334.0ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 28.3ms (= 5651.0ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.69 (= 28.3ms / 16.7ms)
OneFlow swin dataloader time: 0.198s (= 39.676s / 200, num_workers=1)
PyTorch swin dataloader time: 0.128s (= 25.627s / 200, num_workers=1)
Relative speed: 0.646 (= 0.128s / 0.198s)
OneFlow swin dataloader time: 0.055s (= 11.083s / 200, num_workers=4)
PyTorch swin dataloader time: 0.032s (= 6.457s / 200, num_workers=4)
Relative speed: 0.583 (= 0.032s / 0.055s)
OneFlow swin dataloader time: 0.031s (= 6.240s / 200, num_workers=8)
PyTorch swin dataloader time: 0.017s (= 3.368s / 200, num_workers=8)
Relative speed: 0.540 (= 0.017s / 0.031s)
❌ OneFlow resnet50 time: 49.4ms (= 4936.9ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 65.9ms (= 6591.4ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.34 (= 65.9ms / 49.4ms)
OneFlow resnet50 time: 36.6ms (= 3656.9ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 44.6ms (= 4460.7ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.22 (= 44.6ms / 36.6ms)
OneFlow resnet50 time: 27.8ms (= 5561.3ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 39.4ms (= 7885.0ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.42 (= 39.4ms / 27.8ms)
OneFlow resnet50 time: 25.5ms (= 5103.5ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 38.2ms (= 7645.7ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.50 (= 38.2ms / 25.5ms)
OneFlow resnet50 time: 25.0ms (= 4995.9ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 36.2ms (= 7235.2ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.45 (= 36.2ms / 25.0ms)
```

github-actions[bot] commented 2 months ago

Code got formatted by CI. Please request CI again if you still want to have this PR merged. If the PR is from a forked repo, please download the patch files from the GitHub Actions web page and apply them locally.

github-actions[bot] commented 2 months ago

View latest API docs preview at: https://oneflow-staging.oss-cn-beijing.aliyuncs.com/docs/Oneflow-Inc/oneflow/pr/10443/

github-actions[bot] commented 2 months ago

Speed stats:
```
GPU Name: NVIDIA GeForce RTX 3080 Ti
❌ OneFlow resnet50 time: 43.8ms (= 4378.7ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 58.1ms (= 5806.3ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.33 (= 58.1ms / 43.8ms)
OneFlow resnet50 time: 26.8ms (= 2675.1ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 37.9ms (= 3794.8ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.42 (= 37.9ms / 26.8ms)
OneFlow resnet50 time: 18.6ms (= 3724.7ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 37.0ms (= 7393.5ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.99 (= 37.0ms / 18.6ms)
OneFlow resnet50 time: 15.9ms (= 3183.7ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 30.9ms (= 6171.0ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.94 (= 30.9ms / 15.9ms)
OneFlow resnet50 time: 17.5ms (= 3509.0ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 29.4ms (= 5871.0ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.67 (= 29.4ms / 17.5ms)
OneFlow swin dataloader time: 0.201s (= 40.136s / 200, num_workers=1)
PyTorch swin dataloader time: 0.129s (= 25.741s / 200, num_workers=1)
Relative speed: 0.641 (= 0.129s / 0.201s)
OneFlow swin dataloader time: 0.052s (= 10.493s / 200, num_workers=4)
PyTorch swin dataloader time: 0.033s (= 6.639s / 200, num_workers=4)
Relative speed: 0.633 (= 0.033s / 0.052s)
OneFlow swin dataloader time: 0.030s (= 5.987s / 200, num_workers=8)
PyTorch swin dataloader time: 0.016s (= 3.298s / 200, num_workers=8)
Relative speed: 0.551 (= 0.016s / 0.030s)
❌ OneFlow resnet50 time: 49.3ms (= 4934.0ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 66.0ms (= 6596.0ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.34 (= 66.0ms / 49.3ms)
OneFlow resnet50 time: 37.0ms (= 3701.9ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 47.3ms (= 4725.4ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.28 (= 47.3ms / 37.0ms)
OneFlow resnet50 time: 27.6ms (= 5529.2ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 38.5ms (= 7699.8ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.39 (= 38.5ms / 27.6ms)
OneFlow resnet50 time: 25.0ms (= 5008.1ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 40.3ms (= 8068.8ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.61 (= 40.3ms / 25.0ms)
OneFlow resnet50 time: 24.6ms (= 4922.4ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 36.0ms (= 7206.3ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.46 (= 36.0ms / 24.6ms)
```

github-actions[bot] commented 2 months ago

Speed stats:
```
GPU Name: NVIDIA GeForce RTX 3080 Ti
❌ OneFlow resnet50 time: 43.9ms (= 4393.5ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 57.5ms (= 5751.8ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.31 (= 57.5ms / 43.9ms)
OneFlow resnet50 time: 26.6ms (= 2659.6ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 38.2ms (= 3816.2ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.43 (= 38.2ms / 26.6ms)
OneFlow resnet50 time: 17.7ms (= 3543.4ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 34.4ms (= 6878.0ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.94 (= 34.4ms / 17.7ms)
OneFlow resnet50 time: 16.4ms (= 3283.8ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 30.7ms (= 6149.7ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.87 (= 30.7ms / 16.4ms)
OneFlow resnet50 time: 16.5ms (= 3301.3ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 29.8ms (= 5965.3ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.81 (= 29.8ms / 16.5ms)
OneFlow swin dataloader time: 0.200s (= 39.976s / 200, num_workers=1)
PyTorch swin dataloader time: 0.128s (= 25.586s / 200, num_workers=1)
Relative speed: 0.640 (= 0.128s / 0.200s)
OneFlow swin dataloader time: 0.056s (= 11.252s / 200, num_workers=4)
PyTorch swin dataloader time: 0.033s (= 6.562s / 200, num_workers=4)
Relative speed: 0.583 (= 0.033s / 0.056s)
OneFlow swin dataloader time: 0.032s (= 6.326s / 200, num_workers=8)
PyTorch swin dataloader time: 0.017s (= 3.360s / 200, num_workers=8)
Relative speed: 0.531 (= 0.017s / 0.032s)
❌ OneFlow resnet50 time: 49.5ms (= 4953.1ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 66.2ms (= 6618.5ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.34 (= 66.2ms / 49.5ms)
OneFlow resnet50 time: 35.8ms (= 3581.4ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 45.5ms (= 4550.8ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.27 (= 45.5ms / 35.8ms)
OneFlow resnet50 time: 28.0ms (= 5605.3ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 39.8ms (= 7951.5ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.42 (= 39.8ms / 28.0ms)
OneFlow resnet50 time: 25.3ms (= 5067.6ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 39.1ms (= 7827.8ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.54 (= 39.1ms / 25.3ms)
OneFlow resnet50 time: 24.4ms (= 4882.8ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 35.7ms (= 7144.7ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.46 (= 35.7ms / 24.4ms)
```

hanwen-sun commented 1 month ago

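Problem

This PR still has one open issue: the 1n2d (one node, two devices) test for clip_grad fails in CI. On the same hardware (machines 26 and 28), using the same Docker image as the CI environment and the whl built from this PR, I still cannot reproduce the CI failure.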