Oneflow-Inc / libai

LiBai(李白): A Toolbox for Large-Scale Distributed Parallel Training
https://libai.readthedocs.io
Apache License 2.0
389 stars · 55 forks

[Bug][MT5] build model takes a very long time in certain environments #404

Closed strint closed 1 year ago

strint commented 1 year ago

Git branch: https://github.com/Oneflow-Inc/oneflow/pull/9245

Installation: python3 -m pip install --pre oneflow -f https://staging.oneflow.info/branch/release/compile_cost_cnt/cu112

Below is the log recorded with GLOG_v=1 enabled:

(screenshot of the GLOG_v=1 log output)

As shown, there is a 24-minute gap between lines 12 and 13 of the log.

strint commented 1 year ago

Line 12, the ibverbs-related one, corresponds to the flow.env.init_rdma() call in the LiBai repo.

Line 13's version.cpp:22 should be called from here:

Maybe<void> MultiClientSessionContext::TryInit(const ConfigProto& config_proto) {
  if (!is_inited_) {
    DumpVersionInfo();

This corresponds to the Session initialization during the first graph's initialization, in python/oneflow/nn/graph/graph.py:

self._session.TryInit()

So this 24-minute gap is spent in the code that runs between flow.env.init_rdma() and graph creation.
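One generic way to localize a gap like this is to bracket the suspect region with wall-clock timestamps; a minimal stdlib sketch (the names here are ours, not LiBai APIs):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed_region(name, log=print):
    # Logs the wall-clock duration of the enclosed code region.
    start = time.monotonic()
    log(f"enter {name}")
    try:
        yield
    finally:
        log(f"leave {name} ({time.monotonic() - start:.3f} s)")

# Usage: bracket the code that runs between flow.env.init_rdma() and the
# first graph build, then read the gap off the log.
with timed_region("init_rdma -> first graph build"):
    time.sleep(0.01)  # placeholder for the real work
```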

xiezipeng-ML commented 1 year ago

Line 12, the ibverbs-related one, corresponds to the flow.env.init_rdma() call in the LiBai repo.

Line 13's version.cpp:22 should be called from here:

Maybe<void> MultiClientSessionContext::TryInit(const ConfigProto& config_proto) {
  if (!is_inited_) {
    DumpVersionInfo();

This corresponds to the Session initialization during the first graph's initialization, in python/oneflow/nn/graph/graph.py:

self._session.TryInit()

So this 24-minute gap is spent in the code that runs between flow.env.init_rdma() and graph creation.

This should be the time spent in build_model; Cheng Peng and I discussed it before: https://github.com/Oneflow-Inc/libai/pull/284#issuecomment-1166807403

strint commented 1 year ago

When the parameter scale is very large, what exactly gets so much slower — tensor creation, the to_global operations?

By "large parameter scale" here, presumably it's not that any single parameter is large, but that the number of parameters is very large?

CPFLAME commented 1 year ago

With a very large parameter scale, statements like the one below should take longer during build_model, because out_features and in_features grow accordingly; and since a large model has more layers, this statement is also called many more times.

self.weight = flow.nn.Parameter(
            flow.empty(
                (out_features, in_features),
                dtype=flow.float32,
                placement=dist.get_layer_placement(layer_idx),  # for pipeline parallelism placement
                sbp=weight_sbp,
            )
        )

The parameter-initialization functions may also take correspondingly longer on a large model:

nn.init.xavier_normal_(self.weight)
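To make the scaling concrete, a rough back-of-the-envelope sketch (the layer and weight counts below are hypothetical placeholders, not MT5's actual config):

```python
# Each Parameter construction pays a fixed per-call cost (placement setup,
# global-tensor creation), so total build_model cost grows with call count.
hidden = 768            # hypothetical hidden size
layers = 12             # hypothetical layer count
weights_per_layer = 8   # hypothetical: attention + MLP weight matrices

calls = layers * weights_per_layer  # number of Parameter constructions
elements = calls * hidden * hidden  # elements to initialize
print(calls, elements)              # 96 constructions
```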

strint commented 1 year ago

With a very large parameter scale, statements like the one below should take longer during build_model, because out_features and in_features grow accordingly; and since a large model has more layers, this statement is also called many more times.

self.weight = flow.nn.Parameter(
            flow.empty(
                (out_features, in_features),
                dtype=flow.float32,
                placement=dist.get_layer_placement(layer_idx),  # for pipeline parallelism placement
                sbp=weight_sbp,
            )
        )

The parameter-initialization functions may also take correspondingly longer on a large model:

nn.init.xavier_normal_(self.weight)

@xiezipeng-ML you could add timing statistics here when reproducing.

xiezipeng-ML commented 1 year ago

Cheng Peng, Xiaoyu, and I found that it is init_method(self.weight) that makes build_model so slow.

strint commented 1 year ago

A new complete log:


Log line format: [IWEF]yyyymmdd hh:mm:ss.uuuuuu threadid file:line] msg
I20221019 14:00:02.101070 1057827 env_global_objects_scope.cpp:167] Using rpc backend: gRPC
I20221019 14:00:02.185784 1057827 ctrl_server.cpp:40] CtrlServer listening on 0.0.0.0:42721
I20221019 14:00:02.186215 1057827 rank_info_bootstrap_server.cpp:47] RankInfoBootstrapServer listening on 0.0.0.0:53005
I20221019 14:00:14.814558 1057827 epoll_comm_network.cpp:63] CommNet:Epoll listening on 0.0.0.0:37471
I20221019 14:05:08.485886 1057827 ibverbs_comm_network.cpp:147] Using IB device mlx5_0 port 1 gid index 0
I20221019 14:05:15.346899 1057827 ibverbs_comm_network.cpp:177] Connected to peer 7
I20221019 14:05:15.351934 1057827 ibverbs_comm_network.cpp:177] Connected to peer 6
I20221019 14:05:15.352705 1057827 ibverbs_comm_network.cpp:177] Connected to peer 5
I20221019 14:05:15.355347 1057827 ibverbs_comm_network.cpp:177] Connected to peer 4
I20221019 14:05:15.356624 1057827 ibverbs_comm_network.cpp:177] Connected to peer 3
I20221019 14:05:15.359282 1057827 ibverbs_comm_network.cpp:177] Connected to peer 2
I20221019 14:05:15.359902 1057827 ibverbs_comm_network.cpp:177] Connected to peer 1
I20221019 15:12:36.915500 1057827 version.cpp:22] OneFlow git version: 63a3b287
I20221019 15:12:36.932379 1057827 cuda_device_manager_factory.cpp:63] CUDA runtime version: 11.2
I20221019 15:12:36.932765 1057827 cuda_device_manager_factory.cpp:72] cuDNN version: 8.1.1
I20221019 15:12:36.932965 1057827 cuda_device_manager_factory.cpp:85] NCCL version: 2.12.10
I20221019 15:12:50.756304 1057827 ibverbs_comm_network.cpp:147] Using IB device mlx5_0 port 1 gid index 0
I20221019 15:12:52.986742 1057827 ibverbs_comm_network.cpp:177] Connected to peer 7
I20221019 15:12:52.987969 1057827 ibverbs_comm_network.cpp:177] Connected to peer 6
I20221019 15:12:52.988981 1057827 ibverbs_comm_network.cpp:177] Connected to peer 5
I20221019 15:12:52.989643 1057827 ibverbs_comm_network.cpp:177] Connected to peer 4
I20221019 15:12:52.993614 1057827 ibverbs_comm_network.cpp:177] Connected to peer 3
I20221019 15:12:52.994283 1057827 ibverbs_comm_network.cpp:177] Connected to peer 2
I20221019 15:12:52.994951 1057827 ibverbs_comm_network.cpp:177] Connected to peer 1

This part's cost reached one hour:

I20221019 14:05:15.359902 1057827 ibverbs_comm_network.cpp:177] Connected to peer 1
I20221019 15:12:36.915500 1057827 version.cpp:22] OneFlow git version: 63a3b287

Graph compilation ran from 15:14:04 to 15:33:37, 19:33 in total.
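Gaps like these can be computed mechanically from the glog timestamps (format [IWEF]yyyymmdd hh:mm:ss.uuuuuu, per the "Log line format" header above); a small stdlib sketch:

```python
import re
from datetime import datetime

LOG = """\
I20221019 14:05:15.359902 1057827 ibverbs_comm_network.cpp:177] Connected to peer 1
I20221019 15:12:36.915500 1057827 version.cpp:22] OneFlow git version: 63a3b287
"""

def parse_ts(line):
    # glog prefix: severity letter, then yyyymmdd hh:mm:ss.uuuuuu
    m = re.match(r"[IWEF](\d{8}) (\d{2}:\d{2}:\d{2}\.\d{6})", line)
    return datetime.strptime(f"{m.group(1)} {m.group(2)}", "%Y%m%d %H:%M:%S.%f")

first, second = LOG.strip().splitlines()
gap = parse_ts(second) - parse_ts(first)
print(gap)  # 1:07:21.555598 for the two lines above
```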

xiezipeng-ML commented 1 year ago
from omegaconf import DictConfig
import time
from libai.utils import distributed as dist
import oneflow as flow

class SuperMLP(flow.nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        self.weight1 = flow.nn.Parameter(
            flow.empty(
                (input_dim, output_dim),
                dtype=flow.float32,
                placement=flow.placement("cuda", list(range(dist.get_world_size()))),  
                sbp=flow.sbp.broadcast,
            )
        )
        self.weight2 = flow.nn.Parameter(
            flow.empty(
                (output_dim, output_dim),
                dtype=flow.float32,
                placement=flow.placement("cuda", list(range(dist.get_world_size()))),  
                sbp=flow.sbp.broadcast,
            )
        )
        self.weight3 = flow.nn.Parameter(
            flow.empty(
                (output_dim, output_dim),
                dtype=flow.float32,
                placement=flow.placement("cuda", list(range(dist.get_world_size()))),  
                sbp=flow.sbp.broadcast,
            )
        )
        flow.nn.init.xavier_normal_(self.weight1)
        flow.nn.init.xavier_normal_(self.weight2)
        flow.nn.init.xavier_normal_(self.weight3)

    def forward(self, x):
        x = flow.matmul(x, self.weight1, transpose_b=True)
        x = flow.matmul(x, self.weight2, transpose_b=True)
        x = flow.matmul(x, self.weight3, transpose_b=True)
        return x

dist.synchronize()
s = time.time()
model = SuperMLP(5000, 5000)
dist.synchronize()
e = time.time()
print('build_model time:', e-s)

@strint @CPFLAME does this work as a reproduction?

CPFLAME commented 1 year ago

Could you describe the timing difference before and after commenting out flow.nn.init.xavier_normal_?

xiezipeng-ML commented 1 year ago
Maybe the layer count isn't high enough for it to be obvious. With initialization: build_model time: 4.304454565048218. With initialization commented out: build_model time: 2.042632579803467.

wyg1997 commented 1 year ago

The current initialization borrows numpy to generate the data and then uses copy_ to copy it onto the tensor, which takes at least 2 more copies than initializing directly on the tensor: 1. ndarray -> GlobalTensor; 2. GlobalTensor -> TargetTensor.

There was already a plan to refactor the initializers to kernel implementations, which would remove the extra copies entirely, but since initialization normally runs only once and barely affects overall performance, it wasn't given high priority. Now that the impact on large models is significant, we can raise the priority. I'll start with the uniform_ family of initialization ops to see how much performance improves, then gradually refactor all of the initialization ops.
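The extra-copy path described above can be illustrated schematically (numpy stands in for OneFlow tensors here; this sketches the data flow only, not OneFlow's actual implementation):

```python
import numpy as np

def init_via_numpy(shape, rng):
    # Current path: generate with numpy, then copy twice.
    data = rng.standard_normal(shape).astype(np.float32)  # ndarray
    global_buf = data.copy()              # copy 1: ndarray -> GlobalTensor
    target = np.empty(shape, dtype=np.float32)
    np.copyto(target, global_buf)         # copy 2: GlobalTensor -> TargetTensor
    return target

def init_in_place(shape, rng):
    # Kernel-based path: generate directly into the target, no extra copies.
    target = np.empty(shape, dtype=np.float32)
    target[...] = rng.standard_normal(shape)
    return target

rng = np.random.default_rng(0)
w = init_via_numpy((4, 4), rng)
print(w.shape, w.dtype)
```

Both paths produce the same values; the kernel-based path just skips the intermediate buffers.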

strint commented 1 year ago

Is there a quick workaround? Model initialization is costing as much as one hour.

wyg1997 commented 1 year ago

Is there a quick workaround? Model initialization is costing as much as one hour.

You can use flow.normal instead of flow.empty to initialize; std and mean need to be computed from the xavier_normal formula (https://github.com/oneflow-inc/oneflow/blob/e636732475f642bce0ceb992c3227ecac74b307e/python/oneflow/ops/initializer_register.py#L191-L192).

In the actual model you can wrap this up following the example below and try replacing flow.empty everywhere.

from omegaconf import DictConfig
import time
from libai.utils import distributed as dist
import oneflow as flow
from math import sqrt

from oneflow.ops.util.initializer_util import calc_fan

generator = flow.Generator()
generator.manual_seed(0)

class SuperMLP(flow.nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        self.weight1 = flow.nn.Parameter(
            flow.normal(
                0.0,
                sqrt(2.0 / calc_fan((input_dim, output_dim), "fan_sum", "channels_first")),  # xavier_normal: std = sqrt(2 / (fan_in + fan_out))
                (input_dim, output_dim),
                generator=generator,
                dtype=flow.float32,
                placement=flow.placement("cuda", list(range(dist.get_world_size()))),  
                sbp=flow.sbp.broadcast,
            )
        )
        self.weight2 = flow.nn.Parameter(
            flow.normal(
                0.0,
                sqrt(2.0 / calc_fan((output_dim, output_dim), "fan_sum", "channels_first")),  # xavier_normal: std = sqrt(2 / (fan_in + fan_out))
                (output_dim, output_dim),
                generator=generator,
                dtype=flow.float32,
                placement=flow.placement("cuda", list(range(dist.get_world_size()))),  
                sbp=flow.sbp.broadcast,
            )
        )
        self.weight3 = flow.nn.Parameter(
            flow.normal(
                0.0,
                sqrt(2.0 / calc_fan((output_dim, output_dim), "fan_sum", "channels_first")),  # xavier_normal: std = sqrt(2 / (fan_in + fan_out))
                (output_dim, output_dim),
                generator=generator,
                dtype=flow.float32,
                placement=flow.placement("cuda", list(range(dist.get_world_size()))),  
                sbp=flow.sbp.broadcast,
            )
        )
        #  flow.nn.init.xavier_normal_(self.weight1)
        #  flow.nn.init.xavier_normal_(self.weight2)
        #  flow.nn.init.xavier_normal_(self.weight3)

    def forward(self, x):
        x = flow.matmul(x, self.weight1, transpose_b=True)
        x = flow.matmul(x, self.weight2, transpose_b=True)
        x = flow.matmul(x, self.weight3, transpose_b=True)
        return x

dist.synchronize()
s = time.time()
model = SuperMLP(5000, 5000)
dist.synchronize()
e = time.time()
print('build_model time:', e-s)
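For reference, the std plugged into flow.normal above should match the standard Xavier/Glorot-normal formula; a minimal stdlib check (the helper name is ours):

```python
from math import sqrt

def xavier_normal_std(fan_in, fan_out, gain=1.0):
    # Glorot & Bengio: std = gain * sqrt(2 / (fan_in + fan_out))
    return gain * sqrt(2.0 / (fan_in + fan_out))

print(xavier_normal_std(5000, 5000))  # ~0.01414 for the 5000x5000 weights above
```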

xyn1201 commented 1 year ago

T5 test on 2 machines × 4 GPUs

xiezipeng-ML commented 1 year ago

Ran IDEA's single-machine 8-GPU config on oneflow28; build model took 132.61724853515625 s. It should be the graph compilation that's slow, but it hit OOM before compilation finished.

@strint

strint commented 1 year ago

Ran IDEA's single-machine 8-GPU config on oneflow28

Please also set the GLOG_v=1 environment variable, run it again, and then post the OneFlow.INFO log as well.

chengtbf commented 1 year ago

T5 test on 2 machines × 4 GPUs

  • Machines: oneflow-25 and oneflow-28, 8 GPUs across the 2 machines
  • oneflow master Oneflow-Inc/oneflow@93d19f3
  • libai main e9ca408
  • Test case: t5_nl12_nah12_hs768_FP16_actrue_mp2_pp1_mb16_gb512_2n4g zero_stage=2

    • libai: 4530 MiB / 50.92 samples/s / compile time about 7 min; log: oss://oneflow-test/libai_vs_megatron/1019_t5/libai_zero/
    • Megatron-DeepSpeed: 4783 MiB / 164.3 samples/s / compile time < 10 s

Please also run a single-machine 4-GPU performance comparison between Megatron and oneflow.

xyn1201 commented 1 year ago

Please also run a single-machine 4-GPU performance comparison between Megatron and oneflow.

xiezipeng-ML commented 1 year ago

nightly

Single machine, 8 GPUs; libai branch: test_t5_time; version: 0.8.1.dev20221019+cu112 git_commit: f97f09f1 cmake_build_type: Release rdma: True mlir: True

build model time--------------------------------:133.27645230293274

stable

version: 0.8.0+cu112 git_commit: a6d4cb80 cmake_build_type: Release rdma: True mlir: True

Single machine, 8 GPUs; libai branch: test_t5_time; build model time--------------------------------:128.69339418411255

strint commented 1 year ago

We can add timing statistics for the build model part in libai and have IDEA try it, to see whether this reproduces consistently.

The measurement here is only about 2 minutes, but IDEA's log shows the build model part taking 24 minutes.

strint commented 1 year ago

With a very large parameter scale, statements like the one below should take longer during build_model, because out_features and in_features grow accordingly; and since a large model has more layers, this statement is also called many more times.

self.weight = flow.nn.Parameter(
            flow.empty(
                (out_features, in_features),
                dtype=flow.float32,
                placement=dist.get_layer_placement(layer_idx),  # for pipeline parallelism placement
                sbp=weight_sbp,
            )
        )

The parameter-initialization functions may also take correspondingly longer on a large model:

nn.init.xavier_normal_(self.weight)

Let's add timing statistics to this small section to check whether the cost is really here.

xiezipeng-ML commented 1 year ago

The build model step may no longer take the 24 minutes measured last time. Today I had IDEA re-run the earlier single-machine 8-GPU setup; here are their results.

Log file created at: 2022/10/24 17:34:34
Running on machine: dgx035.scc.idea
Running duration (h:mm:ss): 0:00:00
Log line format: [IWEF]yyyymmdd hh:mm:ss.uuuuuu threadid file:line] msg
I20221024 17:34:34.962337 1725024 env_global_objects_scope.cpp:161] Using rpc backend: gRPC
I20221024 17:34:35.045975 1725024 ctrl_server.cpp:40] CtrlServer listening on 0.0.0.0:33969
I20221024 17:34:35.057220 1725024 rank_info_bootstrap_server.cpp:47] RankInfoBootstrapServer listening on 0.0.0.0:53005
I20221024 17:34:37.832413 1725024 epoll_comm_network.cpp:63] CommNet:Epoll listening on 0.0.0.0:38241
I20221024 17:39:24.159760 1725024 ibverbs_comm_network.cpp:147] Using IB device mlx5_0 port 1 gid index 0
I20221024 17:39:31.180423 1725024 ibverbs_comm_network.cpp:177] Connected to peer 7
I20221024 17:39:31.182106 1725024 ibverbs_comm_network.cpp:177] Connected to peer 6
I20221024 17:39:31.182943 1725024 ibverbs_comm_network.cpp:177] Connected to peer 5
I20221024 17:39:31.183637 1725024 ibverbs_comm_network.cpp:177] Connected to peer 4
I20221024 17:39:31.184548 1725024 ibverbs_comm_network.cpp:177] Connected to peer 3
I20221024 17:39:31.190455 1725024 ibverbs_comm_network.cpp:177] Connected to peer 2
I20221024 17:39:31.191169 1725024 ibverbs_comm_network.cpp:177] Connected to peer 1
I20221024 17:43:28.173650 1725024 version.cpp:22] OneFlow git version: 22eabed
I20221024 17:43:28.174440 1725024 cuda_device_manager_factory.cpp:63] CUDA runtime version: 11.2
I20221024 17:43:28.174600 1725024 cuda_device_manager_factory.cpp:72] cuDNN version: 8.1.1
I20221024 17:43:28.174736 1725024 cuda_device_manager_factory.cpp:85] NCCL version: 2.12.10
I20221024 17:43:42.918134 1725024 ibverbs_comm_network.cpp:147] Using IB device mlx5_0 port 1 gid index 0
I20221024 17:43:43.090883 1725024 ibverbs_comm_network.cpp:177] Connected to peer 7
I20221024 17:43:44.198565 1725024 ibverbs_comm_network.cpp:177] Connected to peer 6
I20221024 17:43:44.199221 1725024 ibverbs_comm_network.cpp:177] Connected to peer 5
I20221024 17:43:44.199839 1725024 ibverbs_comm_network.cpp:177] Connected to peer 4
I20221024 17:43:44.200453 1725024 ibverbs_comm_network.cpp:177] Connected to peer 3
I20221024 17:43:44.709785 1725024 ibverbs_comm_network.cpp:177] Connected to peer 2
I20221024 17:43:44.710520 1725024 ibverbs_comm_network.cpp:177] Connected to peer 1
I20221024 17:44:55.652488 1725024 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_0-InsertPinnedIdentityOpPass
I20221024 17:45:18.134761 1725024 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_0-InsertPinnedIdentityOpPass
I20221024 17:45:18.135576 1725024 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 InsertPinnedIdentityOpPass","mem_rss":"22273.000000 MB","mem_vm":"354918.000000 MB","time_cost":"22 seconds"}
I20221024 17:45:18.135789 1725024 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_1-EliminateDeadNodesPass
I20221024 17:45:41.548008 1725024 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_1-EliminateDeadNodesPass
I20221024 17:45:41.549613 1725024 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 EliminateDeadNodesPass","mem_rss":"22297.000000 MB","mem_vm":"354941.000000 MB","time_cost":"23 seconds"}
I20221024 17:45:41.549818 1725024 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_2-NormalizationExponentialAverageAutoTickPass
I20221024 17:45:41.549978 1725024 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_2-NormalizationExponentialAverageAutoTickPass
I20221024 17:45:41.550169 1725024 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 NormalizationExponentialAverageAutoTickPass","mem_rss":"22297.000000 MB","mem_vm":"354941.000000 MB","time_cost":"0 seconds"}
I20221024 17:45:41.550320 1725024 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_3-AutoMixedPrecision

Xiaoyu, these results just came out @strint

strint commented 1 year ago
I20221024 17:39:31.191169 1725024 ibverbs_comm_network.cpp:177] Connected to peer 1
I20221024 17:43:28.173650 1725024 version.cpp:22] OneFlow git version: 22eabed

Both the log you added and the glog output show the cost is now about 4 minutes.

However, as just mentioned, this run did not use the previous parameters, so it has no reference value.

xiezipeng-ML commented 1 year ago

Code:

# /PATH/TO/oneflow/python/oneflow/nn/init.py
# Decorate normal_ and xavier_normal_ with @get_time
import time
import oneflow as flow

def init_():
    global Time_Counter4Xavier_Normal
    Time_Counter4Xavier_Normal = 0

def get_time_counter():
    return Time_Counter4Xavier_Normal

def synchronize():
    world_size = flow.env.get_world_size()
    if world_size == 1:
        return
    flow.comm.barrier()

def get_time(f):
    def inner(*args, **kwargs):
        global Time_Counter4Xavier_Normal
        synchronize()
        s = time.time()
        res = f(*args, **kwargs)
        e = time.time()
        synchronize()
        Time_Counter4Xavier_Normal = Time_Counter4Xavier_Normal + (e - s)
        return res
    return inner

Result:

| oneflow version | build model time | init method time | init method time share |
| --- | --- | --- | --- |
| version: 0.8.1.dev20221023+cu112 git_commit: 22eabed cmake_build_type: Release rdma: True mlir: True | 4.444961071014404 | 2.3625545501708984 | 0.5315129901985237 |
| branch: refactor-normal_initializer version: 0.8.1+cu112.git.4f084665bb git_commit: 4f084665bb cmake_build_type: Release rdma: False mlir: False | 2.8537516593933105 | 0.6972320079803467 | 0.2443211923102565 |

@strint @wyg1997 @CPFLAME

strint commented 1 year ago

Feedback from the IDEA folks: the build model time here has dropped to 5 minutes; the two previously reported figures were 24 minutes and 1 hour.

I20221028 14:56:37.125906 2360039 epoll_comm_network.cpp:63] CommNet:Epoll listening on 0.0.0.0:32973
I20221028 15:01:43.440121 2360039 version.cpp:22] OneFlow git version: 5b3a585