Closed · strint closed this issue 2 years ago
The ibverbs entries on line 12 correspond to the `flow.env.init_rdma()` call in the Libai repo;
the `version.cpp:22` entry on line 13 should be emitted from here:
Maybe<void> MultiClientSessionContext::TryInit(const ConfigProto& config_proto) {
  if (!is_inited_) {
    DumpVersionInfo();
which corresponds to the Session initialization during the first graph's initialization, in python/oneflow/nn/graph/graph.py:
self._session.TryInit()
So this 24-minute gap is spent in the code that runs between `flow.env.init_rdma()` and graph creation.
This should be the build_model time cost; Chengpeng and I discussed it before: https://github.com/Oneflow-Inc/libai/pull/284#issuecomment-1166807403
When the parameter scale is especially large, what exactly gets so much slower? Tensor creation, or the to_global operation?
A large parameter scale here presumably means not that any single parameter has a large size, but that the number of parameters is especially large?
If the parameter scale is very large, statements like the one below should take longer during the build_model stage, because out_features and in_features grow accordingly, and since large models have many more layers, this statement is also called many more times:
self.weight = flow.nn.Parameter(
    flow.empty(
        (out_features, in_features),
        dtype=flow.float32,
        placement=dist.get_layer_placement(layer_idx),  # for pipeline parallelism placement
        sbp=weight_sbp,
    )
)
The parameter initialization functions may also take correspondingly longer in the large-model case:
nn.init.xavier_normal_(self.weight)
@xiezipeng-ML you could add a time measurement here when reproducing.
Chengpeng, Xiaoyu, and I found that it is `init_method(self.weight)` that makes build_model slow.
A new complete log:
Log line format: [IWEF]yyyymmdd hh:mm:ss.uuuuuu threadid file:line] msg
I20221019 14:00:02.101070 1057827 env_global_objects_scope.cpp:167] Using rpc backend: gRPC
I20221019 14:00:02.185784 1057827 ctrl_server.cpp:40] CtrlServer listening on 0.0.0.0:42721
I20221019 14:00:02.186215 1057827 rank_info_bootstrap_server.cpp:47] RankInfoBootstrapServer listening on 0.0.0.0:53005
I20221019 14:00:14.814558 1057827 epoll_comm_network.cpp:63] CommNet:Epoll listening on 0.0.0.0:37471
I20221019 14:05:08.485886 1057827 ibverbs_comm_network.cpp:147] Using IB device mlx5_0 port 1 gid index 0
I20221019 14:05:15.346899 1057827 ibverbs_comm_network.cpp:177] Connected to peer 7
I20221019 14:05:15.351934 1057827 ibverbs_comm_network.cpp:177] Connected to peer 6
I20221019 14:05:15.352705 1057827 ibverbs_comm_network.cpp:177] Connected to peer 5
I20221019 14:05:15.355347 1057827 ibverbs_comm_network.cpp:177] Connected to peer 4
I20221019 14:05:15.356624 1057827 ibverbs_comm_network.cpp:177] Connected to peer 3
I20221019 14:05:15.359282 1057827 ibverbs_comm_network.cpp:177] Connected to peer 2
I20221019 14:05:15.359902 1057827 ibverbs_comm_network.cpp:177] Connected to peer 1
I20221019 15:12:36.915500 1057827 version.cpp:22] OneFlow git version: 63a3b287
I20221019 15:12:36.932379 1057827 cuda_device_manager_factory.cpp:63] CUDA runtime version: 11.2
I20221019 15:12:36.932765 1057827 cuda_device_manager_factory.cpp:72] cuDNN version: 8.1.1
I20221019 15:12:36.932965 1057827 cuda_device_manager_factory.cpp:85] NCCL version: 2.12.10
I20221019 15:12:50.756304 1057827 ibverbs_comm_network.cpp:147] Using IB device mlx5_0 port 1 gid index 0
I20221019 15:12:52.986742 1057827 ibverbs_comm_network.cpp:177] Connected to peer 7
I20221019 15:12:52.987969 1057827 ibverbs_comm_network.cpp:177] Connected to peer 6
I20221019 15:12:52.988981 1057827 ibverbs_comm_network.cpp:177] Connected to peer 5
I20221019 15:12:52.989643 1057827 ibverbs_comm_network.cpp:177] Connected to peer 4
I20221019 15:12:52.993614 1057827 ibverbs_comm_network.cpp:177] Connected to peer 3
I20221019 15:12:52.994283 1057827 ibverbs_comm_network.cpp:177] Connected to peer 2
I20221019 15:12:52.994951 1057827 ibverbs_comm_network.cpp:177] Connected to peer 1
This part costs a full hour:
I20221019 14:05:15.359902 1057827 ibverbs_comm_network.cpp:177] Connected to peer 1
I20221019 15:12:36.915500 1057827 version.cpp:22] OneFlow git version: 63a3b287
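The gap claimed above can be checked mechanically. Here is a small, self-contained sketch (plain Python, no oneflow needed) that parses the glog timestamp prefix described by the "Log line format" line earlier and computes the interval between two log lines; the two sample lines are copied from the log above.

```python
from datetime import datetime

def glog_time(line: str) -> datetime:
    """Parse the [IWEF]yyyymmdd hh:mm:ss.uuuuuu prefix of a glog line."""
    parts = line[1:].split(maxsplit=2)  # drop the severity letter, take date + time
    return datetime.strptime(parts[0] + " " + parts[1], "%Y%m%d %H:%M:%S.%f")

a = "I20221019 14:05:15.359902 1057827 ibverbs_comm_network.cpp:177] Connected to peer 1"
b = "I20221019 15:12:36.915500 1057827 version.cpp:22] OneFlow git version: 63a3b287"

gap = glog_time(b) - glog_time(a)
print(gap)  # 1:07:21.555598 -- a bit over an hour
```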
Graph compilation spans 15:14:04~15:33:37, 19 min 33 s in total.
from omegaconf import DictConfig
import time
from libai.utils import distributed as dist
import oneflow as flow


class SuperMLP(flow.nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        self.weight1 = flow.nn.Parameter(
            flow.empty(
                (input_dim, output_dim),
                dtype=flow.float32,
                placement=flow.placement("cuda", list(range(dist.get_world_size()))),
                sbp=flow.sbp.broadcast,
            )
        )
        self.weight2 = flow.nn.Parameter(
            flow.empty(
                (output_dim, output_dim),
                dtype=flow.float32,
                placement=flow.placement("cuda", list(range(dist.get_world_size()))),
                sbp=flow.sbp.broadcast,
            )
        )
        self.weight3 = flow.nn.Parameter(
            flow.empty(
                (output_dim, output_dim),
                dtype=flow.float32,
                placement=flow.placement("cuda", list(range(dist.get_world_size()))),
                sbp=flow.sbp.broadcast,
            )
        )
        flow.nn.init.xavier_normal_(self.weight1)
        flow.nn.init.xavier_normal_(self.weight2)
        flow.nn.init.xavier_normal_(self.weight3)

    def forward(self, x):
        x = flow.matmul(x, self.weight1, transpose_b=True)
        x = flow.matmul(x, self.weight2, transpose_b=True)
        x = flow.matmul(x, self.weight3, transpose_b=True)
        return x


dist.synchronize()
s = time.time()
model = SuperMLP(5000, 5000)
dist.synchronize()
e = time.time()
print('build_model time:', e - s)
@strint @CPFLAME does this reproduction work?
You could also report the time before and after commenting out flow.nn.init.xavier_normal_.
The layer count may be too small for it to stand out clearly. With initialization: build_model time: 4.304454565048218. With initialization commented out: build_model time: 2.042632579803467.
The current initialization borrows numpy to generate the data and then uses copy_ to copy it into the tensor, which is at least 2 copies more than initializing directly on the tensor: 1. ndarray -> GlobalTensor; 2. GlobalTensor -> TargetTensor.
There was a plan to refactor the initializers to be implemented as kernels, which would have no redundant copies at all. But since initialization generally happens only once and has little impact on overall performance, its priority was never raised that high. Now that the impact on large models is significant, we can raise the priority and do it. I'll start with the uniform_ family of initialization ops to see how much performance improves, then gradually refactor all the initialization ops the same way.
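To make the two paths concrete, here is a minimal numpy-only sketch of the difference (numpy stands in for both the staging buffer and the tensor; the extra GlobalTensor-to-placement copy is not modeled):

```python
import numpy as np

n = 1024
target = np.empty((n, n), dtype=np.float32)  # stands in for the existing tensor

# Current initializer path, as described above: generate the values in a
# fresh numpy buffer, then copy them into the existing tensor. The staging
# buffer is an extra allocation plus an extra copy (and with global tensors
# there is a further copy to the target placement).
staging = np.random.default_rng(0).standard_normal((n, n)).astype(np.float32)
np.copyto(target, staging)

# Kernel-style path: fill the existing buffer in place, no staging copy.
np.random.default_rng(0).standard_normal((n, n), dtype=np.float32, out=target)
```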
Is there a quick way to work around this? Model initialization is costing a full hour.
You can use flow.normal instead of flow.empty for initialization; std and mean need to be computed according to the xavier_normal formula ( https://github.com/oneflow-inc/oneflow/blob/e636732475f642bce0ceb992c3227ecac74b307e/python/oneflow/ops/initializer_register.py#L191-L192 ).
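For reference, the textbook Xavier/Glorot normal formula (the one torch.nn.init.xavier_normal_ uses; the linked initializer_register.py lines should reduce to the same thing) is std = gain * sqrt(2 / (fan_in + fan_out)), with mean always 0. A minimal sketch:

```python
from math import sqrt

def xavier_normal_std(fan_in: int, fan_out: int, gain: float = 1.0) -> float:
    """Std of the Xavier/Glorot normal distribution (mean is always 0.0)."""
    return gain * sqrt(2.0 / (fan_in + fan_out))

# For the 5000x5000 weights in the SuperMLP repro above:
std = xavier_normal_std(5000, 5000)
print(std)  # ~0.014142
```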
In a real model you can wrap it following the example below, then try replacing flow.empty everywhere in one go.
from omegaconf import DictConfig
import time
from libai.utils import distributed as dist
import oneflow as flow
from math import sqrt
from oneflow.ops.util.initializer_util import calc_fan

generator = flow.Generator()
generator.manual_seed(0)


class SuperMLP(flow.nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        self.weight1 = flow.nn.Parameter(
            flow.normal(
                0.0,
                1.0 / sqrt(2.0 / calc_fan((input_dim, output_dim), "fan_sum", "channels_first")),
                (input_dim, output_dim),
                generator=generator,
                dtype=flow.float32,
                placement=flow.placement("cuda", list(range(dist.get_world_size()))),
                sbp=flow.sbp.broadcast,
            )
        )
        self.weight2 = flow.nn.Parameter(
            flow.normal(
                0.0,
                1.0 / sqrt(2.0 / calc_fan((output_dim, output_dim), "fan_sum", "channels_first")),
                (output_dim, output_dim),
                generator=generator,
                dtype=flow.float32,
                placement=flow.placement("cuda", list(range(dist.get_world_size()))),
                sbp=flow.sbp.broadcast,
            )
        )
        self.weight3 = flow.nn.Parameter(
            flow.normal(
                0.0,
                1.0 / sqrt(2.0 / calc_fan((output_dim, output_dim), "fan_sum", "channels_first")),
                (output_dim, output_dim),
                generator=generator,
                dtype=flow.float32,
                placement=flow.placement("cuda", list(range(dist.get_world_size()))),
                sbp=flow.sbp.broadcast,
            )
        )
        # flow.nn.init.xavier_normal_(self.weight1)
        # flow.nn.init.xavier_normal_(self.weight2)
        # flow.nn.init.xavier_normal_(self.weight3)

    def forward(self, x):
        x = flow.matmul(x, self.weight1, transpose_b=True)
        x = flow.matmul(x, self.weight2, transpose_b=True)
        x = flow.matmul(x, self.weight3, transpose_b=True)
        return x


dist.synchronize()
s = time.time()
model = SuperMLP(5000, 5000)
dist.synchronize()
e = time.time()
print('build_model time:', e - s)
Test case: t5_nl12_nah12_hs768_FP16_actrue_mp2_pp1_mb16_gb512_2n4g
zero_stage=2
Ran IDEA's single-node 8-GPU configuration on oneflow28; the build model time was 132.61724853515625 s. The slow part should then be graph compilation, but it hit OOM before compilation finished.
@strint
Also add the GLOG_v=1 environment variable, rerun, and then post the OneFlow.INFO log as well.
t5 2-node 4-GPU test
- Machines: oneflow-25 and oneflow-28, 8 GPUs total across 2 nodes
- oneflow master Oneflow-Inc/oneflow@93d19f3
- libai main e9ca408
Test case:
t5_nl12_nah12_hs768_FP16_actrue_mp2_pp1_mb16_gb512_2n4g
zero_stage=2
Also run an extra single-node 4-GPU performance comparison between Megatron and oneflow.
Test case: t5_nl12_nah12_hs768_FP16_actrue_mp2_pp1_mb32_gb512_1n4g
zero_stage=2
Single node, 8 GPUs, libai branch: test_t5_time; version: 0.8.1.dev20221019+cu112 git_commit: f97f09f1 cmake_build_type: Release rdma: True mlir: True
build model time--------------------------------:133.27645230293274
version: 0.8.0+cu112 git_commit: a6d4cb80 cmake_build_type: Release rdma: True mlir: True
Single node, 8 GPUs, libai branch: test_t5_time; build model time--------------------------------:128.69339418411255
We can add timing for the build model part in libai and give it to IDEA to try, to see whether this reproduces stably.
The measurement here shows only 2 minutes, but IDEA's log shows the build model part takes 24 minutes.
Add a time measurement around that small section (the parameter creation plus the nn.init.xavier_normal_(self.weight) call) to check whether the cost really does come from there.
The build model step may no longer show the 24 minutes measured last time. Today we had IDEA rerun the earlier single-node 8-GPU configuration; here are their results:
Log file created at: 2022/10/24 17:34:34
Running on machine: dgx035.scc.idea
Running duration (h:mm:ss): 0:00:00
Log line format: [IWEF]yyyymmdd hh:mm:ss.uuuuuu threadid file:line] msg
I20221024 17:34:34.962337 1725024 env_global_objects_scope.cpp:161] Using rpc backend: gRPC
I20221024 17:34:35.045975 1725024 ctrl_server.cpp:40] CtrlServer listening on 0.0.0.0:33969
I20221024 17:34:35.057220 1725024 rank_info_bootstrap_server.cpp:47] RankInfoBootstrapServer listening on 0.0.0.0:53005
I20221024 17:34:37.832413 1725024 epoll_comm_network.cpp:63] CommNet:Epoll listening on 0.0.0.0:38241
I20221024 17:39:24.159760 1725024 ibverbs_comm_network.cpp:147] Using IB device mlx5_0 port 1 gid index 0
I20221024 17:39:31.180423 1725024 ibverbs_comm_network.cpp:177] Connected to peer 7
I20221024 17:39:31.182106 1725024 ibverbs_comm_network.cpp:177] Connected to peer 6
I20221024 17:39:31.182943 1725024 ibverbs_comm_network.cpp:177] Connected to peer 5
I20221024 17:39:31.183637 1725024 ibverbs_comm_network.cpp:177] Connected to peer 4
I20221024 17:39:31.184548 1725024 ibverbs_comm_network.cpp:177] Connected to peer 3
I20221024 17:39:31.190455 1725024 ibverbs_comm_network.cpp:177] Connected to peer 2
I20221024 17:39:31.191169 1725024 ibverbs_comm_network.cpp:177] Connected to peer 1
I20221024 17:43:28.173650 1725024 version.cpp:22] OneFlow git version: 22eabed
I20221024 17:43:28.174440 1725024 cuda_device_manager_factory.cpp:63] CUDA runtime version: 11.2
I20221024 17:43:28.174600 1725024 cuda_device_manager_factory.cpp:72] cuDNN version: 8.1.1
I20221024 17:43:28.174736 1725024 cuda_device_manager_factory.cpp:85] NCCL version: 2.12.10
I20221024 17:43:42.918134 1725024 ibverbs_comm_network.cpp:147] Using IB device mlx5_0 port 1 gid index 0
I20221024 17:43:43.090883 1725024 ibverbs_comm_network.cpp:177] Connected to peer 7
I20221024 17:43:44.198565 1725024 ibverbs_comm_network.cpp:177] Connected to peer 6
I20221024 17:43:44.199221 1725024 ibverbs_comm_network.cpp:177] Connected to peer 5
I20221024 17:43:44.199839 1725024 ibverbs_comm_network.cpp:177] Connected to peer 4
I20221024 17:43:44.200453 1725024 ibverbs_comm_network.cpp:177] Connected to peer 3
I20221024 17:43:44.709785 1725024 ibverbs_comm_network.cpp:177] Connected to peer 2
I20221024 17:43:44.710520 1725024 ibverbs_comm_network.cpp:177] Connected to peer 1
I20221024 17:44:55.652488 1725024 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_0-InsertPinnedIdentityOpPass
I20221024 17:45:18.134761 1725024 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_0-InsertPinnedIdentityOpPass
I20221024 17:45:18.135576 1725024 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 InsertPinnedIdentityOpPass","mem_rss":"22273.000000 MB","mem_vm":"354918.000000 MB","time_cost":"22 seconds"}
I20221024 17:45:18.135789 1725024 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_1-EliminateDeadNodesPass
I20221024 17:45:41.548008 1725024 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_1-EliminateDeadNodesPass
I20221024 17:45:41.549613 1725024 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 EliminateDeadNodesPass","mem_rss":"22297.000000 MB","mem_vm":"354941.000000 MB","time_cost":"23 seconds"}
I20221024 17:45:41.549818 1725024 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_2-NormalizationExponentialAverageAutoTickPass
I20221024 17:45:41.549978 1725024 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_2-NormalizationExponentialAverageAutoTickPass
I20221024 17:45:41.550169 1725024 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 NormalizationExponentialAverageAutoTickPass","mem_rss":"22297.000000 MB","mem_vm":"354941.000000 MB","time_cost":"0 seconds"}
I20221024 17:45:41.550320 1725024 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_3-AutoMixedPrecision
Xiaoyu, these results just came out @strint
Both the log you added and the glog output show the cost is now about 4 minutes:
I20221024 17:39:31.191169 1725024 ibverbs_comm_network.cpp:177] Connected to peer 1
I20221024 17:43:28.173650 1725024 version.cpp:22] OneFlow git version: 22eabed
Though, as just mentioned, this run did not use the previous parameters, so it is not a meaningful reference.
# /PATH/TO/oneflow/python/oneflow/nn/init.py
# Decorate normal_ and xavier_normal_ with @get_time
import time
import oneflow as flow


def init_():
    global Time_Counter4Xavier_Normal
    Time_Counter4Xavier_Normal = 0


def get_time_counter():
    return Time_Counter4Xavier_Normal


def synchronize():
    world_size = flow.env.get_world_size()
    if world_size == 1:
        return
    flow.comm.barrier()


def get_time(f):
    def inner(*args, **kwargs):
        global Time_Counter4Xavier_Normal
        synchronize()
        s = time.time()
        res = f(*args, **kwargs)
        e = time.time()
        synchronize()
        Time_Counter4Xavier_Normal = Time_Counter4Xavier_Normal + (e - s)
        return res
    return inner
oneflow version | build model time (s) | init method time (s) | init method time share |
---|---|---|---|
version: 0.8.1.dev20221023+cu112 git_commit: 22eabed cmake_build_type: Release rdma: True mlir: True | 4.444961071014404 | 2.3625545501708984 | 0.5315129901985237 |
branch : refactor-normal_initializer version: 0.8.1+cu112.git.4f084665bb git_commit: 4f084665bb cmake_build_type: Release rdma: False mlir: False | 2.8537516593933105 | 0.6972320079803467 | 0.2443211923102565 |
@strint @wyg1997 @CPFLAME
Colleagues at IDEA report that the build model time here has dropped to 5 minutes; the previous two reports were 24 minutes and 1 hour.
I20221028 14:56:37.125906 2360039 epoll_comm_network.cpp:63] CommNet:Epoll listening on 0.0.0.0:32973
I20221028 15:01:43.440121 2360039 version.cpp:22] OneFlow git version: 5b3a585
git branch: https://github.com/Oneflow-Inc/oneflow/pull/9245
Install method: python3 -m pip install --pre oneflow -f https://staging.oneflow.info/branch/release/compile_cost_cnt/cu112
Below is the log recorded with GLOG_v=1:
As you can see, there is a 24-minute gap between lines 12 and 13 of that text.