Oneflow-Inc / libai

LiBai(李白): A Toolbox for Large-Scale Distributed Parallel Training
https://libai.readthedocs.io
Apache License 2.0
391 stars 55 forks source link

[Bug][MT5] graph compile time #407

Closed strint closed 2 years ago

strint commented 2 years ago

8 卡的编译日志

graph 编译开销 15:14:04~15:33:37,总时间为 19:33

从这个日志看,主要的开销在 logical graph 的优化部分,开销有 1032 seconds,对应17.2分钟。

需要本地复现下这个问题。

I20221019 15:14:04.223161 1057827 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_0-InsertPinnedIdentityOpPass
I20221019 15:14:26.565176 1057827 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_0-InsertPinnedIdentityOpPass
I20221019 15:14:26.566160 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 InsertPinnedIdentityOpPass","mem_rss":"20603.000000 MB","mem_vm":"354915.000000 MB","time_cost":"22 seconds"}
I20221019 15:14:26.566430 1057827 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_1-EliminateDeadNodesPass
I20221019 15:14:49.790587 1057827 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_1-EliminateDeadNodesPass
I20221019 15:14:49.791502 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 EliminateDeadNodesPass","mem_rss":"20627.000000 MB","mem_vm":"354939.000000 MB","time_cost":"23 seconds"}
I20221019 15:14:49.791702 1057827 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_2-NormalizationExponentialAverageAutoTickPass
I20221019 15:14:49.792023 1057827 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_2-NormalizationExponentialAverageAutoTickPass
I20221019 15:14:49.792244 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 NormalizationExponentialAverageAutoTickPass","mem_rss":"20627.000000 MB","mem_vm":"354939.000000 MB","time_cost":"0 seconds"}
I20221019 15:14:49.792395 1057827 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_3-AutoMixedPrecision
I20221019 15:15:13.044384 1057827 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_3-AutoMixedPrecision
I20221019 15:15:13.046296 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 AutoMixedPrecision","mem_rss":"20627.000000 MB","mem_vm":"354939.000000 MB","time_cost":"23 seconds"}
I20221019 15:15:13.046572 1057827 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_4-PruneAmpWhiteIdentityOpPass
I20221019 15:15:38.542693 1057827 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_4-PruneAmpWhiteIdentityOpPass
I20221019 15:15:38.545214 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 PruneAmpWhiteIdentityOpPass","mem_rss":"20647.000000 MB","mem_vm":"354958.000000 MB","time_cost":"25 seconds"}
I20221019 15:15:38.545482 1057827 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_5-OptimizerPlacementOptimizationPass
I20221019 15:16:03.916712 1057827 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_5-OptimizerPlacementOptimizationPass
I20221019 15:16:03.933018 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 OptimizerPlacementOptimizationPass","mem_rss":"20647.000000 MB","mem_vm":"354958.000000 MB","time_cost":"25 seconds"}
I20221019 15:16:03.933349 1057827 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_6-FuseAddToOutputPass
I20221019 15:16:28.557394 1057827 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_6-FuseAddToOutputPass
I20221019 15:16:28.559181 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 FuseAddToOutputPass","mem_rss":"20668.000000 MB","mem_vm":"354979.000000 MB","time_cost":"24 seconds"}
I20221019 15:16:28.559403 1057827 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_7-IRRoundTripBeforeAD
I20221019 15:16:28.559756 1057827 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_7-IRRoundTripBeforeAD
I20221019 15:16:28.561041 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 IRRoundTripBeforeAD","mem_rss":"20668.000000 MB","mem_vm":"354979.000000 MB","time_cost":"0 seconds"}
I20221019 15:16:28.561269 1057827 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_8-DynamicLossScaleSchedulePass
I20221019 15:16:53.677692 1057827 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_8-DynamicLossScaleSchedulePass
I20221019 15:16:53.679533 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 DynamicLossScaleSchedulePass","mem_rss":"20668.000000 MB","mem_vm":"354979.000000 MB","time_cost":"25 seconds"}
I20221019 15:16:53.679738 1057827 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_9-AutoTrainStep
I20221019 15:17:18.846604 1057827 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_9-AutoTrainStep
I20221019 15:17:18.849817 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 AutoTrainStep","mem_rss":"20668.000000 MB","mem_vm":"354979.000000 MB","time_cost":"25 seconds"}
I20221019 15:17:18.850050 1057827 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_10-AutoLearningRate
I20221019 15:17:44.180553 1057827 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_10-AutoLearningRate
I20221019 15:17:44.184007 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 AutoLearningRate","mem_rss":"20668.000000 MB","mem_vm":"354979.000000 MB","time_cost":"25 seconds"}
I20221019 15:17:44.184332 1057827 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_11-QuantAwareTraining
I20221019 15:17:44.184722 1057827 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_11-QuantAwareTraining
I20221019 15:17:44.187597 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 QuantAwareTraining","mem_rss":"20668.000000 MB","mem_vm":"354979.000000 MB","time_cost":"0 seconds"}
I20221019 15:17:44.187764 1057827 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_12-GenerateOptimizerOpConfs
I20221019 15:18:10.183295 1057827 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_12-GenerateOptimizerOpConfs
I20221019 15:18:10.187543 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 GenerateOptimizerOpConfs","mem_rss":"20668.000000 MB","mem_vm":"354979.000000 MB","time_cost":"25 seconds"}
I20221019 15:18:10.187813 1057827 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_13-PrunePinnedIdentityOpPass
I20221019 15:21:17.603493 1057827 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_13-PrunePinnedIdentityOpPass
I20221019 15:21:17.604673 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 PrunePinnedIdentityOpPass","mem_rss":"20724.000000 MB","mem_vm":"355035.000000 MB","time_cost":"187 seconds"}
I20221019 15:21:17.604955 1057827 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_14-ReplaceEmbeddingOps
I20221019 15:21:53.796334 1057827 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_14-ReplaceEmbeddingOps
I20221019 15:21:53.799883 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 ReplaceEmbeddingOps","mem_rss":"20724.000000 MB","mem_vm":"355035.000000 MB","time_cost":"36 seconds"}
I20221019 15:21:53.800107 1057827 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_15-SequentialOneEmbeddingOpsPass
I20221019 15:21:53.800293 1057827 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_15-SequentialOneEmbeddingOpsPass
I20221019 15:21:53.803215 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 SequentialOneEmbeddingOpsPass","mem_rss":"20724.000000 MB","mem_vm":"355035.000000 MB","time_cost":"0 seconds"}
I20221019 15:21:53.803361 1057827 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_16-FuseEmbeddingShuffleInteractionPass
I20221019 15:21:53.803520 1057827 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_16-FuseEmbeddingShuffleInteractionPass
I20221019 15:21:53.806424 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 FuseEmbeddingShuffleInteractionPass","mem_rss":"20724.000000 MB","mem_vm":"355035.000000 MB","time_cost":"0 seconds"}
I20221019 15:21:53.806623 1057827 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_17-FuseBCEReduceMeanFwBwPass
I20221019 15:21:53.806780 1057827 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_17-FuseBCEReduceMeanFwBwPass
I20221019 15:21:53.809738 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 FuseBCEReduceMeanFwBwPass","mem_rss":"20724.000000 MB","mem_vm":"355035.000000 MB","time_cost":"0 seconds"}
I20221019 15:21:53.809901 1057827 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_18-AddSspVariableProxy
I20221019 15:21:53.810178 1057827 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_18-AddSspVariableProxy
I20221019 15:21:53.813060 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 AddSspVariableProxy","mem_rss":"20724.000000 MB","mem_vm":"355035.000000 MB","time_cost":"0 seconds"}
I20221019 15:21:53.813227 1057827 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_19-CheckpointingPass
I20221019 15:22:29.968551 1057827 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_19-CheckpointingPass
I20221019 15:22:29.971807 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 CheckpointingPass","mem_rss":"20724.000000 MB","mem_vm":"355035.000000 MB","time_cost":"36 seconds"}
I20221019 15:22:29.972064 1057827 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_20-CudnnFusedNormalizationAddReluPass
I20221019 15:22:29.980051 1057827 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_20-CudnnFusedNormalizationAddReluPass
I20221019 15:22:29.982761 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 CudnnFusedNormalizationAddReluPass","mem_rss":"20724.000000 MB","mem_vm":"355035.000000 MB","time_cost":"0 seconds"}
I20221019 15:22:29.982971 1057827 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_21-PruneCastToStaticShapeOpsPass
I20221019 15:23:06.603806 1057827 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_21-PruneCastToStaticShapeOpsPass
I20221019 15:23:06.607353 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 PruneCastToStaticShapeOpsPass","mem_rss":"20724.000000 MB","mem_vm":"355035.000000 MB","time_cost":"36 seconds"}
I20221019 15:23:06.607604 1057827 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_22-IRRoundTrip
I20221019 15:23:06.607862 1057827 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_22-IRRoundTrip
I20221019 15:23:06.610800 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 IRRoundTrip","mem_rss":"20724.000000 MB","mem_vm":"355035.000000 MB","time_cost":"0 seconds"}
I20221019 15:23:06.610944 1057827 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_23-FuseAddToOutputPass1
I20221019 15:23:42.347548 1057827 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_23-FuseAddToOutputPass1
I20221019 15:23:42.349567 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 FuseAddToOutputPass","mem_rss":"20736.000000 MB","mem_vm":"355047.000000 MB","time_cost":"35 seconds"}
I20221019 15:23:42.349781 1057827 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_24-FuseConsecutiveAddPass
I20221019 15:27:14.090004 1057827 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_24-FuseConsecutiveAddPass
I20221019 15:27:14.092053 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 FuseConsecutiveAddPass","mem_rss":"20736.000000 MB","mem_vm":"355047.000000 MB","time_cost":"211 seconds"}
I20221019 15:27:14.092269 1057827 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_25-IndexedSlicesOptimizerRewritePass
I20221019 15:27:14.092485 1057827 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_25-IndexedSlicesOptimizerRewritePass
I20221019 15:27:14.094110 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 IndexedSlicesOptimizerRewritePass","mem_rss":"20736.000000 MB","mem_vm":"355047.000000 MB","time_cost":"0 seconds"}
I20221019 15:27:14.094267 1057827 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_26-SplitSparseSoftmaxCrossEntropyOpPass
I20221019 15:27:49.477536 1057827 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_26-SplitSparseSoftmaxCrossEntropyOpPass
I20221019 15:27:49.481253 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 SplitSparseSoftmaxCrossEntropyOpPass","mem_rss":"20736.000000 MB","mem_vm":"355047.000000 MB","time_cost":"35 seconds"}
I20221019 15:27:49.494354 1057827 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_27-DoParallelCastBeforeWideningTypeCast
I20221019 15:28:24.853873 1057827 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_27-DoParallelCastBeforeWideningTypeCast
I20221019 15:28:24.855937 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 DoParallelCastBeforeWideningTypeCast","mem_rss":"20736.000000 MB","mem_vm":"355047.000000 MB","time_cost":"35 seconds"}
I20221019 15:28:24.856122 1057827 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_28-FuseCastScalePass
I20221019 15:28:59.874688 1057827 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_28-FuseCastScalePass
I20221019 15:28:59.876689 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 FuseCastScalePass","mem_rss":"20736.000000 MB","mem_vm":"355047.000000 MB","time_cost":"35 seconds"}
I20221019 15:28:59.876899 1057827 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_29-PruneParallelCastOpsPass
I20221019 15:29:34.523111 1057827 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_29-PruneParallelCastOpsPass
I20221019 15:29:34.525197 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 PruneParallelCastOpsPass","mem_rss":"20736.000000 MB","mem_vm":"355047.000000 MB","time_cost":"34 seconds"}
I20221019 15:29:34.525369 1057827 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_30-FuseUpdateOpsPass
I20221019 15:30:03.374408 1057827 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_30-FuseUpdateOpsPass
I20221019 15:30:03.376422 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 FuseUpdateOpsPass","mem_rss":"20736.000000 MB","mem_vm":"355047.000000 MB","time_cost":"28 seconds"}
I20221019 15:30:03.376611 1057827 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_31-FuseModelUpdateCastOpsPass
I20221019 15:30:03.376916 1057827 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_31-FuseModelUpdateCastOpsPass
I20221019 15:30:03.378465 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 FuseModelUpdateCastOpsPass","mem_rss":"20736.000000 MB","mem_vm":"355047.000000 MB","time_cost":"0 seconds"}
I20221019 15:30:03.378621 1057827 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_32-MultiTensorModelUpdatePass
I20221019 15:30:03.378783 1057827 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_32-MultiTensorModelUpdatePass
I20221019 15:30:03.380362 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 MultiTensorModelUpdatePass","mem_rss":"20736.000000 MB","mem_vm":"355047.000000 MB","time_cost":"0 seconds"}
I20221019 15:30:03.380539 1057827 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_33-FixPipelineStageIdPass
I20221019 15:30:27.783113 1057827 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_33-FixPipelineStageIdPass
I20221019 15:30:27.785177 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 FixPipelineStageIdPass","mem_rss":"20736.000000 MB","mem_vm":"355047.000000 MB","time_cost":"24 seconds"}
I20221019 15:30:27.785351 1057827 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_34-PipelineBufferPass
I20221019 15:30:52.037604 1057827 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_34-PipelineBufferPass
I20221019 15:30:52.039598 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 PipelineBufferPass","mem_rss":"20736.000000 MB","mem_vm":"355047.000000 MB","time_cost":"24 seconds"}
I20221019 15:30:52.039783 1057827 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_35-AutoParallelPass
I20221019 15:30:52.039973 1057827 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_35-AutoParallelPass
I20221019 15:30:52.041648 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 AutoParallelPass","mem_rss":"20736.000000 MB","mem_vm":"355047.000000 MB","time_cost":"0 seconds"}
I20221019 15:30:52.041815 1057827 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_36-DumpVariableInfoPass
I20221019 15:30:52.041966 1057827 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_36-DumpVariableInfoPass
I20221019 15:30:52.043444 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 DumpVariableInfoPass","mem_rss":"20736.000000 MB","mem_vm":"355047.000000 MB","time_cost":"0 seconds"}
I20221019 15:30:52.043601 1057827 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_37-DumpBlobParallelConfPass
I20221019 15:31:16.455482 1057827 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_37-DumpBlobParallelConfPass
I20221019 15:31:16.457499 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 DumpBlobParallelConfPass","mem_rss":"20736.000000 MB","mem_vm":"355047.000000 MB","time_cost":"24 seconds"}
I20221019 15:31:16.507730 1057827 time_util.h:95] [count log]{"loc":"[GraphCompile]GraphBase_0 OptimizationLogicalGraph","mem_rss":"20736.000000 MB","mem_vm":"355047.000000 MB","time_cost":"1032 seconds"}
I20221019 15:31:46.462813 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 GroupBoxingByDstParallel","mem_rss":"20860.000000 MB","mem_vm":"355169.000000 MB","time_cost":"5 seconds"}
I20221019 15:32:16.583578 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 BoxingWithMiddleNodes","mem_rss":"20860.000000 MB","mem_vm":"355169.000000 MB","time_cost":"30 seconds"}
I20221019 15:32:20.752637 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 SetCtrlInOpName4VariableOp","mem_rss":"20861.000000 MB","mem_vm":"355171.000000 MB","time_cost":"4 seconds"}
I20221019 15:32:25.366739 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 AutoPrependTick","mem_rss":"20861.000000 MB","mem_vm":"355171.000000 MB","time_cost":"4 seconds"}
I20221019 15:32:29.429275 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 AddTickForTimeShape","mem_rss":"20862.000000 MB","mem_vm":"355171.000000 MB","time_cost":"4 seconds"}
I20221019 15:32:34.392778 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 MultiClientAutoSourceAndSinkTick","mem_rss":"20872.000000 MB","mem_vm":"355181.000000 MB","time_cost":"4 seconds"}
I20221019 15:32:38.480216 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 MultiClientAutoInterfaceCriticalSectionTick","mem_rss":"20872.000000 MB","mem_vm":"355181.000000 MB","time_cost":"4 seconds"}
I20221019 15:32:38.482362 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 SystemOpFillJobNamePass","mem_rss":"20872.000000 MB","mem_vm":"355181.000000 MB","time_cost":"0 seconds"}
I20221019 15:32:42.682269 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 DumpBlobParallelConfPass","mem_rss":"20873.000000 MB","mem_vm":"355181.000000 MB","time_cost":"4 seconds"}
I20221019 15:32:49.726897 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 InsertNcclLogicalOpPass","mem_rss":"20924.000000 MB","mem_vm":"355233.000000 MB","time_cost":"7 seconds"}
I20221019 15:32:54.437355 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 DumpBlobParallelConfPass","mem_rss":"20924.000000 MB","mem_vm":"355233.000000 MB","time_cost":"4 seconds"}
I20221019 15:32:59.195753 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 CheckOpGraph","mem_rss":"20924.000000 MB","mem_vm":"355233.000000 MB","time_cost":"4 seconds"}
I20221019 15:32:59.197530 1057827 time_util.h:95] [count log]{"loc":"[GraphCompile]GraphBase_0 CompleteJob","mem_rss":"20924.000000 MB","mem_vm":"355233.000000 MB","time_cost":"86 seconds"}
I20221019 15:33:03.183274 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 NewOpGraph","mem_rss":"20924.000000 MB","mem_vm":"355233.000000 MB","time_cost":"3 seconds"}
I20221019 15:33:18.505702 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 BuildTaskGraph","mem_rss":"21480.000000 MB","mem_vm":"355790.000000 MB","time_cost":"15 seconds"}
I20221019 15:33:23.862709 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 AddTaskToPlan","mem_rss":"22128.000000 MB","mem_vm":"374224.000000 MB","time_cost":"5 seconds"}
I20221019 15:33:26.148507 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 InferMemShare","mem_rss":"22143.000000 MB","mem_vm":"374768.000000 MB","time_cost":"2 seconds"}
I20221019 15:33:28.872982 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 GenMemBlockAndChunk","mem_rss":"22176.000000 MB","mem_vm":"372719.000000 MB","time_cost":"2 seconds"}
I20221019 15:33:28.875190 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 LogPlan","mem_rss":"22176.000000 MB","mem_vm":"372719.000000 MB","time_cost":"0 seconds"}
I20221019 15:33:29.021628 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 GenRegisterHint","mem_rss":"22176.000000 MB","mem_vm":"372719.000000 MB","time_cost":"0 seconds"}
I20221019 15:33:29.027832 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 GenCollectiveBoxingPlan","mem_rss":"22176.000000 MB","mem_vm":"372719.000000 MB","time_cost":"0 seconds"}
I20221019 15:33:29.203078 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 DumpCtrlRegstInfoToPlan","mem_rss":"22176.000000 MB","mem_vm":"372719.000000 MB","time_cost":"0 seconds"}
I20221019 15:33:29.542577 1057827 plan_util.cpp:965] 
 Graph name GraphBase_0 in Rank: 0, Device: 0 needs to allocate [ 12763.8 MiB ] device memory. 
   In general, Chunk id: 6  memory is [ 5422.6 MiB ] with mem_block_num = 1
        Unreused memory not eager var is  [ 12.6044 MiB ] with mem_block_num = 9412
        Eager Variable Tensor total memory is [ 7328.55 MiB ] with mem_block_num = 2250
I20221019 15:33:29.547951 1057827 plan_util.cpp:965] 
 Graph name GraphBase_0 in Rank: 1, Device: 1 needs to allocate [ 12763.8 MiB ] device memory. 
   In general, Chunk id: 4  memory is [ 5422.6 MiB ] with mem_block_num = 1
        Unreused memory not eager var is  [ 12.6054 MiB ] with mem_block_num = 9418
        Eager Variable Tensor total memory is [ 7328.55 MiB ] with mem_block_num = 2250
I20221019 15:33:29.551901 1057827 plan_util.cpp:965] 
 Graph name GraphBase_0 in Rank: 2, Device: 2 needs to allocate [ 12763.8 MiB ] device memory. 
   In general, Chunk id: 3  memory is [ 5422.6 MiB ] with mem_block_num = 1
        Unreused memory not eager var is  [ 12.6054 MiB ] with mem_block_num = 9418
        Eager Variable Tensor total memory is [ 7328.55 MiB ] with mem_block_num = 2250
I20221019 15:33:29.555778 1057827 plan_util.cpp:965] 
 Graph name GraphBase_0 in Rank: 3, Device: 3 needs to allocate [ 12763.8 MiB ] device memory. 
   In general, Chunk id: 1  memory is [ 5422.6 MiB ] with mem_block_num = 1
        Unreused memory not eager var is  [ 12.6054 MiB ] with mem_block_num = 9418
        Eager Variable Tensor total memory is [ 7328.55 MiB ] with mem_block_num = 2250
I20221019 15:33:29.559621 1057827 plan_util.cpp:965] 
 Graph name GraphBase_0 in Rank: 4, Device: 4 needs to allocate [ 12763.8 MiB ] device memory. 
   In general, Chunk id: 0  memory is [ 5422.6 MiB ] with mem_block_num = 1
        Unreused memory not eager var is  [ 12.607 MiB ] with mem_block_num = 9425
        Eager Variable Tensor total memory is [ 7328.55 MiB ] with mem_block_num = 2250
I20221019 15:33:29.563469 1057827 plan_util.cpp:965] 
 Graph name GraphBase_0 in Rank: 5, Device: 5 needs to allocate [ 12763.8 MiB ] device memory. 
   In general, Chunk id: 2  memory is [ 5422.6 MiB ] with mem_block_num = 1
        Unreused memory not eager var is  [ 12.607 MiB ] with mem_block_num = 9425
        Eager Variable Tensor total memory is [ 7328.55 MiB ] with mem_block_num = 2250
I20221019 15:33:29.567335 1057827 plan_util.cpp:965] 
 Graph name GraphBase_0 in Rank: 6, Device: 6 needs to allocate [ 12763.8 MiB ] device memory. 
   In general, Chunk id: 5  memory is [ 5422.6 MiB ] with mem_block_num = 1
        Unreused memory not eager var is  [ 12.607 MiB ] with mem_block_num = 9425
        Eager Variable Tensor total memory is [ 7328.55 MiB ] with mem_block_num = 2250
I20221019 15:33:29.571290 1057827 plan_util.cpp:965] 
 Graph name GraphBase_0 in Rank: 7, Device: 7 needs to allocate [ 12763.8 MiB ] device memory. 
   In general, Chunk id: 7  memory is [ 5422.6 MiB ] with mem_block_num = 1
        Unreused memory not eager var is  [ 12.607 MiB ] with mem_block_num = 9425
        Eager Variable Tensor total memory is [ 7328.55 MiB ] with mem_block_num = 2250
I20221019 15:33:29.990557 1057827 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 GenMemAndLightPlanLog","mem_rss":"22176.000000 MB","mem_vm":"372719.000000 MB","time_cost":"0 seconds"}
I20221019 15:33:29.992477 1057827 time_util.h:95] [count log]{"loc":"[GraphCompile]GraphBase_0 CompilePlan","mem_rss":"22176.000000 MB","mem_vm":"372719.000000 MB","time_cost":"30 seconds"}
I20221019 15:33:37.748606 1057827 time_util.h:95] [count log]{"loc":"[GraphCompile]GraphBase_0 SyncPlan","mem_rss":"22178.000000 MB","mem_vm":"372719.000000 MB","time_cost":"7 seconds"}
xiezipeng-ML commented 2 years ago
Log file created at: 2022/10/20 03:51:34
Running on machine: oneflow-28
Running duration (h:mm:ss): 0:00:00
Log line format: [IWEF]yyyymmdd hh:mm:ss.uuuuuu threadid file:line] msg
I20221020 03:51:34.293077 2539781 env_global_objects_scope.cpp:167] Using rpc backend: gRPC
I20221020 03:51:34.293746 2539781 ctrl_server.cpp:40] CtrlServer listening on 0.0.0.0:40549
I20221020 03:51:34.293843 2539781 rank_info_bootstrap_server.cpp:47] RankInfoBootstrapServer listening on 0.0.0.0:12345
I20221020 03:51:34.868077 2539781 epoll_comm_network.cpp:63] CommNet:Epoll listening on 0.0.0.0:35169
I20221020 03:51:45.075901 2539781 ibverbs_comm_network.cpp:147] Using IB device mlx5_0 port 1 gid index 0
I20221020 03:51:52.165841 2539781 ibverbs_comm_network.cpp:177] Connected to peer 7
I20221020 03:51:52.214684 2539781 ibverbs_comm_network.cpp:177] Connected to peer 2
I20221020 03:51:52.269661 2539781 ibverbs_comm_network.cpp:177] Connected to peer 1
I20221020 03:51:52.319540 2539781 ibverbs_comm_network.cpp:177] Connected to peer 3
I20221020 03:51:52.373648 2539781 ibverbs_comm_network.cpp:177] Connected to peer 4
I20221020 03:51:52.404248 2539781 ibverbs_comm_network.cpp:177] Connected to peer 5
I20221020 03:51:52.428287 2539781 ibverbs_comm_network.cpp:177] Connected to peer 6
I20221020 03:54:08.694559 2539781 version.cpp:22] OneFlow git version: 63a3b287
I20221020 03:54:08.694598 2539781 cuda_device_manager_factory.cpp:63] CUDA runtime version: 11.2
I20221020 03:54:08.694610 2539781 cuda_device_manager_factory.cpp:72] cuDNN version: 8.1.1
I20221020 03:54:08.694617 2539781 cuda_device_manager_factory.cpp:85] NCCL version: 2.12.10
I20221020 03:54:09.465992 2539781 ibverbs_comm_network.cpp:147] Using IB device mlx5_0 port 1 gid index 0
I20221020 03:54:10.014689 2539781 ibverbs_comm_network.cpp:177] Connected to peer 7
I20221020 03:54:10.064841 2539781 ibverbs_comm_network.cpp:177] Connected to peer 2
I20221020 03:54:10.134281 2539781 ibverbs_comm_network.cpp:177] Connected to peer 1
I20221020 03:54:10.135252 2539781 ibverbs_comm_network.cpp:177] Connected to peer 3
I20221020 03:54:10.135728 2539781 ibverbs_comm_network.cpp:177] Connected to peer 4
I20221020 03:54:10.136314 2539781 ibverbs_comm_network.cpp:177] Connected to peer 5
I20221020 03:54:10.136888 2539781 ibverbs_comm_network.cpp:177] Connected to peer 6
I20221020 03:54:36.581490 2539781 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_0-InsertPinnedIdentityOpPass
I20221020 03:54:46.789353 2539781 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_0-InsertPinnedIdentityOpPass
I20221020 03:54:46.789674 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 InsertPinnedIdentityOpPass","mem_rss":"4570.000000 MB","mem_vm":"57495.000000 MB","time_cost":"10 seconds"}
I20221020 03:54:46.789695 2539781 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_1-EliminateDeadNodesPass
I20221020 03:54:57.393759 2539781 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_1-EliminateDeadNodesPass
I20221020 03:54:57.394107 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 EliminateDeadNodesPass","mem_rss":"4593.000000 MB","mem_vm":"57518.000000 MB","time_cost":"10 seconds"}
I20221020 03:54:57.394126 2539781 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_2-NormalizationExponentialAverageAutoTickPass
I20221020 03:54:57.394141 2539781 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_2-NormalizationExponentialAverageAutoTickPass
I20221020 03:54:57.394181 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 NormalizationExponentialAverageAutoTickPass","mem_rss":"4593.000000 MB","mem_vm":"57518.000000 MB","time_cost":"0 seconds"}
I20221020 03:54:57.394191 2539781 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_3-AutoMixedPrecision
I20221020 03:55:08.028533 2539781 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_3-AutoMixedPrecision
I20221020 03:55:08.029656 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 AutoMixedPrecision","mem_rss":"4593.000000 MB","mem_vm":"57518.000000 MB","time_cost":"10 seconds"}
I20221020 03:55:08.029682 2539781 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_4-PruneAmpWhiteIdentityOpPass
I20221020 03:55:19.663830 2539781 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_4-PruneAmpWhiteIdentityOpPass
I20221020 03:55:19.664950 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 PruneAmpWhiteIdentityOpPass","mem_rss":"4612.000000 MB","mem_vm":"57536.000000 MB","time_cost":"11 seconds"}
I20221020 03:55:19.664976 2539781 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_5-OptimizerPlacementOptimizationPass
I20221020 03:55:31.183609 2539781 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_5-OptimizerPlacementOptimizationPass
I20221020 03:55:31.184535 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 OptimizerPlacementOptimizationPass","mem_rss":"4612.000000 MB","mem_vm":"57536.000000 MB","time_cost":"11 seconds"}
I20221020 03:55:31.184561 2539781 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_6-FuseAddToOutputPass
I20221020 03:55:42.510324 2539781 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_6-FuseAddToOutputPass
I20221020 03:55:42.511440 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 FuseAddToOutputPass","mem_rss":"4632.000000 MB","mem_vm":"57557.000000 MB","time_cost":"11 seconds"}
I20221020 03:55:42.511466 2539781 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_7-IRRoundTripBeforeAD
I20221020 03:55:42.511484 2539781 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_7-IRRoundTripBeforeAD
I20221020 03:55:42.512432 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 IRRoundTripBeforeAD","mem_rss":"4632.000000 MB","mem_vm":"57557.000000 MB","time_cost":"0 seconds"}
I20221020 03:55:42.512449 2539781 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_8-DynamicLossScaleSchedulePass
I20221020 03:55:53.958298 2539781 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_8-DynamicLossScaleSchedulePass
I20221020 03:55:53.959398 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 DynamicLossScaleSchedulePass","mem_rss":"4632.000000 MB","mem_vm":"57557.000000 MB","time_cost":"11 seconds"}
I20221020 03:55:53.959422 2539781 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_9-AutoTrainStep
I20221020 03:56:05.462833 2539781 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_9-AutoTrainStep
I20221020 03:56:05.463927 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 AutoTrainStep","mem_rss":"4632.000000 MB","mem_vm":"57557.000000 MB","time_cost":"11 seconds"}
I20221020 03:56:05.463951 2539781 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_10-AutoLearningRate
I20221020 03:56:16.979053 2539781 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_10-AutoLearningRate
I20221020 03:56:16.980162 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 AutoLearningRate","mem_rss":"4632.000000 MB","mem_vm":"57557.000000 MB","time_cost":"11 seconds"}
I20221020 03:56:16.980186 2539781 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_11-QuantAwareTraining
I20221020 03:56:16.980202 2539781 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_11-QuantAwareTraining
I20221020 03:56:16.981155 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 QuantAwareTraining","mem_rss":"4632.000000 MB","mem_vm":"57557.000000 MB","time_cost":"0 seconds"}
I20221020 03:56:16.981170 2539781 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_12-GenerateOptimizerOpConfs
I20221020 03:56:28.837091 2539781 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_12-GenerateOptimizerOpConfs
I20221020 03:56:28.838275 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 GenerateOptimizerOpConfs","mem_rss":"4633.000000 MB","mem_vm":"57557.000000 MB","time_cost":"11 seconds"}
I20221020 03:56:28.838300 2539781 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_13-PrunePinnedIdentityOpPass
I20221020 03:58:36.296578 2539781 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_13-PrunePinnedIdentityOpPass
I20221020 03:58:36.297967 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 PrunePinnedIdentityOpPass","mem_rss":"4687.000000 MB","mem_vm":"57611.000000 MB","time_cost":"127 seconds"}
I20221020 03:58:36.297995 2539781 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_14-ReplaceEmbeddingOps
I20221020 03:58:52.778256 2539781 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_14-ReplaceEmbeddingOps
I20221020 03:58:52.779438 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 ReplaceEmbeddingOps","mem_rss":"4687.000000 MB","mem_vm":"57611.000000 MB","time_cost":"16 seconds"}
I20221020 03:58:52.779464 2539781 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_15-SequentialOneEmbeddingOpsPass
I20221020 03:58:52.779476 2539781 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_15-SequentialOneEmbeddingOpsPass
I20221020 03:58:52.780608 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 SequentialOneEmbeddingOpsPass","mem_rss":"4687.000000 MB","mem_vm":"57611.000000 MB","time_cost":"0 seconds"}
I20221020 03:58:52.780624 2539781 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_16-FuseEmbeddingShuffleInteractionPass
I20221020 03:58:52.780632 2539781 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_16-FuseEmbeddingShuffleInteractionPass
I20221020 03:58:52.781713 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 FuseEmbeddingShuffleInteractionPass","mem_rss":"4687.000000 MB","mem_vm":"57611.000000 MB","time_cost":"0 seconds"}
I20221020 03:58:52.781731 2539781 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_17-FuseBCEReduceMeanFwBwPass
I20221020 03:58:52.781739 2539781 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_17-FuseBCEReduceMeanFwBwPass
I20221020 03:58:52.782840 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 FuseBCEReduceMeanFwBwPass","mem_rss":"4687.000000 MB","mem_vm":"57611.000000 MB","time_cost":"0 seconds"}
I20221020 03:58:52.782855 2539781 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_18-AddSspVariableProxy
I20221020 03:58:52.782871 2539781 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_18-AddSspVariableProxy
I20221020 03:58:52.783967 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 AddSspVariableProxy","mem_rss":"4687.000000 MB","mem_vm":"57611.000000 MB","time_cost":"0 seconds"}
I20221020 03:58:52.783979 2539781 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_19-CheckpointingPass
I20221020 03:59:09.162068 2539781 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_19-CheckpointingPass
I20221020 03:59:09.163251 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 CheckpointingPass","mem_rss":"4687.000000 MB","mem_vm":"57611.000000 MB","time_cost":"16 seconds"}
I20221020 03:59:09.163275 2539781 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_20-CudnnFusedNormalizationAddReluPass
I20221020 03:59:09.168620 2539781 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_20-CudnnFusedNormalizationAddReluPass
I20221020 03:59:09.169771 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 CudnnFusedNormalizationAddReluPass","mem_rss":"4687.000000 MB","mem_vm":"57611.000000 MB","time_cost":"0 seconds"}
I20221020 03:59:09.169790 2539781 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_21-PruneCastToStaticShapeOpsPass
I20221020 03:59:25.884598 2539781 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_21-PruneCastToStaticShapeOpsPass
I20221020 03:59:25.885782 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 PruneCastToStaticShapeOpsPass","mem_rss":"4687.000000 MB","mem_vm":"57611.000000 MB","time_cost":"16 seconds"}
I20221020 03:59:25.885807 2539781 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_22-IRRoundTrip
I20221020 03:59:25.885823 2539781 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_22-IRRoundTrip
I20221020 03:59:25.886904 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 IRRoundTrip","mem_rss":"4687.000000 MB","mem_vm":"57611.000000 MB","time_cost":"0 seconds"}
I20221020 03:59:25.886921 2539781 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_23-FuseAddToOutputPass1
I20221020 03:59:42.181943 2539781 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_23-FuseAddToOutputPass1
I20221020 03:59:42.183125 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 FuseAddToOutputPass","mem_rss":"4700.000000 MB","mem_vm":"57623.000000 MB","time_cost":"16 seconds"}
I20221020 03:59:42.183151 2539781 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_24-FuseConsecutiveAddPass
I20221020 04:01:19.667111 2539781 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_24-FuseConsecutiveAddPass
I20221020 04:01:19.668280 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 FuseConsecutiveAddPass","mem_rss":"4700.000000 MB","mem_vm":"57623.000000 MB","time_cost":"97 seconds"}
I20221020 04:01:19.668303 2539781 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_25-IndexedSlicesOptimizerRewritePass
I20221020 04:01:19.668318 2539781 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_25-IndexedSlicesOptimizerRewritePass
I20221020 04:01:19.669410 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 IndexedSlicesOptimizerRewritePass","mem_rss":"4700.000000 MB","mem_vm":"57623.000000 MB","time_cost":"0 seconds"}
I20221020 04:01:19.669427 2539781 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_26-SplitSparseSoftmaxCrossEntropyOpPass
I20221020 04:01:35.808634 2539781 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_26-SplitSparseSoftmaxCrossEntropyOpPass
I20221020 04:01:35.809796 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 SplitSparseSoftmaxCrossEntropyOpPass","mem_rss":"4700.000000 MB","mem_vm":"57623.000000 MB","time_cost":"16 seconds"}
I20221020 04:01:35.809823 2539781 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_27-DoParallelCastBeforeWideningTypeCast
I20221020 04:01:51.819540 2539781 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_27-DoParallelCastBeforeWideningTypeCast
I20221020 04:01:51.820730 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 DoParallelCastBeforeWideningTypeCast","mem_rss":"4700.000000 MB","mem_vm":"57623.000000 MB","time_cost":"16 seconds"}
I20221020 04:01:51.820755 2539781 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_28-FuseCastScalePass
I20221020 04:02:07.811364 2539781 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_28-FuseCastScalePass
I20221020 04:02:07.812507 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 FuseCastScalePass","mem_rss":"4700.000000 MB","mem_vm":"57623.000000 MB","time_cost":"15 seconds"}
I20221020 04:02:07.812543 2539781 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_29-PruneParallelCastOpsPass
I20221020 04:02:23.672039 2539781 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_29-PruneParallelCastOpsPass
I20221020 04:02:23.673318 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 PruneParallelCastOpsPass","mem_rss":"4700.000000 MB","mem_vm":"57623.000000 MB","time_cost":"15 seconds"}
I20221020 04:02:23.673344 2539781 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_30-FuseUpdateOpsPass
I20221020 04:02:36.904345 2539781 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_30-FuseUpdateOpsPass
I20221020 04:02:36.904944 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 FuseUpdateOpsPass","mem_rss":"4700.000000 MB","mem_vm":"57623.000000 MB","time_cost":"13 seconds"}
I20221020 04:02:36.904966 2539781 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_31-FuseModelUpdateCastOpsPass
I20221020 04:02:36.904983 2539781 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_31-FuseModelUpdateCastOpsPass
I20221020 04:02:36.906145 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 FuseModelUpdateCastOpsPass","mem_rss":"4700.000000 MB","mem_vm":"57623.000000 MB","time_cost":"0 seconds"}
I20221020 04:02:36.906163 2539781 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_32-MultiTensorModelUpdatePass
I20221020 04:02:36.906172 2539781 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_32-MultiTensorModelUpdatePass
I20221020 04:02:36.907315 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 MultiTensorModelUpdatePass","mem_rss":"4700.000000 MB","mem_vm":"57623.000000 MB","time_cost":"0 seconds"}
I20221020 04:02:36.907330 2539781 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_33-FixPipelineStageIdPass
I20221020 04:02:48.016405 2539781 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_33-FixPipelineStageIdPass
I20221020 04:02:48.017590 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 FixPipelineStageIdPass","mem_rss":"4700.000000 MB","mem_vm":"57623.000000 MB","time_cost":"11 seconds"}
I20221020 04:02:48.017617 2539781 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_34-PipelineBufferPass
I20221020 04:02:59.088153 2539781 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_34-PipelineBufferPass
I20221020 04:02:59.089329 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 PipelineBufferPass","mem_rss":"4700.000000 MB","mem_vm":"57623.000000 MB","time_cost":"11 seconds"}
I20221020 04:02:59.089350 2539781 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_35-AutoParallelPass
I20221020 04:02:59.089363 2539781 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_35-AutoParallelPass
I20221020 04:02:59.090857 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 AutoParallelPass","mem_rss":"4700.000000 MB","mem_vm":"57623.000000 MB","time_cost":"0 seconds"}
I20221020 04:02:59.090875 2539781 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_36-DumpVariableInfoPass
I20221020 04:02:59.090885 2539781 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_36-DumpVariableInfoPass
I20221020 04:02:59.091941 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 DumpVariableInfoPass","mem_rss":"4700.000000 MB","mem_vm":"57623.000000 MB","time_cost":"0 seconds"}
I20221020 04:02:59.091956 2539781 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_37-DumpBlobParallelConfPass
I20221020 04:03:10.248502 2539781 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_37-DumpBlobParallelConfPass
I20221020 04:03:10.249675 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 DumpBlobParallelConfPass","mem_rss":"4700.000000 MB","mem_vm":"57623.000000 MB","time_cost":"11 seconds"}
I20221020 04:03:10.281922 2539781 time_util.h:95] [count log]{"loc":"[GraphCompile]GraphBase_0 OptimizationLogicalGraph","mem_rss":"4700.000000 MB","mem_vm":"57623.000000 MB","time_cost":"513 seconds"}
I20221020 04:03:20.374733 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 GroupBoxingByDstParallel","mem_rss":"4821.000000 MB","mem_vm":"57745.000000 MB","time_cost":"2 seconds"}
I20221020 04:03:33.175345 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 BoxingWithMiddleNodes","mem_rss":"4821.000000 MB","mem_vm":"57745.000000 MB","time_cost":"12 seconds"}
I20221020 04:03:35.233489 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 SetCtrlInOpName4VariableOp","mem_rss":"4824.000000 MB","mem_vm":"57747.000000 MB","time_cost":"2 seconds"}
I20221020 04:03:37.541919 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 AutoPrependTick","mem_rss":"4824.000000 MB","mem_vm":"57747.000000 MB","time_cost":"2 seconds"}
I20221020 04:03:39.564512 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 AddTickForTimeShape","mem_rss":"4824.000000 MB","mem_vm":"57747.000000 MB","time_cost":"2 seconds"}
I20221020 04:03:42.006861 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 MultiClientAutoSourceAndSinkTick","mem_rss":"4834.000000 MB","mem_vm":"57758.000000 MB","time_cost":"2 seconds"}
I20221020 04:03:44.044476 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 MultiClientAutoInterfaceCriticalSectionTick","mem_rss":"4834.000000 MB","mem_vm":"57758.000000 MB","time_cost":"2 seconds"}
I20221020 04:03:44.045892 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 SystemOpFillJobNamePass","mem_rss":"4834.000000 MB","mem_vm":"57758.000000 MB","time_cost":"0 seconds"}
I20221020 04:03:46.165727 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 DumpBlobParallelConfPass","mem_rss":"4834.000000 MB","mem_vm":"57758.000000 MB","time_cost":"2 seconds"}
I20221020 04:03:49.641379 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 InsertNcclLogicalOpPass","mem_rss":"4886.000000 MB","mem_vm":"57810.000000 MB","time_cost":"3 seconds"}
I20221020 04:03:52.109158 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 DumpBlobParallelConfPass","mem_rss":"4886.000000 MB","mem_vm":"57810.000000 MB","time_cost":"2 seconds"}
I20221020 04:03:54.531172 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 CheckOpGraph","mem_rss":"4886.000000 MB","mem_vm":"57810.000000 MB","time_cost":"2 seconds"}
I20221020 04:03:54.532294 2539781 time_util.h:95] [count log]{"loc":"[GraphCompile]GraphBase_0 CompleteJob","mem_rss":"4886.000000 MB","mem_vm":"57810.000000 MB","time_cost":"40 seconds"}
I20221020 04:03:56.445068 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 NewOpGraph","mem_rss":"4886.000000 MB","mem_vm":"57810.000000 MB","time_cost":"1 seconds"}
I20221020 04:04:03.977631 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 BuildTaskGraph","mem_rss":"5343.000000 MB","mem_vm":"58267.000000 MB","time_cost":"7 seconds"}
I20221020 04:04:08.727674 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 AddTaskToPlan","mem_rss":"5941.000000 MB","mem_vm":"62876.000000 MB","time_cost":"4 seconds"}
I20221020 04:04:09.819131 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 InferMemShare","mem_rss":"5956.000000 MB","mem_vm":"63420.000000 MB","time_cost":"1 seconds"}
I20221020 04:04:11.345964 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 GenMemBlockAndChunk","mem_rss":"5963.000000 MB","mem_vm":"62908.000000 MB","time_cost":"1 seconds"}
I20221020 04:04:11.346954 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 LogPlan","mem_rss":"5963.000000 MB","mem_vm":"62908.000000 MB","time_cost":"0 seconds"}
I20221020 04:04:11.445214 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 GenRegisterHint","mem_rss":"5963.000000 MB","mem_vm":"62908.000000 MB","time_cost":"0 seconds"}
I20221020 04:04:11.448345 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 GenCollectiveBoxingPlan","mem_rss":"5963.000000 MB","mem_vm":"62908.000000 MB","time_cost":"0 seconds"}
I20221020 04:04:11.550089 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 DumpCtrlRegstInfoToPlan","mem_rss":"5963.000000 MB","mem_vm":"62908.000000 MB","time_cost":"0 seconds"}
I20221020 04:04:11.740911 2539781 plan_util.cpp:965] 
 Graph name GraphBase_0 in Rank: 0, Device: 0 needs to allocate [ 12763.8 MiB ] device memory. 
   In general, Chunk id: 1  memory is [ 5422.6 MiB ] with mem_block_num = 1
        Unreused memory not eager var is  [ 12.6044 MiB ] with mem_block_num = 9366
        Eager Variable Tensor total memory is [ 7328.55 MiB ] with mem_block_num = 2250
I20221020 04:04:11.742763 2539781 plan_util.cpp:965] 
 Graph name GraphBase_0 in Rank: 1, Device: 1 needs to allocate [ 12763.8 MiB ] device memory. 
   In general, Chunk id: 5  memory is [ 5422.6 MiB ] with mem_block_num = 1
        Unreused memory not eager var is  [ 12.6054 MiB ] with mem_block_num = 9372
        Eager Variable Tensor total memory is [ 7328.55 MiB ] with mem_block_num = 2250
I20221020 04:04:11.744565 2539781 plan_util.cpp:965] 
 Graph name GraphBase_0 in Rank: 2, Device: 2 needs to allocate [ 12763.8 MiB ] device memory. 
   In general, Chunk id: 0  memory is [ 5422.6 MiB ] with mem_block_num = 1
        Unreused memory not eager var is  [ 12.6054 MiB ] with mem_block_num = 9372
        Eager Variable Tensor total memory is [ 7328.55 MiB ] with mem_block_num = 2250
I20221020 04:04:11.746392 2539781 plan_util.cpp:965] 
 Graph name GraphBase_0 in Rank: 3, Device: 3 needs to allocate [ 12763.8 MiB ] device memory. 
   In general, Chunk id: 6  memory is [ 5422.6 MiB ] with mem_block_num = 1
        Unreused memory not eager var is  [ 12.6054 MiB ] with mem_block_num = 9372
        Eager Variable Tensor total memory is [ 7328.55 MiB ] with mem_block_num = 2250
I20221020 04:04:11.748205 2539781 plan_util.cpp:965] 
 Graph name GraphBase_0 in Rank: 4, Device: 4 needs to allocate [ 12763.8 MiB ] device memory. 
   In general, Chunk id: 2  memory is [ 5422.6 MiB ] with mem_block_num = 1
        Unreused memory not eager var is  [ 12.607 MiB ] with mem_block_num = 9379
        Eager Variable Tensor total memory is [ 7328.55 MiB ] with mem_block_num = 2250
I20221020 04:04:11.750018 2539781 plan_util.cpp:965] 
 Graph name GraphBase_0 in Rank: 5, Device: 5 needs to allocate [ 12763.8 MiB ] device memory. 
   In general, Chunk id: 7  memory is [ 5422.6 MiB ] with mem_block_num = 1
        Unreused memory not eager var is  [ 12.607 MiB ] with mem_block_num = 9379
        Eager Variable Tensor total memory is [ 7328.55 MiB ] with mem_block_num = 2250
I20221020 04:04:11.751811 2539781 plan_util.cpp:965] 
 Graph name GraphBase_0 in Rank: 6, Device: 6 needs to allocate [ 12763.8 MiB ] device memory. 
   In general, Chunk id: 4  memory is [ 5422.6 MiB ] with mem_block_num = 1
        Unreused memory not eager var is  [ 12.607 MiB ] with mem_block_num = 9379
        Eager Variable Tensor total memory is [ 7328.55 MiB ] with mem_block_num = 2250
I20221020 04:04:11.753628 2539781 plan_util.cpp:965] 
 Graph name GraphBase_0 in Rank: 7, Device: 7 needs to allocate [ 12763.8 MiB ] device memory. 
   In general, Chunk id: 3  memory is [ 5422.6 MiB ] with mem_block_num = 1
        Unreused memory not eager var is  [ 12.607 MiB ] with mem_block_num = 9379
        Eager Variable Tensor total memory is [ 7328.55 MiB ] with mem_block_num = 2250
I20221020 04:04:12.002398 2539781 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 GenMemAndLightPlanLog","mem_rss":"5963.000000 MB","mem_vm":"62908.000000 MB","time_cost":"0 seconds"}
I20221020 04:04:12.003510 2539781 time_util.h:95] [count log]{"loc":"[GraphCompile]GraphBase_0 CompilePlan","mem_rss":"5963.000000 MB","mem_vm":"62908.000000 MB","time_cost":"17 seconds"}
I20221020 04:04:14.765590 2539781 time_util.h:95] [count log]{"loc":"[GraphCompile]GraphBase_0 SyncPlan","mem_rss":"5966.000000 MB","mem_vm":"62908.000000 MB","time_cost":"2 seconds"}
F20221020 04:04:20.419255 2539781 memory_allocator.cpp:50] out of memory
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/memory/memory_allocator.cpp", line 50, in Allocate
    device->Alloc(options, &ptr, size)
Error Type: oneflow.ErrorProto.out_of_memory_error

@strint

xiezipeng-ML commented 2 years ago

config.py

from libai import evaluation
from libai.data.build import build_nlp_train_loader
from omegaconf import OmegaConf

from libai.config import LazyCall
from libai.evaluation import PPLEvaluator, evaluator
from libai.scheduler import WarmupExponentialLR

from configs.common.train import train
from configs.common.models.graph import graph

from projects.T5.configs.optim import optim
from projects.T5.configs.t5_model_config import cfg
from projects.T5.datasets.dataset import UnsuperviseT5Dataset, collate_fn
from projects.T5.models.t5_model import T5ForPreTraining

train_data_path = wudao_180g_bert_tokenized_512

micro_batch_size = 1
optim[lr] = 1e-5

# dataloader
dataloader = OmegaConf.create()
dataloader.train = LazyCall(build_nlp_train_loader)(
    dataset=[
        LazyCall(UnsuperviseT5Dataset)(
            data_path=train_data_path,
        )
    ],
    collate_fn=collate_fn(
        vocab_size=12900,
        max_seq_length=512,
        noise_density=0.15,
        mean_noise_span_length=3,
        eos_token_id=2,
        pad_token_id=0,
        decoder_start_token_id=1,
    ),
)

model = LazyCall(T5ForPreTraining)(cfg=cfg)

# model config
model.cfg.vocab_size = 12900
model.cfg.hidden_size = 1024
model.cfg.hidden_layers = 24
model.cfg.num_attention_heads = 44
model.cfg.head_size = 128
model.cfg.intermediate_size = 21504
model.cfg.hidden_dropout_prob = 0.1
model.cfg.attention_probs_dropout_prob = 0.1
model.cfg.embedding_dropout_prob = 0.1
model.cfg.layernorm_eps = 1e-6
model.cfg.model_type = mt5
model.cfg.pretrained_model_path = None

train.update(
    dict(
        output_dir=projectsT5outputmt5_output,
        train_micro_batch_size=micro_batch_size,
        train_epoch=10,
        log_period=100,
        num_accumulation_steps=1,
        amp=dict(enabled=True),
        warmup_ratio=0.01,
        checkpointer=dict(period=10000, max_to_keep=10),
        dist=dict(
            data_parallel_size=2,
            tensor_parallel_size=4,
            pipeline_parallel_size=1,
            pipeline_num_layers=2  model.cfg.hidden_layers,
        ),
        scheduler=LazyCall(WarmupExponentialLR)(
            warmup_factor=0.001,
            gamma=1.0,
            warmup_method=linear,
            warmup_iter=0.0,
        ),
        evaluation=dict(
            evaluator=LazyCall(PPLEvaluator)(),
            enabled=True,
            eval_iter=20,
            eval_period=5000,
        ),
    )
)

train.zero_optimization.enabled = True
train.zero_optimization.stage = 2
strint commented 2 years ago
I20221020 03:51:52.428287 2539781 ibverbs_comm_network.cpp:177] Connected to peer 6
I20221020 03:54:08.694559 2539781 version.cpp:22] OneFlow git version: 63a3b287

显示 build model 是2分16秒;

OptimizationLogicalGraph","mem_rss":"4700.000000 MB","mem_vm":"57623.000000 MB","time_cost":"513 seconds"}
CompleteJob","mem_rss":"4886.000000 MB","mem_vm":"57810.000000 MB","time_cost":"40 seconds"}
CompilePlan","mem_rss":"5963.000000 MB","mem_vm":"62908.000000 MB","time_cost":"17 seconds"}

OptimizationLogicalGraph 有点慢,8.6分钟,不正常

xiezipeng-ML commented 2 years ago

stable版本:

Log file created at: 2022/10/20 08:39:21
Running on machine: oneflow-28
Running duration (h:mm:ss): 0:00:00
Log line format: [IWEF]yyyymmdd hh:mm:ss.uuuuuu threadid file:line] msg
I20221020 08:39:21.138608 3290581 env_global_objects_scope.cpp:153] Using rpc backend: gRPC
I20221020 08:39:21.139333 3290581 ctrl_server.cpp:40] CtrlServer listening on 0.0.0.0:36855
I20221020 08:39:21.139430 3290581 rank_info_bootstrap_server.cpp:47] RankInfoBootstrapServer listening on 0.0.0.0:12345
I20221020 08:39:21.680302 3290581 epoll_comm_network.cpp:63] CommNet:Epoll listening on 0.0.0.0:32861
I20221020 08:39:30.955756 3290581 ibverbs_comm_network.cpp:144] Using IB device mlx5_0 port 1 gid index 0
I20221020 08:39:39.147533 3290581 ibverbs_comm_network.cpp:174] Connected to peer 7
I20221020 08:39:39.202556 3290581 ibverbs_comm_network.cpp:174] Connected to peer 2
I20221020 08:39:39.229348 3290581 ibverbs_comm_network.cpp:174] Connected to peer 1
I20221020 08:39:39.252177 3290581 ibverbs_comm_network.cpp:174] Connected to peer 3
I20221020 08:39:39.271689 3290581 ibverbs_comm_network.cpp:174] Connected to peer 4
I20221020 08:39:39.291220 3290581 ibverbs_comm_network.cpp:174] Connected to peer 5
I20221020 08:39:39.311214 3290581 ibverbs_comm_network.cpp:174] Connected to peer 6
I20221020 08:41:49.112869 3290581 version.cpp:22] OneFlow git version: a6d4cb80
I20221020 08:41:49.112905 3290581 cuda_device_manager_factory.cpp:63] CUDA runtime version: 11.2
I20221020 08:41:49.112916 3290581 cuda_device_manager_factory.cpp:72] cuDNN version: 8.1.1
I20221020 08:41:49.112921 3290581 cuda_device_manager_factory.cpp:85] NCCL version: 2.12.10
I20221020 08:42:08.198295 3290581 job_build_and_infer_ctx.cpp:1004] GraphBase_0 start compiling with pass pass_cnt_0-ModelUpdateConfCompatiblePass
I20221020 08:42:11.967969 3290581 job_build_and_infer_ctx.cpp:1018] GraphBase_0 finish compiling with pass pass_cnt_0-ModelUpdateConfCompatiblePass
I20221020 08:42:11.968024 3290581 job_build_and_infer_ctx.cpp:1004] GraphBase_0 start compiling with pass pass_cnt_1-AddInputOutputOpsPass
I20221020 08:42:15.739935 3290581 job_build_and_infer_ctx.cpp:1018] GraphBase_0 finish compiling with pass pass_cnt_1-AddInputOutputOpsPass
I20221020 08:42:15.739993 3290581 job_build_and_infer_ctx.cpp:1004] GraphBase_0 start compiling with pass pass_cnt_2-NormalizationExponentialAverageAutoTickPass
I20221020 08:42:15.740008 3290581 job_build_and_infer_ctx.cpp:1018] GraphBase_0 finish compiling with pass pass_cnt_2-NormalizationExponentialAverageAutoTickPass
I20221020 08:42:15.740015 3290581 job_build_and_infer_ctx.cpp:1004] GraphBase_0 start compiling with pass pass_cnt_3-AutoMixedPrecision
I20221020 08:42:19.544682 3290581 job_build_and_infer_ctx.cpp:1018] GraphBase_0 finish compiling with pass pass_cnt_3-AutoMixedPrecision
I20221020 08:42:19.544732 3290581 job_build_and_infer_ctx.cpp:1004] GraphBase_0 start compiling with pass pass_cnt_4-PruneAmpWhiteIdentityOpPass
I20221020 08:42:24.313907 3290581 job_build_and_infer_ctx.cpp:1018] GraphBase_0 finish compiling with pass pass_cnt_4-PruneAmpWhiteIdentityOpPass
I20221020 08:42:24.313971 3290581 job_build_and_infer_ctx.cpp:1004] GraphBase_0 start compiling with pass pass_cnt_5-OptimizerPlacementOptimizationPass
I20221020 08:42:28.994552 3290581 job_build_and_infer_ctx.cpp:1018] GraphBase_0 finish compiling with pass pass_cnt_5-OptimizerPlacementOptimizationPass
I20221020 08:42:28.994611 3290581 job_build_and_infer_ctx.cpp:1004] GraphBase_0 start compiling with pass pass_cnt_6-DynamicLossScaleSchedulePass
I20221020 08:42:33.355998 3290581 job_build_and_infer_ctx.cpp:1018] GraphBase_0 finish compiling with pass pass_cnt_6-DynamicLossScaleSchedulePass
I20221020 08:42:33.356051 3290581 job_build_and_infer_ctx.cpp:1004] GraphBase_0 start compiling with pass pass_cnt_7-AutoTrainStep
I20221020 08:42:37.727663 3290581 job_build_and_infer_ctx.cpp:1018] GraphBase_0 finish compiling with pass pass_cnt_7-AutoTrainStep
I20221020 08:42:37.727715 3290581 job_build_and_infer_ctx.cpp:1004] GraphBase_0 start compiling with pass pass_cnt_8-AutoLearningRate
I20221020 08:42:42.108742 3290581 job_build_and_infer_ctx.cpp:1018] GraphBase_0 finish compiling with pass pass_cnt_8-AutoLearningRate
I20221020 08:42:42.108819 3290581 job_build_and_infer_ctx.cpp:1004] GraphBase_0 start compiling with pass pass_cnt_9-QuantAwareTraining
I20221020 08:42:42.108829 3290581 job_build_and_infer_ctx.cpp:1018] GraphBase_0 finish compiling with pass pass_cnt_9-QuantAwareTraining
I20221020 08:42:42.108832 3290581 job_build_and_infer_ctx.cpp:1004] GraphBase_0 start compiling with pass pass_cnt_10-IRRoundTripBeforeAD
I20221020 08:42:42.108870 3290581 job_build_and_infer_ctx.cpp:1018] GraphBase_0 finish compiling with pass pass_cnt_10-IRRoundTripBeforeAD
I20221020 08:42:42.108875 3290581 job_build_and_infer_ctx.cpp:1004] GraphBase_0 start compiling with pass pass_cnt_11-GenerateBackwardAndOptimizerOpConfs
I20221020 08:42:47.128196 3290581 job_build_and_infer_ctx.cpp:1018] GraphBase_0 finish compiling with pass pass_cnt_11-GenerateBackwardAndOptimizerOpConfs
I20221020 08:42:47.128262 3290581 job_build_and_infer_ctx.cpp:1004] GraphBase_0 start compiling with pass pass_cnt_12-ReplaceEmbeddingOps
I20221020 08:42:58.899493 3290581 job_build_and_infer_ctx.cpp:1018] GraphBase_0 finish compiling with pass pass_cnt_12-ReplaceEmbeddingOps
I20221020 08:42:58.899564 3290581 job_build_and_infer_ctx.cpp:1004] GraphBase_0 start compiling with pass pass_cnt_13-AddSspVariableProxy
I20221020 08:42:58.899581 3290581 job_build_and_infer_ctx.cpp:1018] GraphBase_0 finish compiling with pass pass_cnt_13-AddSspVariableProxy
I20221020 08:42:58.899585 3290581 job_build_and_infer_ctx.cpp:1004] GraphBase_0 start compiling with pass pass_cnt_14-CheckpointingPass
I20221020 08:43:10.734640 3290581 job_build_and_infer_ctx.cpp:1018] GraphBase_0 finish compiling with pass pass_cnt_14-CheckpointingPass
I20221020 08:43:10.734710 3290581 job_build_and_infer_ctx.cpp:1004] GraphBase_0 start compiling with pass pass_cnt_15-CudnnFusedNormalizationAddReluPass
I20221020 08:43:22.521386 3290581 job_build_and_infer_ctx.cpp:1018] GraphBase_0 finish compiling with pass pass_cnt_15-CudnnFusedNormalizationAddReluPass
I20221020 08:43:22.521446 3290581 job_build_and_infer_ctx.cpp:1004] GraphBase_0 start compiling with pass pass_cnt_16-PruneCastToStaticShapeOpsPass
I20221020 08:43:34.574124 3290581 job_build_and_infer_ctx.cpp:1018] GraphBase_0 finish compiling with pass pass_cnt_16-PruneCastToStaticShapeOpsPass
I20221020 08:43:34.574187 3290581 job_build_and_infer_ctx.cpp:1004] GraphBase_0 start compiling with pass pass_cnt_17-IRRoundTrip
I20221020 08:43:34.574234 3290581 job_build_and_infer_ctx.cpp:1018] GraphBase_0 finish compiling with pass pass_cnt_17-IRRoundTrip
I20221020 08:43:34.574241 3290581 job_build_and_infer_ctx.cpp:1004] GraphBase_0 start compiling with pass pass_cnt_18-FuseAddToOutputPass
I20221020 08:43:46.127857 3290581 job_build_and_infer_ctx.cpp:1018] GraphBase_0 finish compiling with pass pass_cnt_18-FuseAddToOutputPass
I20221020 08:43:46.127916 3290581 job_build_and_infer_ctx.cpp:1004] GraphBase_0 start compiling with pass pass_cnt_19-FuseAddToOutputPass1
I20221020 08:43:58.234127 3290581 job_build_and_infer_ctx.cpp:1018] GraphBase_0 finish compiling with pass pass_cnt_19-FuseAddToOutputPass1
I20221020 08:43:58.234194 3290581 job_build_and_infer_ctx.cpp:1004] GraphBase_0 start compiling with pass pass_cnt_20-IndexedSlicesOptimizerRewritePass
I20221020 08:43:58.234222 3290581 job_build_and_infer_ctx.cpp:1018] GraphBase_0 finish compiling with pass pass_cnt_20-IndexedSlicesOptimizerRewritePass
I20221020 08:43:58.234227 3290581 job_build_and_infer_ctx.cpp:1004] GraphBase_0 start compiling with pass pass_cnt_21-SplitSparseSoftmaxCrossEntropyOpPass
I20221020 08:44:10.234905 3290581 job_build_and_infer_ctx.cpp:1018] GraphBase_0 finish compiling with pass pass_cnt_21-SplitSparseSoftmaxCrossEntropyOpPass
I20221020 08:44:10.234980 3290581 job_build_and_infer_ctx.cpp:1004] GraphBase_0 start compiling with pass pass_cnt_22-DoParallelCastBeforeWideningTypeCast
I20221020 08:44:22.114862 3290581 job_build_and_infer_ctx.cpp:1018] GraphBase_0 finish compiling with pass pass_cnt_22-DoParallelCastBeforeWideningTypeCast
I20221020 08:44:22.114944 3290581 job_build_and_infer_ctx.cpp:1004] GraphBase_0 start compiling with pass pass_cnt_23-AddLbiDiffWatcherOpConfs
I20221020 08:44:22.132622 3290581 job_build_and_infer_ctx.cpp:1018] GraphBase_0 finish compiling with pass pass_cnt_23-AddLbiDiffWatcherOpConfs
I20221020 08:44:22.132663 3290581 job_build_and_infer_ctx.cpp:1004] GraphBase_0 start compiling with pass pass_cnt_24-FuseCastScalePass
I20221020 08:44:34.025393 3290581 job_build_and_infer_ctx.cpp:1018] GraphBase_0 finish compiling with pass pass_cnt_24-FuseCastScalePass
I20221020 08:44:34.025456 3290581 job_build_and_infer_ctx.cpp:1004] GraphBase_0 start compiling with pass pass_cnt_25-PruneParallelCastOpsPass
I20221020 08:44:45.645437 3290581 job_build_and_infer_ctx.cpp:1018] GraphBase_0 finish compiling with pass pass_cnt_25-PruneParallelCastOpsPass
I20221020 08:44:45.645503 3290581 job_build_and_infer_ctx.cpp:1004] GraphBase_0 start compiling with pass pass_cnt_26-FuseUpdateOpsPass
I20221020 08:44:55.144882 3290581 job_build_and_infer_ctx.cpp:1018] GraphBase_0 finish compiling with pass pass_cnt_26-FuseUpdateOpsPass
I20221020 08:44:55.144946 3290581 job_build_and_infer_ctx.cpp:1004] GraphBase_0 start compiling with pass pass_cnt_27-FuseModelUpdateCastOpsPass
I20221020 08:44:55.144963 3290581 job_build_and_infer_ctx.cpp:1018] GraphBase_0 finish compiling with pass pass_cnt_27-FuseModelUpdateCastOpsPass
I20221020 08:44:55.144968 3290581 job_build_and_infer_ctx.cpp:1004] GraphBase_0 start compiling with pass pass_cnt_28-MultiTensorModelUpdatePass
I20221020 08:44:55.144973 3290581 job_build_and_infer_ctx.cpp:1018] GraphBase_0 finish compiling with pass pass_cnt_28-MultiTensorModelUpdatePass
I20221020 08:44:55.144977 3290581 job_build_and_infer_ctx.cpp:1004] GraphBase_0 start compiling with pass pass_cnt_29-FixPipelineStageIdPass
I20221020 08:45:02.465996 3290581 job_build_and_infer_ctx.cpp:1018] GraphBase_0 finish compiling with pass pass_cnt_29-FixPipelineStageIdPass
I20221020 08:45:02.466066 3290581 job_build_and_infer_ctx.cpp:1004] GraphBase_0 start compiling with pass pass_cnt_30-PipelineBufferPass
I20221020 08:45:09.785943 3290581 job_build_and_infer_ctx.cpp:1018] GraphBase_0 finish compiling with pass pass_cnt_30-PipelineBufferPass
I20221020 08:45:09.786007 3290581 job_build_and_infer_ctx.cpp:1004] GraphBase_0 start compiling with pass pass_cnt_31-DumpVariableInfoPass
I20221020 08:45:09.786026 3290581 job_build_and_infer_ctx.cpp:1018] GraphBase_0 finish compiling with pass pass_cnt_31-DumpVariableInfoPass
I20221020 08:45:09.786031 3290581 job_build_and_infer_ctx.cpp:1004] GraphBase_0 start compiling with pass pass_cnt_32-DumpBlobParallelConfPass
I20221020 08:45:17.184469 3290581 job_build_and_infer_ctx.cpp:1018] GraphBase_0 finish compiling with pass pass_cnt_32-DumpBlobParallelConfPass
I20221020 08:47:05.943668 3290581 nn_graph.cpp:288] Graph name: GraphBase_0 compile time: 69.6766 seconds.
I20221020 08:47:06.283527 3290581 plan_util.cpp:963]  Graph name GraphBase_0 in Rank: 0, Device: 0 needs to allocate [ 11601.3 MiB ] device memory. 
 In general, Chunk id: 1  memory is [ 4258.68 MiB ]; 
 Memory out of Chunk is  [ 7342.65 MiB ]; and in particular: Eager Variable Tensor total memory is [ 7330.05 MiB ].
I20221020 08:47:06.285466 3290581 plan_util.cpp:963]  Graph name GraphBase_0 in Rank: 1, Device: 1 needs to allocate [ 11601.3 MiB ] device memory. 
 In general, Chunk id: 2  memory is [ 4258.68 MiB ]; 
 Memory out of Chunk is  [ 7342.65 MiB ]; and in particular: Eager Variable Tensor total memory is [ 7330.05 MiB ].
I20221020 08:47:06.287353 3290581 plan_util.cpp:963]  Graph name GraphBase_0 in Rank: 2, Device: 2 needs to allocate [ 11601.3 MiB ] device memory. 
 In general, Chunk id: 3  memory is [ 4258.68 MiB ]; 
 Memory out of Chunk is  [ 7342.65 MiB ]; and in particular: Eager Variable Tensor total memory is [ 7330.05 MiB ].
I20221020 08:47:06.289142 3290581 plan_util.cpp:963]  Graph name GraphBase_0 in Rank: 3, Device: 3 needs to allocate [ 11601.3 MiB ] device memory. 
 In general, Chunk id: 5  memory is [ 4258.68 MiB ]; 
 Memory out of Chunk is  [ 7342.65 MiB ]; and in particular: Eager Variable Tensor total memory is [ 7330.05 MiB ].
I20221020 08:47:06.290988 3290581 plan_util.cpp:963]  Graph name GraphBase_0 in Rank: 4, Device: 4 needs to allocate [ 11601.3 MiB ] device memory. 
 In general, Chunk id: 6  memory is [ 4258.68 MiB ]; 
 Memory out of Chunk is  [ 7342.65 MiB ]; and in particular: Eager Variable Tensor total memory is [ 7330.05 MiB ].
I20221020 08:47:06.292757 3290581 plan_util.cpp:963]  Graph name GraphBase_0 in Rank: 5, Device: 5 needs to allocate [ 11601.3 MiB ] device memory. 
 In general, Chunk id: 7  memory is [ 4258.68 MiB ]; 
 Memory out of Chunk is  [ 7342.65 MiB ]; and in particular: Eager Variable Tensor total memory is [ 7330.05 MiB ].
I20221020 08:47:06.294561 3290581 plan_util.cpp:963]  Graph name GraphBase_0 in Rank: 6, Device: 6 needs to allocate [ 11601.3 MiB ] device memory. 
 In general, Chunk id: 4  memory is [ 4258.68 MiB ]; 
 Memory out of Chunk is  [ 7342.65 MiB ]; and in particular: Eager Variable Tensor total memory is [ 7330.05 MiB ].
I20221020 08:47:06.296332 3290581 plan_util.cpp:963]  Graph name GraphBase_0 in Rank: 7, Device: 7 needs to allocate [ 11601.3 MiB ] device memory. 
 In general, Chunk id: 0  memory is [ 4258.68 MiB ]; 
 Memory out of Chunk is  [ 7342.65 MiB ]; and in particular: Eager Variable Tensor total memory is [ 7330.05 MiB ].
F20221020 08:47:15.057462 3290581 memory_allocator.cpp:59] 
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/memory/memory_allocator.cpp", line 59, in Allocate
    device->Alloc(options, &ptr, size)
out of memory
Error Type: oneflow.ErrorProto.runtime_error

08:42:08~08:45:17

xiezipeng-ML commented 2 years ago

nightly版本

Log file created at: 2022/10/20 09:25:17
Running on machine: oneflow-28
Running duration (h:mm:ss): 0:00:00
Log line format: [IWEF]yyyymmdd hh:mm:ss.uuuuuu threadid file:line] msg
I20221020 09:25:17.249008 3303039 env_global_objects_scope.cpp:167] Using rpc backend: gRPC
I20221020 09:25:17.249729 3303039 ctrl_server.cpp:40] CtrlServer listening on 0.0.0.0:34193
I20221020 09:25:17.249859 3303039 rank_info_bootstrap_server.cpp:47] RankInfoBootstrapServer listening on 0.0.0.0:12345
I20221020 09:25:17.822063 3303039 epoll_comm_network.cpp:63] CommNet:Epoll listening on 0.0.0.0:45855
I20221020 09:25:27.920454 3303039 ibverbs_comm_network.cpp:147] Using IB device mlx5_0 port 1 gid index 0
I20221020 09:25:35.538646 3303039 ibverbs_comm_network.cpp:177] Connected to peer 7
I20221020 09:25:35.595392 3303039 ibverbs_comm_network.cpp:177] Connected to peer 2
I20221020 09:25:35.634617 3303039 ibverbs_comm_network.cpp:177] Connected to peer 1
I20221020 09:25:35.657667 3303039 ibverbs_comm_network.cpp:177] Connected to peer 3
I20221020 09:25:35.678911 3303039 ibverbs_comm_network.cpp:177] Connected to peer 4
I20221020 09:25:35.698951 3303039 ibverbs_comm_network.cpp:177] Connected to peer 5
I20221020 09:25:35.718647 3303039 ibverbs_comm_network.cpp:177] Connected to peer 6
I20221020 09:27:49.066449 3303039 version.cpp:22] OneFlow git version: f97f09f1
I20221020 09:27:49.066488 3303039 cuda_device_manager_factory.cpp:63] CUDA runtime version: 11.2
I20221020 09:27:49.066498 3303039 cuda_device_manager_factory.cpp:72] cuDNN version: 8.1.1
I20221020 09:27:49.066504 3303039 cuda_device_manager_factory.cpp:85] NCCL version: 2.12.10
I20221020 09:27:49.885689 3303039 ibverbs_comm_network.cpp:147] Using IB device mlx5_0 port 1 gid index 0
I20221020 09:27:50.549619 3303039 ibverbs_comm_network.cpp:177] Connected to peer 7
I20221020 09:27:50.550415 3303039 ibverbs_comm_network.cpp:177] Connected to peer 2
I20221020 09:27:50.551158 3303039 ibverbs_comm_network.cpp:177] Connected to peer 1
I20221020 09:27:50.551666 3303039 ibverbs_comm_network.cpp:177] Connected to peer 3
I20221020 09:27:50.555306 3303039 ibverbs_comm_network.cpp:177] Connected to peer 4
I20221020 09:27:50.561012 3303039 ibverbs_comm_network.cpp:177] Connected to peer 5
I20221020 09:27:50.561810 3303039 ibverbs_comm_network.cpp:177] Connected to peer 6
I20221020 09:28:15.111443 3303039 job_build_and_infer_ctx.cpp:940] GraphBase_0 start compiling with pass pass_cnt_0-InsertPinnedIdentityOpPass
I20221020 09:28:23.583436 3303039 job_build_and_infer_ctx.cpp:954] GraphBase_0 finish compiling with pass pass_cnt_0-InsertPinnedIdentityOpPass
I20221020 09:28:23.583492 3303039 job_build_and_infer_ctx.cpp:940] GraphBase_0 start compiling with pass pass_cnt_1-EliminateDeadNodesPass
I20221020 09:28:32.612527 3303039 job_build_and_infer_ctx.cpp:954] GraphBase_0 finish compiling with pass pass_cnt_1-EliminateDeadNodesPass
I20221020 09:28:32.612625 3303039 job_build_and_infer_ctx.cpp:940] GraphBase_0 start compiling with pass pass_cnt_2-NormalizationExponentialAverageAutoTickPass
I20221020 09:28:32.612653 3303039 job_build_and_infer_ctx.cpp:954] GraphBase_0 finish compiling with pass pass_cnt_2-NormalizationExponentialAverageAutoTickPass
I20221020 09:28:32.612660 3303039 job_build_and_infer_ctx.cpp:940] GraphBase_0 start compiling with pass pass_cnt_3-AutoMixedPrecision
I20221020 09:28:41.563009 3303039 job_build_and_infer_ctx.cpp:954] GraphBase_0 finish compiling with pass pass_cnt_3-AutoMixedPrecision
I20221020 09:28:41.563064 3303039 job_build_and_infer_ctx.cpp:940] GraphBase_0 start compiling with pass pass_cnt_4-PruneAmpWhiteIdentityOpPass
I20221020 09:28:51.392601 3303039 job_build_and_infer_ctx.cpp:954] GraphBase_0 finish compiling with pass pass_cnt_4-PruneAmpWhiteIdentityOpPass
I20221020 09:28:51.392649 3303039 job_build_and_infer_ctx.cpp:940] GraphBase_0 start compiling with pass pass_cnt_5-OptimizerPlacementOptimizationPass
I20221020 09:29:01.120366 3303039 job_build_and_infer_ctx.cpp:954] GraphBase_0 finish compiling with pass pass_cnt_5-OptimizerPlacementOptimizationPass
I20221020 09:29:01.120432 3303039 job_build_and_infer_ctx.cpp:940] GraphBase_0 start compiling with pass pass_cnt_6-FuseAddToOutputPass
I20221020 09:29:10.673734 3303039 job_build_and_infer_ctx.cpp:954] GraphBase_0 finish compiling with pass pass_cnt_6-FuseAddToOutputPass
I20221020 09:29:10.673787 3303039 job_build_and_infer_ctx.cpp:940] GraphBase_0 start compiling with pass pass_cnt_7-IRRoundTripBeforeAD
I20221020 09:29:10.673853 3303039 job_build_and_infer_ctx.cpp:954] GraphBase_0 finish compiling with pass pass_cnt_7-IRRoundTripBeforeAD
I20221020 09:29:10.673859 3303039 job_build_and_infer_ctx.cpp:940] GraphBase_0 start compiling with pass pass_cnt_8-DynamicLossScaleSchedulePass
I20221020 09:29:20.350908 3303039 job_build_and_infer_ctx.cpp:954] GraphBase_0 finish compiling with pass pass_cnt_8-DynamicLossScaleSchedulePass
I20221020 09:29:20.350980 3303039 job_build_and_infer_ctx.cpp:940] GraphBase_0 start compiling with pass pass_cnt_9-AutoTrainStep
I20221020 09:29:30.088905 3303039 job_build_and_infer_ctx.cpp:954] GraphBase_0 finish compiling with pass pass_cnt_9-AutoTrainStep
I20221020 09:29:30.088970 3303039 job_build_and_infer_ctx.cpp:940] GraphBase_0 start compiling with pass pass_cnt_10-AutoLearningRate
I20221020 09:29:39.770726 3303039 job_build_and_infer_ctx.cpp:954] GraphBase_0 finish compiling with pass pass_cnt_10-AutoLearningRate
I20221020 09:29:39.770792 3303039 job_build_and_infer_ctx.cpp:940] GraphBase_0 start compiling with pass pass_cnt_11-QuantAwareTraining
I20221020 09:29:39.770838 3303039 job_build_and_infer_ctx.cpp:954] GraphBase_0 finish compiling with pass pass_cnt_11-QuantAwareTraining
I20221020 09:29:39.770845 3303039 job_build_and_infer_ctx.cpp:940] GraphBase_0 start compiling with pass pass_cnt_12-GenerateOptimizerOpConfs
I20221020 09:29:49.823612 3303039 job_build_and_infer_ctx.cpp:954] GraphBase_0 finish compiling with pass pass_cnt_12-GenerateOptimizerOpConfs
I20221020 09:29:49.823676 3303039 job_build_and_infer_ctx.cpp:940] GraphBase_0 start compiling with pass pass_cnt_13-PrunePinnedIdentityOpPass
I20221020 09:32:09.988937 3303039 job_build_and_infer_ctx.cpp:954] GraphBase_0 finish compiling with pass pass_cnt_13-PrunePinnedIdentityOpPass
I20221020 09:32:09.989015 3303039 job_build_and_infer_ctx.cpp:940] GraphBase_0 start compiling with pass pass_cnt_14-ReplaceEmbeddingOps
I20221020 09:32:25.541950 3303039 job_build_and_infer_ctx.cpp:954] GraphBase_0 finish compiling with pass pass_cnt_14-ReplaceEmbeddingOps
I20221020 09:32:25.542016 3303039 job_build_and_infer_ctx.cpp:940] GraphBase_0 start compiling with pass pass_cnt_15-SequentialOneEmbeddingOpsPass
I20221020 09:32:25.542028 3303039 job_build_and_infer_ctx.cpp:954] GraphBase_0 finish compiling with pass pass_cnt_15-SequentialOneEmbeddingOpsPass
I20221020 09:32:25.542032 3303039 job_build_and_infer_ctx.cpp:940] GraphBase_0 start compiling with pass pass_cnt_16-FuseEmbeddingShuffleInteractionPass
I20221020 09:32:25.542038 3303039 job_build_and_infer_ctx.cpp:954] GraphBase_0 finish compiling with pass pass_cnt_16-FuseEmbeddingShuffleInteractionPass
I20221020 09:32:25.542042 3303039 job_build_and_infer_ctx.cpp:940] GraphBase_0 start compiling with pass pass_cnt_17-FuseBCEReduceMeanFwBwPass
I20221020 09:32:25.542048 3303039 job_build_and_infer_ctx.cpp:954] GraphBase_0 finish compiling with pass pass_cnt_17-FuseBCEReduceMeanFwBwPass
I20221020 09:32:25.542053 3303039 job_build_and_infer_ctx.cpp:940] GraphBase_0 start compiling with pass pass_cnt_18-AddSspVariableProxy
I20221020 09:32:25.542061 3303039 job_build_and_infer_ctx.cpp:954] GraphBase_0 finish compiling with pass pass_cnt_18-AddSspVariableProxy
I20221020 09:32:25.542068 3303039 job_build_and_infer_ctx.cpp:940] GraphBase_0 start compiling with pass pass_cnt_19-CheckpointingPass
I20221020 09:32:41.098794 3303039 job_build_and_infer_ctx.cpp:954] GraphBase_0 finish compiling with pass pass_cnt_19-CheckpointingPass
I20221020 09:32:41.098851 3303039 job_build_and_infer_ctx.cpp:940] GraphBase_0 start compiling with pass pass_cnt_20-CudnnFusedNormalizationAddReluPass
I20221020 09:32:41.103590 3303039 job_build_and_infer_ctx.cpp:954] GraphBase_0 finish compiling with pass pass_cnt_20-CudnnFusedNormalizationAddReluPass
I20221020 09:32:41.103621 3303039 job_build_and_infer_ctx.cpp:940] GraphBase_0 start compiling with pass pass_cnt_21-PruneCastToStaticShapeOpsPass
I20221020 09:32:56.775683 3303039 job_build_and_infer_ctx.cpp:954] GraphBase_0 finish compiling with pass pass_cnt_21-PruneCastToStaticShapeOpsPass
I20221020 09:32:56.775751 3303039 job_build_and_infer_ctx.cpp:940] GraphBase_0 start compiling with pass pass_cnt_22-IRRoundTrip
I20221020 09:32:56.775790 3303039 job_build_and_infer_ctx.cpp:954] GraphBase_0 finish compiling with pass pass_cnt_22-IRRoundTrip
I20221020 09:32:56.775796 3303039 job_build_and_infer_ctx.cpp:940] GraphBase_0 start compiling with pass pass_cnt_23-FuseAddToOutputPass1
I20221020 09:33:12.057500 3303039 job_build_and_infer_ctx.cpp:954] GraphBase_0 finish compiling with pass pass_cnt_23-FuseAddToOutputPass1
I20221020 09:33:12.057564 3303039 job_build_and_infer_ctx.cpp:940] GraphBase_0 start compiling with pass pass_cnt_24-FuseConsecutiveAddPass
I20221020 09:34:42.774291 3303039 job_build_and_infer_ctx.cpp:954] GraphBase_0 finish compiling with pass pass_cnt_24-FuseConsecutiveAddPass
I20221020 09:34:42.774354 3303039 job_build_and_infer_ctx.cpp:940] GraphBase_0 start compiling with pass pass_cnt_25-IndexedSlicesOptimizerRewritePass
I20221020 09:34:42.774366 3303039 job_build_and_infer_ctx.cpp:954] GraphBase_0 finish compiling with pass pass_cnt_25-IndexedSlicesOptimizerRewritePass
I20221020 09:34:42.774370 3303039 job_build_and_infer_ctx.cpp:940] GraphBase_0 start compiling with pass pass_cnt_26-SplitSparseSoftmaxCrossEntropyOpPass
I20221020 09:34:57.934859 3303039 job_build_and_infer_ctx.cpp:954] GraphBase_0 finish compiling with pass pass_cnt_26-SplitSparseSoftmaxCrossEntropyOpPass
I20221020 09:34:57.934917 3303039 job_build_and_infer_ctx.cpp:940] GraphBase_0 start compiling with pass pass_cnt_27-DoParallelCastBeforeWideningTypeCast
I20221020 09:35:12.938472 3303039 job_build_and_infer_ctx.cpp:954] GraphBase_0 finish compiling with pass pass_cnt_27-DoParallelCastBeforeWideningTypeCast
I20221020 09:35:12.938534 3303039 job_build_and_infer_ctx.cpp:940] GraphBase_0 start compiling with pass pass_cnt_28-FuseCastScalePass
I20221020 09:35:28.044111 3303039 job_build_and_infer_ctx.cpp:954] GraphBase_0 finish compiling with pass pass_cnt_28-FuseCastScalePass
I20221020 09:35:28.044173 3303039 job_build_and_infer_ctx.cpp:940] GraphBase_0 start compiling with pass pass_cnt_29-PruneParallelCastOpsPass
I20221020 09:35:42.752841 3303039 job_build_and_infer_ctx.cpp:954] GraphBase_0 finish compiling with pass pass_cnt_29-PruneParallelCastOpsPass
I20221020 09:35:42.752914 3303039 job_build_and_infer_ctx.cpp:940] GraphBase_0 start compiling with pass pass_cnt_30-FuseUpdateOpsPass
I20221020 09:35:54.633841 3303039 job_build_and_infer_ctx.cpp:954] GraphBase_0 finish compiling with pass pass_cnt_30-FuseUpdateOpsPass
I20221020 09:35:54.633901 3303039 job_build_and_infer_ctx.cpp:940] GraphBase_0 start compiling with pass pass_cnt_31-FuseModelUpdateCastOpsPass
I20221020 09:35:54.633915 3303039 job_build_and_infer_ctx.cpp:954] GraphBase_0 finish compiling with pass pass_cnt_31-FuseModelUpdateCastOpsPass
I20221020 09:35:54.633920 3303039 job_build_and_infer_ctx.cpp:940] GraphBase_0 start compiling with pass pass_cnt_32-MultiTensorModelUpdatePass
I20221020 09:35:54.633926 3303039 job_build_and_infer_ctx.cpp:954] GraphBase_0 finish compiling with pass pass_cnt_32-MultiTensorModelUpdatePass
I20221020 09:35:54.633930 3303039 job_build_and_infer_ctx.cpp:940] GraphBase_0 start compiling with pass pass_cnt_33-FixPipelineStageIdPass
I20221020 09:36:04.078356 3303039 job_build_and_infer_ctx.cpp:954] GraphBase_0 finish compiling with pass pass_cnt_33-FixPipelineStageIdPass
I20221020 09:36:04.078423 3303039 job_build_and_infer_ctx.cpp:940] GraphBase_0 start compiling with pass pass_cnt_34-PipelineBufferPass
I20221020 09:36:13.516598 3303039 job_build_and_infer_ctx.cpp:954] GraphBase_0 finish compiling with pass pass_cnt_34-PipelineBufferPass
I20221020 09:36:13.516666 3303039 job_build_and_infer_ctx.cpp:940] GraphBase_0 start compiling with pass pass_cnt_35-AutoParallelPass
I20221020 09:36:13.516681 3303039 job_build_and_infer_ctx.cpp:954] GraphBase_0 finish compiling with pass pass_cnt_35-AutoParallelPass
I20221020 09:36:13.516686 3303039 job_build_and_infer_ctx.cpp:940] GraphBase_0 start compiling with pass pass_cnt_36-DumpVariableInfoPass
I20221020 09:36:13.516695 3303039 job_build_and_infer_ctx.cpp:954] GraphBase_0 finish compiling with pass pass_cnt_36-DumpVariableInfoPass
I20221020 09:36:13.516698 3303039 job_build_and_infer_ctx.cpp:940] GraphBase_0 start compiling with pass pass_cnt_37-DumpBlobParallelConfPass
I20221020 09:36:23.038504 3303039 job_build_and_infer_ctx.cpp:954] GraphBase_0 finish compiling with pass pass_cnt_37-DumpBlobParallelConfPass
I20221020 09:37:14.975693 3303039 time_util.h:93] [count log]{"loc":"Graph name: GraphBase_0 compile plan","mem_rss":"5700.000000 MB","mem_vm":"62709.000000 MB","time_cost":"14 seconds"}
I20221020 09:37:15.287984 3303039 plan_util.cpp:965] 
 Graph name GraphBase_0 in Rank: 0, Device: 0 needs to allocate [ 11852.7 MiB ] device memory. 
   In general, Chunk id: 2  memory is [ 4510.02 MiB ] with mem_block_num = 1
        Unreused memory not eager var is  [ 12.6049 MiB ] with mem_block_num = 9405
        Eager Variable Tensor total memory is [ 7330.05 MiB ] with mem_block_num = 2616
I20221020 09:37:15.289862 3303039 plan_util.cpp:965] 
 Graph name GraphBase_0 in Rank: 1, Device: 1 needs to allocate [ 11852.7 MiB ] device memory. 
   In general, Chunk id: 4  memory is [ 4510.02 MiB ] with mem_block_num = 1
        Unreused memory not eager var is  [ 12.606 MiB ] with mem_block_num = 9411
        Eager Variable Tensor total memory is [ 7330.05 MiB ] with mem_block_num = 2616
I20221020 09:37:15.291712 3303039 plan_util.cpp:965] 
 Graph name GraphBase_0 in Rank: 2, Device: 2 needs to allocate [ 11854.9 MiB ] device memory. 
   In general, Chunk id: 3  memory is [ 4512.23 MiB ] with mem_block_num = 1
        Unreused memory not eager var is  [ 12.606 MiB ] with mem_block_num = 9411
        Eager Variable Tensor total memory is [ 7330.05 MiB ] with mem_block_num = 2616
I20221020 09:37:15.293516 3303039 plan_util.cpp:965] 
 Graph name GraphBase_0 in Rank: 3, Device: 3 needs to allocate [ 11852.7 MiB ] device memory. 
   In general, Chunk id: 1  memory is [ 4510.02 MiB ] with mem_block_num = 1
        Unreused memory not eager var is  [ 12.606 MiB ] with mem_block_num = 9411
        Eager Variable Tensor total memory is [ 7330.05 MiB ] with mem_block_num = 2616
I20221020 09:37:15.295313 3303039 plan_util.cpp:965] 
 Graph name GraphBase_0 in Rank: 4, Device: 4 needs to allocate [ 11852.7 MiB ] device memory. 
   In general, Chunk id: 7  memory is [ 4510.02 MiB ] with mem_block_num = 1
        Unreused memory not eager var is  [ 12.6075 MiB ] with mem_block_num = 9418
        Eager Variable Tensor total memory is [ 7330.05 MiB ] with mem_block_num = 2616
I20221020 09:37:15.297089 3303039 plan_util.cpp:965] 
 Graph name GraphBase_0 in Rank: 5, Device: 5 needs to allocate [ 11854.9 MiB ] device memory. 
   In general, Chunk id: 5  memory is [ 4512.23 MiB ] with mem_block_num = 1
        Unreused memory not eager var is  [ 12.6075 MiB ] with mem_block_num = 9418
        Eager Variable Tensor total memory is [ 7330.05 MiB ] with mem_block_num = 2616
I20221020 09:37:15.298871 3303039 plan_util.cpp:965] 
 Graph name GraphBase_0 in Rank: 6, Device: 6 needs to allocate [ 11852.7 MiB ] device memory. 
   In general, Chunk id: 6  memory is [ 4510.02 MiB ] with mem_block_num = 1
        Unreused memory not eager var is  [ 12.6075 MiB ] with mem_block_num = 9418
        Eager Variable Tensor total memory is [ 7330.05 MiB ] with mem_block_num = 2616
I20221020 09:37:15.300642 3303039 plan_util.cpp:965] 
 Graph name GraphBase_0 in Rank: 7, Device: 7 needs to allocate [ 11852.7 MiB ] device memory. 
   In general, Chunk id: 0  memory is [ 4510.02 MiB ] with mem_block_num = 1
        Unreused memory not eager var is  [ 12.6075 MiB ] with mem_block_num = 9418
        Eager Variable Tensor total memory is [ 7330.05 MiB ] with mem_block_num = 2616
F20221020 09:37:23.271004 3303039 memory_allocator.cpp:50] out of memory
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/memory/memory_allocator.cpp", line 50, in Allocate
    device->Alloc(options, &ptr, size)
Error Type: oneflow.ErrorProto.out_of_memory_error

09:28:15~09:36:23

strint commented 2 years ago

结论, pass 部分的速度相比 0.8 的确慢了不少,处理办法参见 tips 部分

pass name 0.8 (秒) nightly (秒) tips
total 189 488
InsertPinnedIdentityOpPass none 8
EliminateDeadNodesPass none 9
ModelUpdateConfCompatiblePass 4 none
AddInputOutputOpsPass 4 none
NormalizationExponentialAverageAutoTickPass 0 0
AutoMixedPrecision 4 9 后向展开了,op 数量翻倍
PruneAmpWhiteIdentityOpPass 5 10
OptimizerPlacementOptimizationPass 4 10
DynamicLossScaleSchedulePass 4 10
AutoTrainStep 4 10
AutoLearningRate 4 10
QuantAwareTraining 0 0
IRRoundTripBeforeAD 0 0
GenerateBackwardAndOptimizerOpConfs 5 10(no backward)
PrunePinnedIdentityOpPass none 140 需要优化 , done
ReplaceEmbeddingOps 11 16 了解下pass的基础开销,这个按说一般不生效, done
SequentialOneEmbeddingOpsPass none 0
FuseEmbeddingShuffleInteractionPass none 0
FuseBCEReduceMeanFwBwPass none 0
AddSspVariableProxy 0 0
CheckpointingPass 12 15
CudnnFusedNormalizationAddReluPass 12 0 这里做了无用pass的优化 https://github.com/Oneflow-Inc/oneflow/pull/9219
PruneCastToStaticShapeOpsPass 12 15
IRRoundTrip 0 0
FuseAddToOutputPass 12 10
FuseAddToOutputPass1 12 15
FuseConsecutiveAddPass none 90 需要优化 , done
IndexedSlicesOptimizerRewritePass 0 0
SplitSparseSoftmaxCrossEntropyOpPass 12 15
DoParallelCastBeforeWideningTypeCast 12 15
AddLbiDiffWatcherOpConfs 0 none
FuseCastScalePass 12 15
PruneParallelCastOpsPass 12 15
FuseUpdateOpsPass 10 12
FuseModelUpdateCastOpsPass 0 0
MultiTensorModelUpdatePass 0 0
FixPipelineStageIdPass 7 10 done
PipelineBufferPass 7 9 done
AutoParallelPass none 0
DumpVariableInfoPass 0 0
DumpBlobParallelConfPass 8 10
strint commented 2 years ago

把 T5 的 hiddent layer 配置成24层,可以比较明显的复现上面的问题:

model.cfg.hidden_layers = 24
InsertPinnedIdentityOpPass","mem_rss":"4377.000000 MB","mem_vm":"26147.000000 MB","time_cost":"3619 milliseconds"}

PrunePinnedIdentityOpPass","mem_rss":"4563.000000 MB","mem_vm":"26332.000000 MB","time_cost":"468179 milliseconds"}

FuseConsecutiveAddPass","mem_rss":"4563.000000 MB","mem_vm":"26332.000000 MB","time_cost":"13704 milliseconds"}

以这个例子作为base和测试用例

hjchen2 commented 2 years ago

PrunePinnedIdentityOpPass和InsertPinnedIdentityOpPass是一对对偶的操作,代码几乎也是一样的,按理来说它们的耗时应该是一样的,除非job_builder提供的AddOps和DelOps两个接口有明显的耗时差异。

hjchen2 commented 2 years ago

FuseConsecutiveAddPass为了合并完全会多次循环去做fuse,可以看一下这个模型里是不是循环了太多次。

Maybe<void> Apply(Job* job, JobPassCtx* ctx) const override {
    bool changed = false;
    do {
      const OpGraph op_graph(*job);
      JobBuilder job_builder(job);
      changed = JUST(Apply(op_graph, &job_builder));
    } while (changed);
    return Maybe<void>::Ok();
  }
strint commented 2 years ago

PrunePinnedIdentityOpPass和InsertPinnedIdentityOpPass是一对对偶的操作,代码几乎也是一样的,按理来说它们的耗时应该是一样的,除非job_builder提供的AddOps和DelOps两个接口有明显的耗时差异。

DelOps 是很慢,参见 https://github.com/Oneflow-Inc/oneflow/pull/9235#discussion_r993280382

改成了 batch del, https://github.com/Oneflow-Inc/oneflow/pull/9281 之前:PrunePinnedIdentityOpPass","mem_rss":"4563.000000 MB","mem_vm":"26332.000000 MB","time_cost":"468179 milliseconds" 之后:PrunePinnedIdentityOpPass","mem_rss":"4558.000000 MB","mem_vm":"26320.000000 MB","time_cost":"8004 milliseconds"

strint commented 2 years ago

FuseConsecutiveAddPass 时间开销应该是和循环次数相关

这样做两次循环,开销就是其它 pass 的两倍了。

I20221021 17:39:30.089967 123852 time_util.h:97] [count log]{"loc":"start","mem_rss":"4563.000000 MB","mem_vm":"26331.000000 MB","time_cost":"0 milliseconds"}
I20221021 17:39:35.753731 123852 time_util.h:97] [count log]{"loc":"create op graph","mem_rss":"4563.000000 MB","mem_vm":"26331.000000 MB","time_cost":"5664 milliseconds"}
I20221021 17:39:35.780948 123852 time_util.h:97] [count log]{"loc":"create job builder","mem_rss":"4563.000000 MB","mem_vm":"26331.000000 MB","time_cost":"27 milliseconds"}
I20221021 17:39:36.257442 123852 time_util.h:97] [count log]{"loc":"fuse_add_apply","mem_rss":"4563.000000 MB","mem_vm":"26331.000000 MB","time_cost":"476 milliseconds"}
I20221021 17:39:36.257534 123852 fuse_consecutive_add_pass.cpp:44] apply cnt 0

I20221021 17:39:36.924232 123852 time_util.h:97] [count log]{"loc":"start","mem_rss":"4563.000000 MB","mem_vm":"26331.000000 MB","time_cost":"665 milliseconds"}
I20221021 17:39:42.702358 123852 time_util.h:97] [count log]{"loc":"create op graph","mem_rss":"4563.000000 MB","mem_vm":"26331.000000 MB","time_cost":"5779 milliseconds"}
I20221021 17:39:42.753391 123852 time_util.h:97] [count log]{"loc":"create job builder","mem_rss":"4563.000000 MB","mem_vm":"26331.000000 MB","time_cost":"51 milliseconds"}
I20221021 17:39:42.771134 123852 time_util.h:97] [count log]{"loc":"fuse_add_apply","mem_rss":"4563.000000 MB","mem_vm":"26331.000000 MB","time_cost":"17 milliseconds"}
I20221021 17:39:42.771194 123852 fuse_consecutive_add_pass.cpp:44] apply cnt 1

一个循环内的主要开销是 op graph 的创建:

const OpGraph op_graph(*job);
{"loc":"create op graph","mem_rss":"4563.000000 MB","mem_vm":"26331.000000 MB","time_cost":"5664 milliseconds"}
strint commented 2 years ago

有空的话,帮忙编译下这个 pr 对应的分支,跑下 t5 八卡的任务,开下 GLOG_v = 1,验证一个问题 @xiezipeng-ML

strint commented 2 years ago

init op graph 值得优化

是大部分pass 的主要开销

GraphBase_0 start compiling with pass pass_cnt_6-FuseAddToOutputPass
I20221021 18:11:09.930469 153421 time_util.h:97] [count log]{"loc":"init op graph","mem_rss":"4425.000000 MB","mem_vm":"26190.000000 MB","time_cost":"3568 milliseconds"}
I20221021 18:11:09.969947 153421 time_util.h:97] [count log]{"loc":"init job builder","mem_rss":"4425.000000 MB","mem_vm":"26190.000000 MB","time_cost":"39 milliseconds"}
I20221021 18:11:10.372117 153421 time_util.h:97] [count log]{"loc":"init job builder","mem_rss":"4433.000000 MB","mem_vm":"26199.000000 MB","time_cost":"18 milliseconds"}
I20221021 18:11:10.817144 153421 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_6-FuseAddToOutputPass
I20221021 18:11:10.818809 153421 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 FuseAddToOutputPass","mem_rss":"4433.000000 MB","mem_vm":"26199.000000 MB","time_cost":"4455 milliseconds"}
strint commented 2 years ago

FuseConsecutiveAddPass 时间开销应该是和循环次数相关

这样做两次循环,开销就是其它 pass 的两倍了。

现在的逻辑是 add_n_1 -> add_n_2 -> add_n_3;

add_n_1合并了add_n_2之后,add_n_2就不会合并 add_n_3 了。需要等下一个 pass apply 才会再做合并。

这样如果有 k 个 连续的 add_n,就会执行 ln(k) 次 pass apply 。

这样验证下的确如此后,可以做一个优化,支持一次把连续的 add_n 合并。

hjchen2 commented 2 years ago

FuseConsecutiveAddPass 时间开销应该是和循环次数相关

这样做两次循环,开销就是其它 pass 的两倍了。

I20221021 17:39:30.089967 123852 time_util.h:97] [count log]{"loc":"start","mem_rss":"4563.000000 MB","mem_vm":"26331.000000 MB","time_cost":"0 milliseconds"}
I20221021 17:39:35.753731 123852 time_util.h:97] [count log]{"loc":"create op graph","mem_rss":"4563.000000 MB","mem_vm":"26331.000000 MB","time_cost":"5664 milliseconds"}
I20221021 17:39:35.780948 123852 time_util.h:97] [count log]{"loc":"create job builder","mem_rss":"4563.000000 MB","mem_vm":"26331.000000 MB","time_cost":"27 milliseconds"}
I20221021 17:39:36.257442 123852 time_util.h:97] [count log]{"loc":"fuse_add_apply","mem_rss":"4563.000000 MB","mem_vm":"26331.000000 MB","time_cost":"476 milliseconds"}
I20221021 17:39:36.257534 123852 fuse_consecutive_add_pass.cpp:44] apply cnt 0

I20221021 17:39:36.924232 123852 time_util.h:97] [count log]{"loc":"start","mem_rss":"4563.000000 MB","mem_vm":"26331.000000 MB","time_cost":"665 milliseconds"}
I20221021 17:39:42.702358 123852 time_util.h:97] [count log]{"loc":"create op graph","mem_rss":"4563.000000 MB","mem_vm":"26331.000000 MB","time_cost":"5779 milliseconds"}
I20221021 17:39:42.753391 123852 time_util.h:97] [count log]{"loc":"create job builder","mem_rss":"4563.000000 MB","mem_vm":"26331.000000 MB","time_cost":"51 milliseconds"}
I20221021 17:39:42.771134 123852 time_util.h:97] [count log]{"loc":"fuse_add_apply","mem_rss":"4563.000000 MB","mem_vm":"26331.000000 MB","time_cost":"17 milliseconds"}
I20221021 17:39:42.771194 123852 fuse_consecutive_add_pass.cpp:44] apply cnt 1

一个循环内的主要开销是 op graph 的创建:

const OpGraph op_graph(*job);
{"loc":"create op graph","mem_rss":"4563.000000 MB","mem_vm":"26331.000000 MB","time_cost":"5664 milliseconds"}

可以初始化一次op graph,然后每个pass都直接修改op graph么,最后pass都做完之后再序列化成job

strint commented 2 years ago

FuseConsecutiveAddPass 时间开销应该是和循环次数相关

这样做两次循环,开销就是其它 pass 的两倍了。

之前

FuseConsecutiveAddPass","mem_rss":"4568.000000 MB","mem_vm":"26334.000000 MB","time_cost":"13523 milliseconds"}

改成了一遍 pass 之后

FuseConsecutiveAddPass","mem_rss":"4565.000000 MB","mem_vm":"26316.000000 MB","time_cost":"7200 milliseconds"}

commit : https://github.com/Oneflow-Inc/oneflow/commit/6a4ced1b4a6df2b4ef1522f167e46c863d4a9e01 和其它 pass 开销基本相同了。

strint commented 2 years ago

op graph init 是 pass 的主要开销

其内部各部分开销如下

Maybe<void> OpGraph::Init(const Job& job) {
  auto cost_ct = std::make_unique<TimeCounter<std::chrono::milliseconds>>(true, true);
  InitNodes(job);
  cost_ct->Count("OpGraph0", 1);
  op_name2op_node_.reserve(job.net().op_size());
  ForEachNode([&](OpNode* node) {
    CHECK(op_name2op_node_.emplace(node->op().op_name(), node).second)
        << "op_name: " << node->op().op_name();
  });
  cost_ct->Count("OpGraph1", 1);
  InitEdges();
  cost_ct->Count("OpGraph2", 1);
  InitProducerOpName2CtrlConsumerOpNames(job);
  cost_ct->Count("OpGraph3", 1);
  CheckIsDAG();
  cost_ct->Count("OpGraph4", 1);
  ForEachNode([](OpNode* node) { node->InitLbi2SourceNode(); });
  cost_ct->Count("OpGraph5", 1);
  InferBlobLastUsed();
  cost_ct->Count("OpGraph6", 1);
  InferTimeShape();
  cost_ct->Count("OpGraph7", 1);
  {
    LazyMode::Guard enable_lazy_mode_guard(true);
    JUST(InferLogicalBlobDesc(job));
  }
  cost_ct->Count("OpGraph8", 1);
  return Maybe<void>::Ok();
}
I20221022 00:52:34.335309 450187 time_util.h:97] [count log]{"loc":"OpGraph0","mem_rss":"4431.000000 MB","mem_vm":"26211.000000 MB","time_cost":"689 milliseconds"}
I20221022 00:52:34.343580 450187 time_util.h:97] [count log]{"loc":"OpGraph1","mem_rss":"4431.000000 MB","mem_vm":"26211.000000 MB","time_cost":"10 milliseconds"}
I20221022 00:52:34.520028 450187 time_util.h:97] [count log]{"loc":"OpGraph2","mem_rss":"4431.000000 MB","mem_vm":"26211.000000 MB","time_cost":"176 milliseconds"}
I20221022 00:52:34.521185 450187 time_util.h:97] [count log]{"loc":"OpGraph3","mem_rss":"4431.000000 MB","mem_vm":"26211.000000 MB","time_cost":"1 milliseconds"}
I20221022 00:52:34.653026 450187 time_util.h:97] [count log]{"loc":"OpGraph4","mem_rss":"4431.000000 MB","mem_vm":"26211.000000 MB","time_cost":"131 milliseconds"}
I20221022 00:52:34.668092 450187 time_util.h:97] [count log]{"loc":"OpGraph5","mem_rss":"4431.000000 MB","mem_vm":"26211.000000 MB","time_cost":"15 milliseconds"}
I20221022 00:52:34.722708 450187 time_util.h:97] [count log]{"loc":"OpGraph6","mem_rss":"4431.000000 MB","mem_vm":"26211.000000 MB","time_cost":"54 milliseconds"}
I20221022 00:52:34.758741 450187 time_util.h:97] [count log]{"loc":"OpGraph7","mem_rss":"4431.000000 MB","mem_vm":"26211.000000 MB","time_cost":"36 milliseconds"}
I20221022 00:52:37.151204 450187 time_util.h:97] [count log]{"loc":"OpGraph8","mem_rss":"4431.000000 MB","mem_vm":"26211.000000 MB","time_cost":"2392 milliseconds"}
I20221022 00:52:37.151316 450187 time_util.h:97] [count log]{"loc":"init op graph","mem_rss":"4431.000000 MB","mem_vm":"26211.000000 MB","time_cost":"3507 milliseconds"}

InferLogicalBlobDesc 是主要开销,内部做了 op 遍历和推导,各个推导步骤的开销均匀,没有很好的优化点。 op间也不能并行;

    JUST(InferLogicalBlobDesc(job))
strint commented 2 years ago

新增一下 pass 生效的判断,大部分情况下可以跳过 op graph 的创建

ReplaceEmbeddingOps, 从 16s 减到 3ms; PruneCastToStaticShapeOpsPass,T5需要这个逻辑; FixPipelineStageIdPass,从 10 s 减到 0; PipelineBufferPass,从 9s 减到 0;

commit: https://github.com/Oneflow-Inc/oneflow/commit/ff736fbee0a4d53f68177d7c511a09eccc8098d2

strint commented 2 years ago

https://github.com/Oneflow-Inc/libai/issues/407#issuecomment-1285633528

主要的开销异常的地方已经解决。PR:https://github.com/Oneflow-Inc/oneflow/pull/9281

xiezipeng-ML commented 2 years ago

oneflow version: 0.8.1+cu112.git.ff736fbee0 git_commit: ff736fbee0 cmake_build_type: Debug rdma: False mlir: False 28上8卡,dp2_mp4_pp1

Log file created at: 2022/10/22 03:36:45
Running on machine: oneflow-28
Running duration (h:mm:ss): 0:00:00
Log line format: [IWEF]yyyymmdd hh:mm:ss.uuuuuu threadid file:line] msg
I20221022 03:36:45.135177 4145383 env_global_objects_scope.cpp:167] Using rpc backend: gRPC
I20221022 03:36:45.136451 4145383 ctrl_server.cpp:40] CtrlServer listening on 0.0.0.0:34259
I20221022 03:36:45.136811 4145383 rank_info_bootstrap_server.cpp:47] RankInfoBootstrapServer listening on 0.0.0.0:12345
I20221022 03:36:55.631938 4145383 epoll_comm_network.cpp:63] CommNet:Epoll listening on 0.0.0.0:46793
W20221022 03:37:05.360247 4145383 env_global_objects_scope.cpp:285] Skip init RDMA because RDMA is not compiled!
I20221022 03:39:50.588335 4145383 version.cpp:22] OneFlow git version: v0.8.1-377-gff736fbee0
I20221022 03:39:50.588404 4145383 cuda_device_manager_factory.cpp:63] CUDA runtime version: 11.2
I20221022 03:39:50.588450 4145383 cuda_device_manager_factory.cpp:72] cuDNN version: 8.5.0
I20221022 03:39:50.588457 4145383 cuda_device_manager_factory.cpp:85] NCCL version: 2.12.10
I20221022 03:42:09.839366 4145383 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_0-InsertPinnedIdentityOpPass
I20221022 03:43:24.025779 4145383 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_0-InsertPinnedIdentityOpPass
I20221022 03:43:24.026322 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 InsertPinnedIdentityOpPass","mem_rss":"4640.000000 MB","mem_vm":"56682.000000 MB","time_cost":"74186 milliseconds"}
I20221022 03:43:24.026435 4145383 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_1-EliminateDeadNodesPass
I20221022 03:44:42.391759 4145383 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_1-EliminateDeadNodesPass
I20221022 03:44:42.393123 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 EliminateDeadNodesPass","mem_rss":"4671.000000 MB","mem_vm":"56712.000000 MB","time_cost":"78365 milliseconds"}
I20221022 03:44:42.393240 4145383 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_2-NormalizationExponentialAverageAutoTickPass
I20221022 03:44:42.393298 4145383 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_2-NormalizationExponentialAverageAutoTickPass
I20221022 03:44:42.393368 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 NormalizationExponentialAverageAutoTickPass","mem_rss":"4671.000000 MB","mem_vm":"56712.000000 MB","time_cost":"0 milliseconds"}
I20221022 03:44:42.393404 4145383 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_3-AutoMixedPrecision
I20221022 03:46:01.006008 4145383 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_3-AutoMixedPrecision
I20221022 03:46:01.007236 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 AutoMixedPrecision","mem_rss":"4671.000000 MB","mem_vm":"56712.000000 MB","time_cost":"78612 milliseconds"}
I20221022 03:46:01.007349 4145383 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_4-PruneAmpWhiteIdentityOpPass
I20221022 03:47:26.303113 4145383 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_4-PruneAmpWhiteIdentityOpPass
I20221022 03:47:26.304589 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 PruneAmpWhiteIdentityOpPass","mem_rss":"4680.000000 MB","mem_vm":"56722.000000 MB","time_cost":"85295 milliseconds"}
I20221022 03:47:26.304697 4145383 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_5-OptimizerPlacementOptimizationPass
I20221022 03:48:51.500874 4145383 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_5-OptimizerPlacementOptimizationPass
I20221022 03:48:51.502127 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 OptimizerPlacementOptimizationPass","mem_rss":"4680.000000 MB","mem_vm":"56722.000000 MB","time_cost":"85196 milliseconds"}
I20221022 03:48:51.502249 4145383 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_6-FuseAddToOutputPass
I20221022 03:50:13.820356 4145383 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_6-FuseAddToOutputPass
I20221022 03:50:13.821825 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 FuseAddToOutputPass","mem_rss":"4698.000000 MB","mem_vm":"56740.000000 MB","time_cost":"82318 milliseconds"}
I20221022 03:50:13.821916 4145383 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_7-DynamicLossScaleSchedulePass
I20221022 03:51:38.625154 4145383 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_7-DynamicLossScaleSchedulePass
I20221022 03:51:38.626600 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 DynamicLossScaleSchedulePass","mem_rss":"4698.000000 MB","mem_vm":"56740.000000 MB","time_cost":"84803 milliseconds"}
I20221022 03:51:38.626708 4145383 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_8-AutoTrainStep
I20221022 03:53:03.468026 4145383 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_8-AutoTrainStep
I20221022 03:53:03.469501 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 AutoTrainStep","mem_rss":"4698.000000 MB","mem_vm":"56740.000000 MB","time_cost":"84841 milliseconds"}
I20221022 03:53:03.469610 4145383 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_9-AutoLearningRate
I20221022 03:54:28.282389 4145383 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_9-AutoLearningRate
I20221022 03:54:28.283830 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 AutoLearningRate","mem_rss":"4698.000000 MB","mem_vm":"56740.000000 MB","time_cost":"84812 milliseconds"}
I20221022 03:54:28.283922 4145383 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_10-QuantAwareTraining
I20221022 03:54:28.283951 4145383 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_10-QuantAwareTraining
I20221022 03:54:28.285116 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 QuantAwareTraining","mem_rss":"4698.000000 MB","mem_vm":"56740.000000 MB","time_cost":"0 milliseconds"}
I20221022 03:54:28.285162 4145383 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_11-GenerateOptimizerOpConfs
I20221022 03:55:56.896756 4145383 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_11-GenerateOptimizerOpConfs
I20221022 03:55:56.898242 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 GenerateOptimizerOpConfs","mem_rss":"4699.000000 MB","mem_vm":"56740.000000 MB","time_cost":"88611 milliseconds"}
I20221022 03:55:56.898350 4145383 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_12-PrunePinnedIdentityOpPass
I20221022 03:58:17.845922 4145383 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_12-PrunePinnedIdentityOpPass
I20221022 03:58:17.847334 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 PrunePinnedIdentityOpPass","mem_rss":"4797.000000 MB","mem_vm":"56838.000000 MB","time_cost":"140947 milliseconds"}
I20221022 03:58:17.847429 4145383 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_13-ReplaceEmbeddingOps
I20221022 03:58:17.851245 4145383 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_13-ReplaceEmbeddingOps
I20221022 03:58:17.852438 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 ReplaceEmbeddingOps","mem_rss":"4797.000000 MB","mem_vm":"56838.000000 MB","time_cost":"3 milliseconds"}
I20221022 03:58:17.852484 4145383 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_14-SequentialOneEmbeddingOpsPass
I20221022 03:58:17.852505 4145383 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_14-SequentialOneEmbeddingOpsPass
I20221022 03:58:17.853689 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 SequentialOneEmbeddingOpsPass","mem_rss":"4797.000000 MB","mem_vm":"56838.000000 MB","time_cost":"0 milliseconds"}
I20221022 03:58:17.853731 4145383 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_15-FuseEmbeddingShuffleInteractionPass
I20221022 03:58:17.853745 4145383 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_15-FuseEmbeddingShuffleInteractionPass
I20221022 03:58:17.854808 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 FuseEmbeddingShuffleInteractionPass","mem_rss":"4797.000000 MB","mem_vm":"56838.000000 MB","time_cost":"0 milliseconds"}
I20221022 03:58:17.854846 4145383 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_16-FuseBCEReduceMeanFwBwPass
I20221022 03:58:17.854859 4145383 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_16-FuseBCEReduceMeanFwBwPass
I20221022 03:58:17.855938 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 FuseBCEReduceMeanFwBwPass","mem_rss":"4797.000000 MB","mem_vm":"56838.000000 MB","time_cost":"0 milliseconds"}
I20221022 03:58:17.855974 4145383 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_17-AddSspVariableProxy
I20221022 03:58:17.856009 4145383 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_17-AddSspVariableProxy
I20221022 03:58:17.857071 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 AddSspVariableProxy","mem_rss":"4797.000000 MB","mem_vm":"56838.000000 MB","time_cost":"0 milliseconds"}
I20221022 03:58:17.857110 4145383 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_18-CheckpointingPass
I20221022 04:00:33.866252 4145383 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_18-CheckpointingPass
I20221022 04:00:33.867687 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 CheckpointingPass","mem_rss":"4797.000000 MB","mem_vm":"56838.000000 MB","time_cost":"136009 milliseconds"}
I20221022 04:00:33.867782 4145383 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_19-CudnnFusedNormalizationAddReluPass
I20221022 04:00:33.873327 4145383 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_19-CudnnFusedNormalizationAddReluPass
I20221022 04:00:33.874555 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 CudnnFusedNormalizationAddReluPass","mem_rss":"4797.000000 MB","mem_vm":"56838.000000 MB","time_cost":"5 milliseconds"}
I20221022 04:00:33.874605 4145383 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_20-PruneCastToStaticShapeOpsPass
I20221022 04:02:50.802913 4145383 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_20-PruneCastToStaticShapeOpsPass
I20221022 04:02:50.804263 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 PruneCastToStaticShapeOpsPass","mem_rss":"4797.000000 MB","mem_vm":"56838.000000 MB","time_cost":"136928 milliseconds"}
I20221022 04:02:50.804360 4145383 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_21-FuseAddToOutputPass1
I20221022 04:05:03.671183 4145383 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_21-FuseAddToOutputPass1
I20221022 04:05:03.672658 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 FuseAddToOutputPass","mem_rss":"4797.000000 MB","mem_vm":"56838.000000 MB","time_cost":"132866 milliseconds"}
I20221022 04:05:03.672755 4145383 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_22-FuseConsecutiveAddPass
I20221022 04:05:03.674055 4145383 time_util.h:97] [count log]{"loc":"start","mem_rss":"4797.000000 MB","mem_vm":"56838.000000 MB","time_cost":"0 milliseconds"}
I20221022 04:07:14.795783 4145383 time_util.h:97] [count log]{"loc":"create op graph","mem_rss":"4797.000000 MB","mem_vm":"56838.000000 MB","time_cost":"131122 milliseconds"}
I20221022 04:07:14.844141 4145383 time_util.h:97] [count log]{"loc":"create job builder","mem_rss":"4797.000000 MB","mem_vm":"56838.000000 MB","time_cost":"48 milliseconds"}
I20221022 04:07:15.401420 4145383 time_util.h:97] [count log]{"loc":"fuse_add_apply","mem_rss":"4797.000000 MB","mem_vm":"56838.000000 MB","time_cost":"556 milliseconds"}
I20221022 04:07:16.061648 4145383 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_22-FuseConsecutiveAddPass
I20221022 04:07:16.063112 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 FuseConsecutiveAddPass","mem_rss":"4797.000000 MB","mem_vm":"56838.000000 MB","time_cost":"132388 milliseconds"}
I20221022 04:07:16.063203 4145383 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_23-IndexedSlicesOptimizerRewritePass
I20221022 04:07:16.063234 4145383 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_23-IndexedSlicesOptimizerRewritePass
I20221022 04:07:16.064553 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 IndexedSlicesOptimizerRewritePass","mem_rss":"4797.000000 MB","mem_vm":"56838.000000 MB","time_cost":"0 milliseconds"}
I20221022 04:07:16.064599 4145383 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_24-SplitSparseSoftmaxCrossEntropyOpPass
I20221022 04:09:27.992506 4145383 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_24-SplitSparseSoftmaxCrossEntropyOpPass
I20221022 04:09:27.994026 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 SplitSparseSoftmaxCrossEntropyOpPass","mem_rss":"4797.000000 MB","mem_vm":"56838.000000 MB","time_cost":"131927 milliseconds"}
I20221022 04:09:27.994120 4145383 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_25-DoParallelCastBeforeWideningTypeCast
I20221022 04:11:40.110169 4145383 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_25-DoParallelCastBeforeWideningTypeCast
I20221022 04:11:40.111547 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 DoParallelCastBeforeWideningTypeCast","mem_rss":"4797.000000 MB","mem_vm":"56838.000000 MB","time_cost":"132116 milliseconds"}
I20221022 04:11:40.111640 4145383 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_26-FuseCastScalePass
I20221022 04:13:51.355201 4145383 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_26-FuseCastScalePass
I20221022 04:13:51.356525 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 FuseCastScalePass","mem_rss":"4797.000000 MB","mem_vm":"56838.000000 MB","time_cost":"131243 milliseconds"}
I20221022 04:13:51.356616 4145383 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_27-PruneParallelCastOpsPass
I20221022 04:15:59.787802 4145383 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_27-PruneParallelCastOpsPass
I20221022 04:15:59.789198 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 PruneParallelCastOpsPass","mem_rss":"4797.000000 MB","mem_vm":"56838.000000 MB","time_cost":"128431 milliseconds"}
I20221022 04:15:59.789291 4145383 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_28-FuseUpdateOpsPass
I20221022 04:17:42.223071 4145383 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_28-FuseUpdateOpsPass
I20221022 04:17:42.224488 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 FuseUpdateOpsPass","mem_rss":"4797.000000 MB","mem_vm":"56838.000000 MB","time_cost":"102433 milliseconds"}
I20221022 04:17:42.224578 4145383 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_29-FuseModelUpdateCastOpsPass
I20221022 04:17:42.224611 4145383 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_29-FuseModelUpdateCastOpsPass
I20221022 04:17:42.225867 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 FuseModelUpdateCastOpsPass","mem_rss":"4797.000000 MB","mem_vm":"56838.000000 MB","time_cost":"0 milliseconds"}
I20221022 04:17:42.225915 4145383 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_30-MultiTensorModelUpdatePass
I20221022 04:17:42.225950 4145383 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_30-MultiTensorModelUpdatePass
I20221022 04:17:42.227128 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 MultiTensorModelUpdatePass","mem_rss":"4797.000000 MB","mem_vm":"56838.000000 MB","time_cost":"0 milliseconds"}
I20221022 04:17:42.227169 4145383 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_31-FixPipelineStageIdPass
I20221022 04:17:42.227185 4145383 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_31-FixPipelineStageIdPass
I20221022 04:17:42.228317 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 FixPipelineStageIdPass","mem_rss":"4797.000000 MB","mem_vm":"56838.000000 MB","time_cost":"0 milliseconds"}
I20221022 04:17:42.228354 4145383 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_32-PipelineBufferPass
I20221022 04:17:42.228366 4145383 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_32-PipelineBufferPass
I20221022 04:17:42.229512 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 PipelineBufferPass","mem_rss":"4797.000000 MB","mem_vm":"56838.000000 MB","time_cost":"0 milliseconds"}
I20221022 04:17:42.229549 4145383 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_33-AutoParallelPass
I20221022 04:17:42.229562 4145383 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_33-AutoParallelPass
I20221022 04:17:42.230593 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 AutoParallelPass","mem_rss":"4797.000000 MB","mem_vm":"56838.000000 MB","time_cost":"0 milliseconds"}
I20221022 04:17:42.230629 4145383 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_34-DumpVariableInfoPass
I20221022 04:17:42.230648 4145383 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_34-DumpVariableInfoPass
I20221022 04:17:42.231755 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 DumpVariableInfoPass","mem_rss":"4797.000000 MB","mem_vm":"56838.000000 MB","time_cost":"0 milliseconds"}
I20221022 04:17:42.231792 4145383 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_35-DumpBlobParallelConfPass
I20221022 04:19:04.926564 4145383 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_35-DumpBlobParallelConfPass
I20221022 04:19:04.927960 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 DumpBlobParallelConfPass","mem_rss":"4797.000000 MB","mem_vm":"56838.000000 MB","time_cost":"82694 milliseconds"}
I20221022 04:19:04.963881 4145383 time_util.h:95] [count log]{"loc":"[GraphCompile]GraphBase_0 OptimizationLogicalGraph","mem_rss":"4797.000000 MB","mem_vm":"56838.000000 MB","time_cost":"2215 seconds"}
I20221022 04:19:51.972611 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 GroupBoxingByDstParallel","mem_rss":"4879.000000 MB","mem_vm":"56919.000000 MB","time_cost":"16 seconds"}
I20221022 04:20:38.493579 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 BoxingWithMiddleNodes","mem_rss":"4879.000000 MB","mem_vm":"56919.000000 MB","time_cost":"46 seconds"}
I20221022 04:20:50.072129 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 SetCtrlInOpName4VariableOp","mem_rss":"4879.000000 MB","mem_vm":"56919.000000 MB","time_cost":"11 seconds"}
I20221022 04:21:04.680253 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 AutoPrependTick","mem_rss":"4879.000000 MB","mem_vm":"56919.000000 MB","time_cost":"14 seconds"}
I20221022 04:21:16.225966 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 AddTickForTimeShape","mem_rss":"4879.000000 MB","mem_vm":"56919.000000 MB","time_cost":"11 seconds"}
I20221022 04:21:31.300956 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 MultiClientAutoSourceAndSinkTick","mem_rss":"4897.000000 MB","mem_vm":"56937.000000 MB","time_cost":"15 seconds"}
I20221022 04:21:42.908663 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 MultiClientAutoInterfaceCriticalSectionTick","mem_rss":"4897.000000 MB","mem_vm":"56937.000000 MB","time_cost":"11 seconds"}
I20221022 04:21:42.910621 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 SystemOpFillJobNamePass","mem_rss":"4897.000000 MB","mem_vm":"56937.000000 MB","time_cost":"0 seconds"}
I20221022 04:21:54.615434 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 DumpBlobParallelConfPass","mem_rss":"4897.000000 MB","mem_vm":"56937.000000 MB","time_cost":"11 seconds"}
I20221022 04:22:15.568470 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 InsertNcclLogicalOpPass","mem_rss":"4947.000000 MB","mem_vm":"56987.000000 MB","time_cost":"20 seconds"}
I20221022 04:22:28.606386 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 DumpBlobParallelConfPass","mem_rss":"4947.000000 MB","mem_vm":"56987.000000 MB","time_cost":"13 seconds"}
I20221022 04:22:41.519970 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 CheckOpGraph","mem_rss":"4947.000000 MB","mem_vm":"56987.000000 MB","time_cost":"12 seconds"}
I20221022 04:22:41.521317 4145383 time_util.h:95] [count log]{"loc":"[GraphCompile]GraphBase_0 CompleteJob","mem_rss":"4947.000000 MB","mem_vm":"56987.000000 MB","time_cost":"207 seconds"}
I20221022 04:22:53.735800 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 NewOpGraph","mem_rss":"4947.000000 MB","mem_vm":"56987.000000 MB","time_cost":"12 seconds"}
I20221022 04:24:26.798403 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 BuildTaskGraph","mem_rss":"5392.000000 MB","mem_vm":"57431.000000 MB","time_cost":"93 seconds"}
I20221022 04:24:54.016377 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 AddTaskToPlan","mem_rss":"6013.000000 MB","mem_vm":"62041.000000 MB","time_cost":"27 seconds"}
I20221022 04:24:58.687100 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 InferMemShare","mem_rss":"6025.000000 MB","mem_vm":"62585.000000 MB","time_cost":"4 seconds"}
I20221022 04:25:07.239583 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 GenMemBlockAndChunk","mem_rss":"6031.000000 MB","mem_vm":"62073.000000 MB","time_cost":"7 seconds"}
I20221022 04:25:07.240767 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 LogPlan","mem_rss":"6031.000000 MB","mem_vm":"62073.000000 MB","time_cost":"0 seconds"}
I20221022 04:25:07.508179 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 GenRegisterHint","mem_rss":"6031.000000 MB","mem_vm":"62073.000000 MB","time_cost":"0 seconds"}
I20221022 04:25:07.514464 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 GenCollectiveBoxingPlan","mem_rss":"6031.000000 MB","mem_vm":"62073.000000 MB","time_cost":"0 seconds"}
I20221022 04:25:07.685060 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 DumpCtrlRegstInfoToPlan","mem_rss":"6031.000000 MB","mem_vm":"62073.000000 MB","time_cost":"0 seconds"}
I20221022 04:25:08.162101 4145383 plan_util.cpp:965] 
 Graph name GraphBase_0 in Rank: 0, Device: 0 needs to allocate [ 11852.6 MiB ] device memory. 
   In general, Chunk id: 2  memory is [ 4509.95 MiB ] with mem_block_num = 1
        Unreused memory not eager var is  [ 12.6049 MiB ] with mem_block_num = 9405
        Eager Variable Tensor total memory is [ 7330.05 MiB ] with mem_block_num = 2616
I20221022 04:25:08.188794 4145383 plan_util.cpp:965] 
 Graph name GraphBase_0 in Rank: 1, Device: 1 needs to allocate [ 11854.4 MiB ] device memory. 
   In general, Chunk id: 5  memory is [ 4511.8 MiB ] with mem_block_num = 1
        Unreused memory not eager var is  [ 12.606 MiB ] with mem_block_num = 9411
        Eager Variable Tensor total memory is [ 7330.05 MiB ] with mem_block_num = 2616
I20221022 04:25:08.215277 4145383 plan_util.cpp:965] 
 Graph name GraphBase_0 in Rank: 2, Device: 2 needs to allocate [ 11854.4 MiB ] device memory. 
   In general, Chunk id: 7  memory is [ 4511.8 MiB ] with mem_block_num = 1
        Unreused memory not eager var is  [ 12.606 MiB ] with mem_block_num = 9411
        Eager Variable Tensor total memory is [ 7330.05 MiB ] with mem_block_num = 2616
I20221022 04:25:08.241674 4145383 plan_util.cpp:965] 
 Graph name GraphBase_0 in Rank: 3, Device: 3 needs to allocate [ 11854.4 MiB ] device memory. 
   In general, Chunk id: 0  memory is [ 4511.8 MiB ] with mem_block_num = 1
        Unreused memory not eager var is  [ 12.606 MiB ] with mem_block_num = 9411
        Eager Variable Tensor total memory is [ 7330.05 MiB ] with mem_block_num = 2616
I20221022 04:25:08.268170 4145383 plan_util.cpp:965] 
 Graph name GraphBase_0 in Rank: 4, Device: 4 needs to allocate [ 11854.5 MiB ] device memory. 
   In general, Chunk id: 6  memory is [ 4511.8 MiB ] with mem_block_num = 1
        Unreused memory not eager var is  [ 12.6075 MiB ] with mem_block_num = 9418
        Eager Variable Tensor total memory is [ 7330.05 MiB ] with mem_block_num = 2616
I20221022 04:25:08.294770 4145383 plan_util.cpp:965] 
 Graph name GraphBase_0 in Rank: 5, Device: 5 needs to allocate [ 11854.5 MiB ] device memory. 
   In general, Chunk id: 4  memory is [ 4511.8 MiB ] with mem_block_num = 1
        Unreused memory not eager var is  [ 12.6075 MiB ] with mem_block_num = 9418
        Eager Variable Tensor total memory is [ 7330.05 MiB ] with mem_block_num = 2616
I20221022 04:25:08.320937 4145383 plan_util.cpp:965] 
 Graph name GraphBase_0 in Rank: 6, Device: 6 needs to allocate [ 11852.6 MiB ] device memory. 
   In general, Chunk id: 1  memory is [ 4509.95 MiB ] with mem_block_num = 1
        Unreused memory not eager var is  [ 12.6075 MiB ] with mem_block_num = 9418
        Eager Variable Tensor total memory is [ 7330.05 MiB ] with mem_block_num = 2616
I20221022 04:25:08.347235 4145383 plan_util.cpp:965] 
 Graph name GraphBase_0 in Rank: 7, Device: 7 needs to allocate [ 11854.5 MiB ] device memory. 
   In general, Chunk id: 3  memory is [ 4511.8 MiB ] with mem_block_num = 1
        Unreused memory not eager var is  [ 12.6075 MiB ] with mem_block_num = 9418
        Eager Variable Tensor total memory is [ 7330.05 MiB ] with mem_block_num = 2616
I20221022 04:25:08.606478 4145383 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 GenMemAndLightPlanLog","mem_rss":"6031.000000 MB","mem_vm":"62073.000000 MB","time_cost":"0 seconds"}
I20221022 04:25:08.607070 4145383 time_util.h:95] [count log]{"loc":"[GraphCompile]GraphBase_0 CompilePlan","mem_rss":"6031.000000 MB","mem_vm":"62073.000000 MB","time_cost":"147 seconds"}
I20221022 04:25:19.894229 4145383 time_util.h:95] [count log]{"loc":"[GraphCompile]GraphBase_0 SyncPlan","mem_rss":"6031.000000 MB","mem_vm":"62073.000000 MB","time_cost":"11 seconds"}
F20221022 04:25:33.602072 4145383 memory_allocator.cpp:50] out of memory
  File "/home/xiezipeng/oneflow/oneflow/core/memory/memory_allocator.cpp", line 50, in Allocate
    device->Alloc(options, &ptr, size)
Error Type: oneflow.ErrorProto.out_of_memory_error

@strint 啸宇 这个ok了

strint commented 2 years ago

缩减 PrunePinnedIdentityOpPass 和 FuseConsecutiveAddPass 开销

I20221020 03:51:52.428287 2539781 ibverbs_comm_network.cpp:177] Connected to peer 6
I20221020 03:54:08.694559 2539781 version.cpp:22] OneFlow git version: 63a3b287

显示 build model 是2分16秒;

OptimizationLogicalGraph","mem_rss":"4700.000000 MB","mem_vm":"57623.000000 MB","time_cost":"513 seconds"}
CompleteJob","mem_rss":"4886.000000 MB","mem_vm":"57810.000000 MB","time_cost":"40 seconds"}
CompilePlan","mem_rss":"5963.000000 MB","mem_vm":"62908.000000 MB","time_cost":"17 seconds"}

OptimizationLogicalGraph 有点慢,8.6分钟,不正常

  • 大部分的 pass 都是10秒级别,不太正常,之前的测评的任务,基本都不到1秒;
  • 下面两个 pass 耗时比其它明显多
[pass cost base]InsertPinnedIdentityOpPass","mem_rss":"4570.000000 MB","mem_vm":"57495.000000 MB","time_cost":"10 seconds"}
PrunePinnedIdentityOpPass","mem_rss":"4687.000000 MB","mem_vm":"57611.000000 MB","time_cost":"127 seconds"}
FuseConsecutiveAddPass","mem_rss":"4700.000000 MB","mem_vm":"57623.000000 MB","time_cost":"97 seconds"}
OptimizationLogicalGraph","mem_rss":"4797.000000 MB","mem_vm":"56838.000000 MB","time_cost":"2215 seconds"}
CompleteJob","mem_rss":"4947.000000 MB","mem_vm":"56987.000000 MB","time_cost":"207 seconds"}
CompilePlan","mem_rss":"6031.000000 MB","mem_vm":"62073.000000 MB","time_cost":"147 seconds"}

其中

[pass ocst base]InsertPinnedIdentityOpPass","mem_rss":"4640.000000 MB","mem_vm":"56682.000000 MB","time_cost":"74186 milliseconds"}
PrunePinnedIdentityOpPass","mem_rss":"4797.000000 MB","mem_vm":"56838.000000 MB","time_cost":"140947 milliseconds"}
FuseConsecutiveAddPass","mem_rss":"4797.000000 MB","mem_vm":"56838.000000 MB","time_cost":"132388 milliseconds"}

可以看到相对值减少了,但是绝对值增大了。

I20221022 04:05:03.674055 4145383 time_util.h:97] [count log]{"loc":"start","mem_rss":"4797.000000 MB","mem_vm":"56838.000000 MB","time_cost":"0 milliseconds"}
I20221022 04:07:14.795783 4145383 time_util.h:97] [count log]{"loc":"create op graph","mem_rss":"4797.000000 MB","mem_vm":"56838.000000 MB","time_cost":"131122 milliseconds"} 【op graph 基本开销达到了131s】
I20221022 04:07:14.844141 4145383 time_util.h:97] [count log]{"loc":"create job builder","mem_rss":"4797.000000 MB","mem_vm":"56838.000000 MB","time_cost":"48 milliseconds"}
I20221022 04:07:15.401420 4145383 time_util.h:97] [count log]{"loc":"fuse_add_apply","mem_rss":"4797.000000 MB","mem_vm":"56838.000000 MB","time_cost":"556 milliseconds"}

op graph 基本开销达到了131s,太高了,猜测是机器的 cpu 有别的任务在使用。

说明 PrunePinnedIdentityOpPass 和 FuseConsecutiveAddPass 已经降低到了普通 pass 的开销程度。

需要再跑一下,确认机器没有别人在使用。

缩减无用的pass 创建 op graph

之前

ReplaceEmbeddingOps","mem_rss":"4687.000000 MB","mem_vm":"57611.000000 MB","time_cost":"16 seconds"}
FixPipelineStageIdPass","mem_rss":"4700.000000 MB","mem_vm":"57623.000000 MB","time_cost":"11 seconds"}
PipelineBufferPass","mem_rss":"4700.000000 MB","mem_vm":"57623.000000 MB","time_cost":"11 seconds"}

现在

ReplaceEmbeddingOps","mem_rss":"4797.000000 MB","mem_vm":"56838.000000 MB","time_cost":"3 milliseconds"}
FixPipelineStageIdPass","mem_rss":"4797.000000 MB","mem_vm":"56838.000000 MB","time_cost":"0 milliseconds"}
PipelineBufferPass","mem_rss":"4797.000000 MB","mem_vm":"56838.000000 MB","time_cost":"0 milliseconds"}

达到了预期效果

xiezipeng-ML commented 2 years ago

2215.0827048670035 compile graph time....................................... oneflow version: 0.8.1+cu112.git.ff736fbee0 git_commit: ff736fbee0 cmake_build_type: Debug rdma: False mlir: False 28上8卡,dp2_mp4_pp1

Log file created at: 2022/10/22 05:35:21
Running on machine: oneflow-28
Running duration (h:mm:ss): 0:00:00
Log line format: [IWEF]yyyymmdd hh:mm:ss.uuuuuu threadid file:line] msg
I20221022 05:35:21.386425 4154418 env_global_objects_scope.cpp:167] Using rpc backend: gRPC
I20221022 05:35:21.387948 4154418 ctrl_server.cpp:40] CtrlServer listening on 0.0.0.0:33839
I20221022 05:35:21.388311 4154418 rank_info_bootstrap_server.cpp:47] RankInfoBootstrapServer listening on 0.0.0.0:12345
I20221022 05:35:21.963438 4154418 epoll_comm_network.cpp:63] CommNet:Epoll listening on 0.0.0.0:45327
W20221022 05:35:31.132243 4154418 env_global_objects_scope.cpp:285] Skip init RDMA because RDMA is not compiled!
I20221022 05:38:13.956075 4154418 version.cpp:22] OneFlow git version: v0.8.1-377-gff736fbee0
I20221022 05:38:13.956125 4154418 cuda_device_manager_factory.cpp:63] CUDA runtime version: 11.2
I20221022 05:38:13.956153 4154418 cuda_device_manager_factory.cpp:72] cuDNN version: 8.5.0
I20221022 05:38:13.956161 4154418 cuda_device_manager_factory.cpp:85] NCCL version: 2.12.10
I20221022 05:40:32.678282 4154418 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_0-InsertPinnedIdentityOpPass
I20221022 05:41:46.822155 4154418 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_0-InsertPinnedIdentityOpPass
I20221022 05:41:46.822685 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 InsertPinnedIdentityOpPass","mem_rss":"4641.000000 MB","mem_vm":"56691.000000 MB","time_cost":"74143 milliseconds"}
I20221022 05:41:46.822794 4154418 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_1-EliminateDeadNodesPass
I20221022 05:43:04.938809 4154418 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_1-EliminateDeadNodesPass
I20221022 05:43:04.940133 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 EliminateDeadNodesPass","mem_rss":"4671.000000 MB","mem_vm":"56722.000000 MB","time_cost":"78116 milliseconds"}
I20221022 05:43:04.940243 4154418 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_2-NormalizationExponentialAverageAutoTickPass
I20221022 05:43:04.940302 4154418 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_2-NormalizationExponentialAverageAutoTickPass
I20221022 05:43:04.940371 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 NormalizationExponentialAverageAutoTickPass","mem_rss":"4671.000000 MB","mem_vm":"56722.000000 MB","time_cost":"0 milliseconds"}
I20221022 05:43:04.940397 4154418 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_3-AutoMixedPrecision
I20221022 05:44:23.309800 4154418 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_3-AutoMixedPrecision
I20221022 05:44:23.311022 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 AutoMixedPrecision","mem_rss":"4672.000000 MB","mem_vm":"56722.000000 MB","time_cost":"78369 milliseconds"}
I20221022 05:44:23.311136 4154418 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_4-PruneAmpWhiteIdentityOpPass
I20221022 05:45:48.394618 4154418 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_4-PruneAmpWhiteIdentityOpPass
I20221022 05:45:48.396054 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 PruneAmpWhiteIdentityOpPass","mem_rss":"4681.000000 MB","mem_vm":"56731.000000 MB","time_cost":"85083 milliseconds"}
I20221022 05:45:48.396167 4154418 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_5-OptimizerPlacementOptimizationPass
I20221022 05:47:13.555724 4154418 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_5-OptimizerPlacementOptimizationPass
I20221022 05:47:13.556977 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 OptimizerPlacementOptimizationPass","mem_rss":"4681.000000 MB","mem_vm":"56731.000000 MB","time_cost":"85159 milliseconds"}
I20221022 05:47:13.557104 4154418 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_6-FuseAddToOutputPass
I20221022 05:48:35.749186 4154418 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_6-FuseAddToOutputPass
I20221022 05:48:35.750627 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 FuseAddToOutputPass","mem_rss":"4699.000000 MB","mem_vm":"56749.000000 MB","time_cost":"82192 milliseconds"}
I20221022 05:48:35.750733 4154418 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_7-DynamicLossScaleSchedulePass
I20221022 05:50:00.436586 4154418 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_7-DynamicLossScaleSchedulePass
I20221022 05:50:00.438042 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 DynamicLossScaleSchedulePass","mem_rss":"4699.000000 MB","mem_vm":"56749.000000 MB","time_cost":"84685 milliseconds"}
I20221022 05:50:00.438153 4154418 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_8-AutoTrainStep
I20221022 05:51:25.262576 4154418 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_8-AutoTrainStep
I20221022 05:51:25.264007 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 AutoTrainStep","mem_rss":"4699.000000 MB","mem_vm":"56749.000000 MB","time_cost":"84824 milliseconds"}
I20221022 05:51:25.264115 4154418 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_9-AutoLearningRate
I20221022 05:52:49.971565 4154418 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_9-AutoLearningRate
I20221022 05:52:49.974534 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 AutoLearningRate","mem_rss":"4699.000000 MB","mem_vm":"56749.000000 MB","time_cost":"84707 milliseconds"}
I20221022 05:52:49.974646 4154418 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_10-QuantAwareTraining
I20221022 05:52:49.974682 4154418 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_10-QuantAwareTraining
I20221022 05:52:49.975986 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 QuantAwareTraining","mem_rss":"4699.000000 MB","mem_vm":"56749.000000 MB","time_cost":"0 milliseconds"}
I20221022 05:52:49.976032 4154418 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_11-GenerateOptimizerOpConfs
I20221022 05:54:18.324594 4154418 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_11-GenerateOptimizerOpConfs
I20221022 05:54:18.326074 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 GenerateOptimizerOpConfs","mem_rss":"4699.000000 MB","mem_vm":"56749.000000 MB","time_cost":"88348 milliseconds"}
I20221022 05:54:18.326180 4154418 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_12-PrunePinnedIdentityOpPass
I20221022 05:56:38.821277 4154418 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_12-PrunePinnedIdentityOpPass
I20221022 05:56:38.822690 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 PrunePinnedIdentityOpPass","mem_rss":"4797.000000 MB","mem_vm":"56847.000000 MB","time_cost":"140495 milliseconds"}
I20221022 05:56:38.822783 4154418 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_13-ReplaceEmbeddingOps
I20221022 05:56:38.826603 4154418 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_13-ReplaceEmbeddingOps
I20221022 05:56:38.827806 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 ReplaceEmbeddingOps","mem_rss":"4797.000000 MB","mem_vm":"56847.000000 MB","time_cost":"3 milliseconds"}
I20221022 05:56:38.827850 4154418 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_14-SequentialOneEmbeddingOpsPass
I20221022 05:56:38.827872 4154418 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_14-SequentialOneEmbeddingOpsPass
I20221022 05:56:38.829057 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 SequentialOneEmbeddingOpsPass","mem_rss":"4797.000000 MB","mem_vm":"56847.000000 MB","time_cost":"0 milliseconds"}
I20221022 05:56:38.829097 4154418 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_15-FuseEmbeddingShuffleInteractionPass
I20221022 05:56:38.829113 4154418 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_15-FuseEmbeddingShuffleInteractionPass
I20221022 05:56:38.830181 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 FuseEmbeddingShuffleInteractionPass","mem_rss":"4797.000000 MB","mem_vm":"56847.000000 MB","time_cost":"0 milliseconds"}
I20221022 05:56:38.830219 4154418 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_16-FuseBCEReduceMeanFwBwPass
I20221022 05:56:38.830230 4154418 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_16-FuseBCEReduceMeanFwBwPass
I20221022 05:56:38.831312 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 FuseBCEReduceMeanFwBwPass","mem_rss":"4797.000000 MB","mem_vm":"56847.000000 MB","time_cost":"0 milliseconds"}
I20221022 05:56:38.831348 4154418 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_17-AddSspVariableProxy
I20221022 05:56:38.831384 4154418 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_17-AddSspVariableProxy
I20221022 05:56:38.832442 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 AddSspVariableProxy","mem_rss":"4797.000000 MB","mem_vm":"56847.000000 MB","time_cost":"0 milliseconds"}
I20221022 05:56:38.832479 4154418 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_18-CheckpointingPass
I20221022 05:58:54.473637 4154418 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_18-CheckpointingPass
I20221022 05:58:54.475041 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 CheckpointingPass","mem_rss":"4797.000000 MB","mem_vm":"56847.000000 MB","time_cost":"135641 milliseconds"}
I20221022 05:58:54.475134 4154418 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_19-CudnnFusedNormalizationAddReluPass
I20221022 05:58:54.480680 4154418 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_19-CudnnFusedNormalizationAddReluPass
I20221022 05:58:54.481894 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 CudnnFusedNormalizationAddReluPass","mem_rss":"4797.000000 MB","mem_vm":"56847.000000 MB","time_cost":"5 milliseconds"}
I20221022 05:58:54.481945 4154418 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_20-PruneCastToStaticShapeOpsPass
I20221022 06:01:11.140111 4154418 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_20-PruneCastToStaticShapeOpsPass
I20221022 06:01:11.141501 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 PruneCastToStaticShapeOpsPass","mem_rss":"4797.000000 MB","mem_vm":"56847.000000 MB","time_cost":"136658 milliseconds"}
I20221022 06:01:11.141592 4154418 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_21-FuseAddToOutputPass1
I20221022 06:03:23.708871 4154418 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_21-FuseAddToOutputPass1
I20221022 06:03:23.710299 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 FuseAddToOutputPass","mem_rss":"4797.000000 MB","mem_vm":"56847.000000 MB","time_cost":"132567 milliseconds"}
I20221022 06:03:23.710394 4154418 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_22-FuseConsecutiveAddPass
I20221022 06:03:23.711709 4154418 time_util.h:97] [count log]{"loc":"start","mem_rss":"4797.000000 MB","mem_vm":"56847.000000 MB","time_cost":"0 milliseconds"}
I20221022 06:05:34.476672 4154418 time_util.h:97] [count log]{"loc":"create op graph","mem_rss":"4797.000000 MB","mem_vm":"56847.000000 MB","time_cost":"130765 milliseconds"}
I20221022 06:05:34.524835 4154418 time_util.h:97] [count log]{"loc":"create job builder","mem_rss":"4797.000000 MB","mem_vm":"56847.000000 MB","time_cost":"48 milliseconds"}
I20221022 06:05:35.078070 4154418 time_util.h:97] [count log]{"loc":"fuse_add_apply","mem_rss":"4797.000000 MB","mem_vm":"56847.000000 MB","time_cost":"552 milliseconds"}
I20221022 06:05:35.744024 4154418 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_22-FuseConsecutiveAddPass
I20221022 06:05:35.745489 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 FuseConsecutiveAddPass","mem_rss":"4797.000000 MB","mem_vm":"56847.000000 MB","time_cost":"132033 milliseconds"}
I20221022 06:05:35.745585 4154418 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_23-IndexedSlicesOptimizerRewritePass
I20221022 06:05:35.745621 4154418 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_23-IndexedSlicesOptimizerRewritePass
I20221022 06:05:35.746891 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 IndexedSlicesOptimizerRewritePass","mem_rss":"4797.000000 MB","mem_vm":"56847.000000 MB","time_cost":"0 milliseconds"}
I20221022 06:05:35.746944 4154418 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_24-SplitSparseSoftmaxCrossEntropyOpPass
I20221022 06:07:47.625555 4154418 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_24-SplitSparseSoftmaxCrossEntropyOpPass
I20221022 06:07:47.628652 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 SplitSparseSoftmaxCrossEntropyOpPass","mem_rss":"4797.000000 MB","mem_vm":"56847.000000 MB","time_cost":"131878 milliseconds"}
I20221022 06:07:47.628754 4154418 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_25-DoParallelCastBeforeWideningTypeCast
I20221022 06:09:59.819835 4154418 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_25-DoParallelCastBeforeWideningTypeCast
I20221022 06:09:59.821261 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 DoParallelCastBeforeWideningTypeCast","mem_rss":"4797.000000 MB","mem_vm":"56847.000000 MB","time_cost":"132191 milliseconds"}
I20221022 06:09:59.821353 4154418 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_26-FuseCastScalePass
I20221022 06:12:11.193187 4154418 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_26-FuseCastScalePass
I20221022 06:12:11.194516 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 FuseCastScalePass","mem_rss":"4797.000000 MB","mem_vm":"56847.000000 MB","time_cost":"131371 milliseconds"}
I20221022 06:12:11.194612 4154418 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_27-PruneParallelCastOpsPass
I20221022 06:14:19.711030 4154418 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_27-PruneParallelCastOpsPass
I20221022 06:14:19.712443 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 PruneParallelCastOpsPass","mem_rss":"4797.000000 MB","mem_vm":"56847.000000 MB","time_cost":"128516 milliseconds"}
I20221022 06:14:19.712538 4154418 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_28-FuseUpdateOpsPass
I20221022 06:16:02.211665 4154418 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_28-FuseUpdateOpsPass
I20221022 06:16:02.213161 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 FuseUpdateOpsPass","mem_rss":"4797.000000 MB","mem_vm":"56847.000000 MB","time_cost":"102499 milliseconds"}
I20221022 06:16:02.213255 4154418 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_29-FuseModelUpdateCastOpsPass
I20221022 06:16:02.213294 4154418 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_29-FuseModelUpdateCastOpsPass
I20221022 06:16:02.214627 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 FuseModelUpdateCastOpsPass","mem_rss":"4797.000000 MB","mem_vm":"56847.000000 MB","time_cost":"0 milliseconds"}
I20221022 06:16:02.214679 4154418 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_30-MultiTensorModelUpdatePass
I20221022 06:16:02.214713 4154418 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_30-MultiTensorModelUpdatePass
I20221022 06:16:02.215942 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 MultiTensorModelUpdatePass","mem_rss":"4797.000000 MB","mem_vm":"56847.000000 MB","time_cost":"0 milliseconds"}
I20221022 06:16:02.215982 4154418 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_31-FixPipelineStageIdPass
I20221022 06:16:02.215996 4154418 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_31-FixPipelineStageIdPass
I20221022 06:16:02.217185 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 FixPipelineStageIdPass","mem_rss":"4797.000000 MB","mem_vm":"56847.000000 MB","time_cost":"0 milliseconds"}
I20221022 06:16:02.217223 4154418 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_32-PipelineBufferPass
I20221022 06:16:02.217236 4154418 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_32-PipelineBufferPass
I20221022 06:16:02.218402 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 PipelineBufferPass","mem_rss":"4797.000000 MB","mem_vm":"56847.000000 MB","time_cost":"0 milliseconds"}
I20221022 06:16:02.218443 4154418 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_33-AutoParallelPass
I20221022 06:16:02.218456 4154418 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_33-AutoParallelPass
I20221022 06:16:02.219581 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 AutoParallelPass","mem_rss":"4797.000000 MB","mem_vm":"56847.000000 MB","time_cost":"0 milliseconds"}
I20221022 06:16:02.219619 4154418 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_34-DumpVariableInfoPass
I20221022 06:16:02.219638 4154418 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_34-DumpVariableInfoPass
I20221022 06:16:02.220770 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 DumpVariableInfoPass","mem_rss":"4797.000000 MB","mem_vm":"56847.000000 MB","time_cost":"0 milliseconds"}
I20221022 06:16:02.220808 4154418 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_35-DumpBlobParallelConfPass
I20221022 06:17:24.803721 4154418 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_35-DumpBlobParallelConfPass
I20221022 06:17:24.805183 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 DumpBlobParallelConfPass","mem_rss":"4797.000000 MB","mem_vm":"56847.000000 MB","time_cost":"82583 milliseconds"}
I20221022 06:17:24.840914 4154418 time_util.h:95] [count log]{"loc":"[GraphCompile]GraphBase_0 OptimizationLogicalGraph","mem_rss":"4797.000000 MB","mem_vm":"56847.000000 MB","time_cost":"2212 seconds"}
I20221022 06:18:25.645573 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 GroupBoxingByDstParallel","mem_rss":"4879.000000 MB","mem_vm":"56928.000000 MB","time_cost":"16 seconds"}
I20221022 06:19:12.214596 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 BoxingWithMiddleNodes","mem_rss":"4879.000000 MB","mem_vm":"56928.000000 MB","time_cost":"46 seconds"}
I20221022 06:19:23.773802 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 SetCtrlInOpName4VariableOp","mem_rss":"4879.000000 MB","mem_vm":"56928.000000 MB","time_cost":"11 seconds"}
I20221022 06:19:38.333357 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 AutoPrependTick","mem_rss":"4879.000000 MB","mem_vm":"56928.000000 MB","time_cost":"14 seconds"}
I20221022 06:19:49.823163 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 AddTickForTimeShape","mem_rss":"4879.000000 MB","mem_vm":"56928.000000 MB","time_cost":"11 seconds"}
I20221022 06:20:04.942091 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 MultiClientAutoSourceAndSinkTick","mem_rss":"4898.000000 MB","mem_vm":"56946.000000 MB","time_cost":"15 seconds"}
I20221022 06:20:16.437093 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 MultiClientAutoInterfaceCriticalSectionTick","mem_rss":"4898.000000 MB","mem_vm":"56946.000000 MB","time_cost":"11 seconds"}
I20221022 06:20:16.439081 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 SystemOpFillJobNamePass","mem_rss":"4898.000000 MB","mem_vm":"56946.000000 MB","time_cost":"0 seconds"}
I20221022 06:20:28.202147 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 DumpBlobParallelConfPass","mem_rss":"4898.000000 MB","mem_vm":"56946.000000 MB","time_cost":"11 seconds"}
I20221022 06:20:49.055743 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 InsertNcclLogicalOpPass","mem_rss":"4947.000000 MB","mem_vm":"56996.000000 MB","time_cost":"20 seconds"}
I20221022 06:21:02.174326 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 DumpBlobParallelConfPass","mem_rss":"4947.000000 MB","mem_vm":"56996.000000 MB","time_cost":"13 seconds"}
I20221022 06:21:15.132742 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 CheckOpGraph","mem_rss":"4947.000000 MB","mem_vm":"56996.000000 MB","time_cost":"12 seconds"}
I20221022 06:21:15.134864 4154418 time_util.h:95] [count log]{"loc":"[GraphCompile]GraphBase_0 CompleteJob","mem_rss":"4947.000000 MB","mem_vm":"56996.000000 MB","time_cost":"207 seconds"}
I20221022 06:21:27.789683 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 NewOpGraph","mem_rss":"4947.000000 MB","mem_vm":"56996.000000 MB","time_cost":"12 seconds"}
I20221022 06:23:02.981364 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 BuildTaskGraph","mem_rss":"5393.000000 MB","mem_vm":"57441.000000 MB","time_cost":"95 seconds"}
I20221022 06:23:30.783481 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 AddTaskToPlan","mem_rss":"6013.000000 MB","mem_vm":"62050.000000 MB","time_cost":"27 seconds"}
I20221022 06:23:35.383213 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 InferMemShare","mem_rss":"6025.000000 MB","mem_vm":"62594.000000 MB","time_cost":"4 seconds"}
I20221022 06:23:43.919066 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 GenMemBlockAndChunk","mem_rss":"6032.000000 MB","mem_vm":"62082.000000 MB","time_cost":"7 seconds"}
I20221022 06:23:43.920292 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 LogPlan","mem_rss":"6032.000000 MB","mem_vm":"62082.000000 MB","time_cost":"0 seconds"}
I20221022 06:23:44.194475 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 GenRegisterHint","mem_rss":"6032.000000 MB","mem_vm":"62082.000000 MB","time_cost":"0 seconds"}
I20221022 06:23:44.200961 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 GenCollectiveBoxingPlan","mem_rss":"6032.000000 MB","mem_vm":"62082.000000 MB","time_cost":"0 seconds"}
I20221022 06:23:44.376152 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 DumpCtrlRegstInfoToPlan","mem_rss":"6032.000000 MB","mem_vm":"62082.000000 MB","time_cost":"0 seconds"}
I20221022 06:23:44.860862 4154418 plan_util.cpp:965] 
 Graph name GraphBase_0 in Rank: 0, Device: 0 needs to allocate [ 11852.6 MiB ] device memory. 
   In general, Chunk id: 0  memory is [ 4510 MiB ] with mem_block_num = 1
        Unreused memory not eager var is  [ 12.6049 MiB ] with mem_block_num = 9405
        Eager Variable Tensor total memory is [ 7330.05 MiB ] with mem_block_num = 2616
I20221022 06:23:44.887748 4154418 plan_util.cpp:965] 
 Graph name GraphBase_0 in Rank: 1, Device: 1 needs to allocate [ 11852.6 MiB ] device memory. 
   In general, Chunk id: 4  memory is [ 4510 MiB ] with mem_block_num = 1
        Unreused memory not eager var is  [ 12.606 MiB ] with mem_block_num = 9411
        Eager Variable Tensor total memory is [ 7330.05 MiB ] with mem_block_num = 2616
I20221022 06:23:44.914260 4154418 plan_util.cpp:965] 
 Graph name GraphBase_0 in Rank: 2, Device: 2 needs to allocate [ 11852.6 MiB ] device memory. 
   In general, Chunk id: 2  memory is [ 4510 MiB ] with mem_block_num = 1
        Unreused memory not eager var is  [ 12.606 MiB ] with mem_block_num = 9411
        Eager Variable Tensor total memory is [ 7330.05 MiB ] with mem_block_num = 2616
I20221022 06:23:44.940851 4154418 plan_util.cpp:965] 
 Graph name GraphBase_0 in Rank: 3, Device: 3 needs to allocate [ 11852.6 MiB ] device memory. 
   In general, Chunk id: 1  memory is [ 4510 MiB ] with mem_block_num = 1
        Unreused memory not eager var is  [ 12.606 MiB ] with mem_block_num = 9411
        Eager Variable Tensor total memory is [ 7330.05 MiB ] with mem_block_num = 2616
I20221022 06:23:44.967319 4154418 plan_util.cpp:965] 
 Graph name GraphBase_0 in Rank: 4, Device: 4 needs to allocate [ 11854.9 MiB ] device memory. 
   In general, Chunk id: 3  memory is [ 4512.23 MiB ] with mem_block_num = 1
        Unreused memory not eager var is  [ 12.6075 MiB ] with mem_block_num = 9418
        Eager Variable Tensor total memory is [ 7330.05 MiB ] with mem_block_num = 2616
I20221022 06:23:44.993511 4154418 plan_util.cpp:965] 
 Graph name GraphBase_0 in Rank: 5, Device: 5 needs to allocate [ 11852.7 MiB ] device memory. 
   In general, Chunk id: 6  memory is [ 4510 MiB ] with mem_block_num = 1
        Unreused memory not eager var is  [ 12.6075 MiB ] with mem_block_num = 9418
        Eager Variable Tensor total memory is [ 7330.05 MiB ] with mem_block_num = 2616
I20221022 06:23:45.019721 4154418 plan_util.cpp:965] 
 Graph name GraphBase_0 in Rank: 6, Device: 6 needs to allocate [ 11852.7 MiB ] device memory. 
   In general, Chunk id: 5  memory is [ 4510 MiB ] with mem_block_num = 1
        Unreused memory not eager var is  [ 12.6075 MiB ] with mem_block_num = 9418
        Eager Variable Tensor total memory is [ 7330.05 MiB ] with mem_block_num = 2616
I20221022 06:23:45.046115 4154418 plan_util.cpp:965] 
 Graph name GraphBase_0 in Rank: 7, Device: 7 needs to allocate [ 11852.7 MiB ] device memory. 
   In general, Chunk id: 7  memory is [ 4510 MiB ] with mem_block_num = 1
        Unreused memory not eager var is  [ 12.6075 MiB ] with mem_block_num = 9418
        Eager Variable Tensor total memory is [ 7330.05 MiB ] with mem_block_num = 2616
I20221022 06:23:45.309185 4154418 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 GenMemAndLightPlanLog","mem_rss":"6032.000000 MB","mem_vm":"62082.000000 MB","time_cost":"0 seconds"}
I20221022 06:23:45.309792 4154418 time_util.h:95] [count log]{"loc":"[GraphCompile]GraphBase_0 CompilePlan","mem_rss":"6032.000000 MB","mem_vm":"62082.000000 MB","time_cost":"150 seconds"}
I20221022 06:23:56.358386 4154418 time_util.h:95] [count log]{"loc":"[GraphCompile]GraphBase_0 SyncPlan","mem_rss":"6032.000000 MB","mem_vm":"62082.000000 MB","time_cost":"11 seconds"}
F20221022 06:24:10.314756 4154418 memory_allocator.cpp:50] out of memory
  File "/home/xiezipeng/oneflow/oneflow/core/memory/memory_allocator.cpp", line 50, in Allocate
    device->Alloc(options, &ptr, size)
Error Type: oneflow.ErrorProto.out_of_memory_error
strint commented 2 years ago

05:40:32 ~ 06:17:24 = 37 分钟 = 2220s OptimizationLogicalGraph","mem_rss":"4797.000000 MB","mem_vm":"56847.000000 MB","time_cost":"2212 seconds"

这时间有点奇怪,和base差异很大。https://github.com/Oneflow-Inc/libai/issues/407#issuecomment-1284914484

@xiezipeng-ML 再使用提供给 idea 的 oneflow 版本跑下看看:

python3 -m pip install --pre oneflow -f https://staging.oneflow.info/branch/release/compile_cost_cnt/cu112
xiezipeng-ML commented 2 years ago
compile graph time 482.94916362699587
version: 0.8.1+cu112.git.63a3b287
git_commit: 63a3b287
cmake_build_type: Release
rdma: True
mlir: True

Log file created at: 2022/10/22 10:55:22
Running on machine: oneflow-28
Running duration (h:mm:ss): 0:00:00
Log line format: [IWEF]yyyymmdd hh:mm:ss.uuuuuu threadid file:line] msg
I20221022 10:55:22.566063 14175 env_global_objects_scope.cpp:167] Using rpc backend: gRPC
I20221022 10:55:22.566787 14175 ctrl_server.cpp:40] CtrlServer listening on 0.0.0.0:36307
I20221022 10:55:22.566893 14175 rank_info_bootstrap_server.cpp:47] RankInfoBootstrapServer listening on 0.0.0.0:12345
I20221022 10:55:23.123452 14175 epoll_comm_network.cpp:63] CommNet:Epoll listening on 0.0.0.0:45343
I20221022 10:55:32.340703 14175 ibverbs_comm_network.cpp:147] Using IB device mlx5_0 port 1 gid index 0
I20221022 10:55:39.492198 14175 ibverbs_comm_network.cpp:177] Connected to peer 7
I20221022 10:55:39.553431 14175 ibverbs_comm_network.cpp:177] Connected to peer 2
I20221022 10:55:39.609606 14175 ibverbs_comm_network.cpp:177] Connected to peer 1
I20221022 10:55:39.667016 14175 ibverbs_comm_network.cpp:177] Connected to peer 3
I20221022 10:55:39.728710 14175 ibverbs_comm_network.cpp:177] Connected to peer 4
I20221022 10:55:39.784607 14175 ibverbs_comm_network.cpp:177] Connected to peer 5
I20221022 10:55:39.833755 14175 ibverbs_comm_network.cpp:177] Connected to peer 6
I20221022 10:57:58.425228 14175 version.cpp:22] OneFlow git version: 63a3b287
I20221022 10:57:58.425261 14175 cuda_device_manager_factory.cpp:63] CUDA runtime version: 11.2
I20221022 10:57:58.425271 14175 cuda_device_manager_factory.cpp:72] cuDNN version: 8.1.1
I20221022 10:57:58.425278 14175 cuda_device_manager_factory.cpp:85] NCCL version: 2.12.10
I20221022 10:57:59.410562 14175 ibverbs_comm_network.cpp:147] Using IB device mlx5_0 port 1 gid index 0
I20221022 10:57:59.584656 14175 ibverbs_comm_network.cpp:177] Connected to peer 7
I20221022 10:57:59.585238 14175 ibverbs_comm_network.cpp:177] Connected to peer 2
I20221022 10:57:59.585676 14175 ibverbs_comm_network.cpp:177] Connected to peer 1
I20221022 10:57:59.586221 14175 ibverbs_comm_network.cpp:177] Connected to peer 3
I20221022 10:57:59.587119 14175 ibverbs_comm_network.cpp:177] Connected to peer 4
I20221022 10:57:59.587698 14175 ibverbs_comm_network.cpp:177] Connected to peer 5
I20221022 10:57:59.588161 14175 ibverbs_comm_network.cpp:177] Connected to peer 6
I20221022 10:58:23.431185 14175 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_0-InsertPinnedIdentityOpPass
I20221022 10:58:31.888835 14175 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_0-InsertPinnedIdentityOpPass
I20221022 10:58:31.889170 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 InsertPinnedIdentityOpPass","mem_rss":"4497.000000 MB","mem_vm":"57407.000000 MB","time_cost":"8 seconds"}
I20221022 10:58:31.889187 14175 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_1-EliminateDeadNodesPass
I20221022 10:58:40.873653 14175 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_1-EliminateDeadNodesPass
I20221022 10:58:40.874619 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 EliminateDeadNodesPass","mem_rss":"4518.000000 MB","mem_vm":"57428.000000 MB","time_cost":"8 seconds"}
I20221022 10:58:40.874642 14175 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_2-NormalizationExponentialAverageAutoTickPass
I20221022 10:58:40.874658 14175 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_2-NormalizationExponentialAverageAutoTickPass
I20221022 10:58:40.874689 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 NormalizationExponentialAverageAutoTickPass","mem_rss":"4518.000000 MB","mem_vm":"57428.000000 MB","time_cost":"0 seconds"}
I20221022 10:58:40.874707 14175 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_3-AutoMixedPrecision
I20221022 10:58:49.803017 14175 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_3-AutoMixedPrecision
I20221022 10:58:49.804092 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 AutoMixedPrecision","mem_rss":"4518.000000 MB","mem_vm":"57428.000000 MB","time_cost":"8 seconds"}
I20221022 10:58:49.804131 14175 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_4-PruneAmpWhiteIdentityOpPass
I20221022 10:58:59.530297 14175 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_4-PruneAmpWhiteIdentityOpPass
I20221022 10:58:59.531471 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 PruneAmpWhiteIdentityOpPass","mem_rss":"4525.000000 MB","mem_vm":"57435.000000 MB","time_cost":"9 seconds"}
I20221022 10:58:59.531494 14175 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_5-OptimizerPlacementOptimizationPass
I20221022 10:59:09.231950 14175 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_5-OptimizerPlacementOptimizationPass
I20221022 10:59:09.232913 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 OptimizerPlacementOptimizationPass","mem_rss":"4525.000000 MB","mem_vm":"57435.000000 MB","time_cost":"9 seconds"}
I20221022 10:59:09.232935 14175 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_6-FuseAddToOutputPass
I20221022 10:59:18.671962 14175 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_6-FuseAddToOutputPass
I20221022 10:59:18.673094 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 FuseAddToOutputPass","mem_rss":"4540.000000 MB","mem_vm":"57450.000000 MB","time_cost":"9 seconds"}
I20221022 10:59:18.673116 14175 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_7-IRRoundTripBeforeAD
I20221022 10:59:18.673132 14175 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_7-IRRoundTripBeforeAD
I20221022 10:59:18.674122 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 IRRoundTripBeforeAD","mem_rss":"4540.000000 MB","mem_vm":"57450.000000 MB","time_cost":"0 seconds"}
I20221022 10:59:18.674139 14175 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_8-DynamicLossScaleSchedulePass
I20221022 10:59:28.306999 14175 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_8-DynamicLossScaleSchedulePass
I20221022 10:59:28.308109 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 DynamicLossScaleSchedulePass","mem_rss":"4540.000000 MB","mem_vm":"57450.000000 MB","time_cost":"9 seconds"}
I20221022 10:59:28.308131 14175 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_9-AutoTrainStep
I20221022 10:59:37.948289 14175 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_9-AutoTrainStep
I20221022 10:59:37.949409 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 AutoTrainStep","mem_rss":"4540.000000 MB","mem_vm":"57450.000000 MB","time_cost":"9 seconds"}
I20221022 10:59:37.949430 14175 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_10-AutoLearningRate
I20221022 10:59:47.565222 14175 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_10-AutoLearningRate
I20221022 10:59:47.566335 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 AutoLearningRate","mem_rss":"4540.000000 MB","mem_vm":"57450.000000 MB","time_cost":"9 seconds"}
I20221022 10:59:47.566360 14175 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_11-QuantAwareTraining
I20221022 10:59:47.566371 14175 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_11-QuantAwareTraining
I20221022 10:59:47.567209 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 QuantAwareTraining","mem_rss":"4540.000000 MB","mem_vm":"57450.000000 MB","time_cost":"0 seconds"}
I20221022 10:59:47.567224 14175 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_12-GenerateOptimizerOpConfs
I20221022 10:59:57.560886 14175 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_12-GenerateOptimizerOpConfs
I20221022 10:59:57.562186 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 GenerateOptimizerOpConfs","mem_rss":"4540.000000 MB","mem_vm":"57450.000000 MB","time_cost":"9 seconds"}
I20221022 10:59:57.562211 14175 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_13-PrunePinnedIdentityOpPass
I20221022 11:02:16.803112 14175 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_13-PrunePinnedIdentityOpPass
I20221022 11:02:16.803378 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 PrunePinnedIdentityOpPass","mem_rss":"4616.000000 MB","mem_vm":"57526.000000 MB","time_cost":"139 seconds"}
I20221022 11:02:16.803396 14175 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_14-ReplaceEmbeddingOps
I20221022 11:02:32.237046 14175 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_14-ReplaceEmbeddingOps
I20221022 11:02:32.238175 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 ReplaceEmbeddingOps","mem_rss":"4616.000000 MB","mem_vm":"57526.000000 MB","time_cost":"15 seconds"}
I20221022 11:02:32.238198 14175 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_15-SequentialOneEmbeddingOpsPass
I20221022 11:02:32.238214 14175 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_15-SequentialOneEmbeddingOpsPass
I20221022 11:02:32.239285 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 SequentialOneEmbeddingOpsPass","mem_rss":"4616.000000 MB","mem_vm":"57526.000000 MB","time_cost":"0 seconds"}
I20221022 11:02:32.239301 14175 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_16-FuseEmbeddingShuffleInteractionPass
I20221022 11:02:32.239310 14175 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_16-FuseEmbeddingShuffleInteractionPass
I20221022 11:02:32.240469 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 FuseEmbeddingShuffleInteractionPass","mem_rss":"4616.000000 MB","mem_vm":"57526.000000 MB","time_cost":"0 seconds"}
I20221022 11:02:32.240484 14175 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_17-FuseBCEReduceMeanFwBwPass
I20221022 11:02:32.240491 14175 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_17-FuseBCEReduceMeanFwBwPass
I20221022 11:02:32.241539 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 FuseBCEReduceMeanFwBwPass","mem_rss":"4616.000000 MB","mem_vm":"57526.000000 MB","time_cost":"0 seconds"}
I20221022 11:02:32.241554 14175 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_18-AddSspVariableProxy
I20221022 11:02:32.241571 14175 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_18-AddSspVariableProxy
I20221022 11:02:32.242605 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 AddSspVariableProxy","mem_rss":"4616.000000 MB","mem_vm":"57526.000000 MB","time_cost":"0 seconds"}
I20221022 11:02:32.242619 14175 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_19-CheckpointingPass
I20221022 11:02:47.658746 14175 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_19-CheckpointingPass
I20221022 11:02:47.659873 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 CheckpointingPass","mem_rss":"4616.000000 MB","mem_vm":"57526.000000 MB","time_cost":"15 seconds"}
I20221022 11:02:47.659895 14175 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_20-CudnnFusedNormalizationAddReluPass
I20221022 11:02:47.664316 14175 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_20-CudnnFusedNormalizationAddReluPass
I20221022 11:02:47.665417 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 CudnnFusedNormalizationAddReluPass","mem_rss":"4616.000000 MB","mem_vm":"57526.000000 MB","time_cost":"0 seconds"}
I20221022 11:02:47.665449 14175 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_21-PruneCastToStaticShapeOpsPass
I20221022 11:03:03.287129 14175 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_21-PruneCastToStaticShapeOpsPass
I20221022 11:03:03.288264 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 PruneCastToStaticShapeOpsPass","mem_rss":"4616.000000 MB","mem_vm":"57526.000000 MB","time_cost":"15 seconds"}
I20221022 11:03:03.288287 14175 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_22-IRRoundTrip
I20221022 11:03:03.288307 14175 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_22-IRRoundTrip
I20221022 11:03:03.289397 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 IRRoundTrip","mem_rss":"4616.000000 MB","mem_vm":"57526.000000 MB","time_cost":"0 seconds"}
I20221022 11:03:03.289413 14175 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_23-FuseAddToOutputPass1
I20221022 11:03:18.397606 14175 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_23-FuseAddToOutputPass1
I20221022 11:03:18.398777 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 FuseAddToOutputPass","mem_rss":"4619.000000 MB","mem_vm":"57529.000000 MB","time_cost":"15 seconds"}
I20221022 11:03:18.398799 14175 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_24-FuseConsecutiveAddPass
I20221022 11:04:48.886399 14175 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_24-FuseConsecutiveAddPass
I20221022 11:04:48.887552 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 FuseConsecutiveAddPass","mem_rss":"4619.000000 MB","mem_vm":"57529.000000 MB","time_cost":"90 seconds"}
I20221022 11:04:48.887573 14175 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_25-IndexedSlicesOptimizerRewritePass
I20221022 11:04:48.887588 14175 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_25-IndexedSlicesOptimizerRewritePass
I20221022 11:04:48.888661 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 IndexedSlicesOptimizerRewritePass","mem_rss":"4619.000000 MB","mem_vm":"57529.000000 MB","time_cost":"0 seconds"}
I20221022 11:04:48.888677 14175 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_26-SplitSparseSoftmaxCrossEntropyOpPass
I20221022 11:05:04.100203 14175 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_26-SplitSparseSoftmaxCrossEntropyOpPass
I20221022 11:05:04.101336 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 SplitSparseSoftmaxCrossEntropyOpPass","mem_rss":"4619.000000 MB","mem_vm":"57529.000000 MB","time_cost":"15 seconds"}
I20221022 11:05:04.101358 14175 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_27-DoParallelCastBeforeWideningTypeCast
I20221022 11:05:19.041433 14175 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_27-DoParallelCastBeforeWideningTypeCast
I20221022 11:05:19.042604 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 DoParallelCastBeforeWideningTypeCast","mem_rss":"4619.000000 MB","mem_vm":"57529.000000 MB","time_cost":"14 seconds"}
I20221022 11:05:19.042626 14175 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_28-FuseCastScalePass
I20221022 11:05:34.012495 14175 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_28-FuseCastScalePass
I20221022 11:05:34.013909 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 FuseCastScalePass","mem_rss":"4619.000000 MB","mem_vm":"57529.000000 MB","time_cost":"14 seconds"}
I20221022 11:05:34.013933 14175 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_29-PruneParallelCastOpsPass
I20221022 11:05:48.619318 14175 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_29-PruneParallelCastOpsPass
I20221022 11:05:48.620549 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 PruneParallelCastOpsPass","mem_rss":"4619.000000 MB","mem_vm":"57529.000000 MB","time_cost":"14 seconds"}
I20221022 11:05:48.620573 14175 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_30-FuseUpdateOpsPass
I20221022 11:06:00.374001 14175 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_30-FuseUpdateOpsPass
I20221022 11:06:00.375268 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 FuseUpdateOpsPass","mem_rss":"4619.000000 MB","mem_vm":"57529.000000 MB","time_cost":"11 seconds"}
I20221022 11:06:00.375291 14175 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_31-FuseModelUpdateCastOpsPass
I20221022 11:06:00.375303 14175 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_31-FuseModelUpdateCastOpsPass
I20221022 11:06:00.375629 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 FuseModelUpdateCastOpsPass","mem_rss":"4619.000000 MB","mem_vm":"57529.000000 MB","time_cost":"0 seconds"}
I20221022 11:06:00.375639 14175 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_32-MultiTensorModelUpdatePass
I20221022 11:06:00.375645 14175 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_32-MultiTensorModelUpdatePass
I20221022 11:06:00.376771 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 MultiTensorModelUpdatePass","mem_rss":"4619.000000 MB","mem_vm":"57529.000000 MB","time_cost":"0 seconds"}
I20221022 11:06:00.376785 14175 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_33-FixPipelineStageIdPass
I20221022 11:06:09.759965 14175 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_33-FixPipelineStageIdPass
I20221022 11:06:09.761188 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 FixPipelineStageIdPass","mem_rss":"4619.000000 MB","mem_vm":"57529.000000 MB","time_cost":"9 seconds"}
I20221022 11:06:09.761210 14175 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_34-PipelineBufferPass
I20221022 11:06:19.146199 14175 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_34-PipelineBufferPass
I20221022 11:06:19.148458 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 PipelineBufferPass","mem_rss":"4619.000000 MB","mem_vm":"57529.000000 MB","time_cost":"9 seconds"}
I20221022 11:06:19.148483 14175 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_35-AutoParallelPass
I20221022 11:06:19.148499 14175 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_35-AutoParallelPass
I20221022 11:06:19.150429 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 AutoParallelPass","mem_rss":"4619.000000 MB","mem_vm":"57529.000000 MB","time_cost":"0 seconds"}
I20221022 11:06:19.150446 14175 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_36-DumpVariableInfoPass
I20221022 11:06:19.150455 14175 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_36-DumpVariableInfoPass
I20221022 11:06:19.151705 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 DumpVariableInfoPass","mem_rss":"4619.000000 MB","mem_vm":"57529.000000 MB","time_cost":"0 seconds"}
I20221022 11:06:19.151717 14175 job_build_and_infer_ctx.cpp:943] GraphBase_0 start compiling with pass pass_cnt_37-DumpBlobParallelConfPass
I20221022 11:06:28.569283 14175 job_build_and_infer_ctx.cpp:957] GraphBase_0 finish compiling with pass pass_cnt_37-DumpBlobParallelConfPass
I20221022 11:06:28.570505 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 DumpBlobParallelConfPass","mem_rss":"4619.000000 MB","mem_vm":"57529.000000 MB","time_cost":"9 seconds"}
I20221022 11:06:28.596138 14175 time_util.h:95] [count log]{"loc":"[GraphCompile]GraphBase_0 OptimizationLogicalGraph","mem_rss":"4619.000000 MB","mem_vm":"57529.000000 MB","time_cost":"485 seconds"}
I20221022 11:06:38.123991 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 GroupBoxingByDstParallel","mem_rss":"4714.000000 MB","mem_vm":"57624.000000 MB","time_cost":"2 seconds"}
I20221022 11:06:50.423911 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 BoxingWithMiddleNodes","mem_rss":"4714.000000 MB","mem_vm":"57624.000000 MB","time_cost":"12 seconds"}
I20221022 11:06:51.981482 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 SetCtrlInOpName4VariableOp","mem_rss":"4714.000000 MB","mem_vm":"57624.000000 MB","time_cost":"1 seconds"}
I20221022 11:06:53.710587 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 AutoPrependTick","mem_rss":"4714.000000 MB","mem_vm":"57624.000000 MB","time_cost":"1 seconds"}
I20221022 11:06:55.253793 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 AddTickForTimeShape","mem_rss":"4714.000000 MB","mem_vm":"57624.000000 MB","time_cost":"1 seconds"}
I20221022 11:06:57.123100 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 MultiClientAutoSourceAndSinkTick","mem_rss":"4722.000000 MB","mem_vm":"57631.000000 MB","time_cost":"1 seconds"}
I20221022 11:06:58.677875 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 MultiClientAutoInterfaceCriticalSectionTick","mem_rss":"4722.000000 MB","mem_vm":"57631.000000 MB","time_cost":"1 seconds"}
I20221022 11:06:58.679165 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 SystemOpFillJobNamePass","mem_rss":"4722.000000 MB","mem_vm":"57631.000000 MB","time_cost":"0 seconds"}
I20221022 11:07:00.303792 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 DumpBlobParallelConfPass","mem_rss":"4722.000000 MB","mem_vm":"57631.000000 MB","time_cost":"1 seconds"}
I20221022 11:07:03.175424 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 InsertNcclLogicalOpPass","mem_rss":"4764.000000 MB","mem_vm":"57673.000000 MB","time_cost":"2 seconds"}
I20221022 11:07:05.096524 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 DumpBlobParallelConfPass","mem_rss":"4764.000000 MB","mem_vm":"57673.000000 MB","time_cost":"1 seconds"}
I20221022 11:07:06.988651 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 CheckOpGraph","mem_rss":"4764.000000 MB","mem_vm":"57673.000000 MB","time_cost":"1 seconds"}
I20221022 11:07:06.989764 14175 time_util.h:95] [count log]{"loc":"[GraphCompile]GraphBase_0 CompleteJob","mem_rss":"4764.000000 MB","mem_vm":"57673.000000 MB","time_cost":"33 seconds"}
I20221022 11:07:08.469779 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 NewOpGraph","mem_rss":"4764.000000 MB","mem_vm":"57673.000000 MB","time_cost":"1 seconds"}
I20221022 11:07:15.863741 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 BuildTaskGraph","mem_rss":"5171.000000 MB","mem_vm":"58079.000000 MB","time_cost":"7 seconds"}
I20221022 11:07:19.546250 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 AddTaskToPlan","mem_rss":"5687.000000 MB","mem_vm":"62689.000000 MB","time_cost":"3 seconds"}
I20221022 11:07:20.458809 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 InferMemShare","mem_rss":"5699.000000 MB","mem_vm":"63233.000000 MB","time_cost":"0 seconds"}
I20221022 11:07:21.753882 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 GenMemBlockAndChunk","mem_rss":"5705.000000 MB","mem_vm":"62721.000000 MB","time_cost":"0 seconds"}
I20221022 11:07:21.754942 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 LogPlan","mem_rss":"5705.000000 MB","mem_vm":"62721.000000 MB","time_cost":"0 seconds"}
I20221022 11:07:21.824506 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 GenRegisterHint","mem_rss":"5705.000000 MB","mem_vm":"62721.000000 MB","time_cost":"0 seconds"}
I20221022 11:07:21.826962 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 GenCollectiveBoxingPlan","mem_rss":"5705.000000 MB","mem_vm":"62721.000000 MB","time_cost":"0 seconds"}
I20221022 11:07:21.912904 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 DumpCtrlRegstInfoToPlan","mem_rss":"5705.000000 MB","mem_vm":"62721.000000 MB","time_cost":"0 seconds"}
I20221022 11:07:22.057842 14175 plan_util.cpp:965] 
 Graph name GraphBase_0 in Rank: 0, Device: 0 needs to allocate [ 11854.9 MiB ] device memory. 
   In general, Chunk id: 6  memory is [ 4512.24 MiB ] with mem_block_num = 1
        Unreused memory not eager var is  [ 12.6049 MiB ] with mem_block_num = 9406
        Eager Variable Tensor total memory is [ 7330.05 MiB ] with mem_block_num = 2616
I20221022 11:07:22.059667 14175 plan_util.cpp:965] 
 Graph name GraphBase_0 in Rank: 1, Device: 1 needs to allocate [ 11854.9 MiB ] device memory. 
   In general, Chunk id: 3  memory is [ 4512.24 MiB ] with mem_block_num = 1
        Unreused memory not eager var is  [ 12.606 MiB ] with mem_block_num = 9412
        Eager Variable Tensor total memory is [ 7330.05 MiB ] with mem_block_num = 2616
I20221022 11:07:22.061463 14175 plan_util.cpp:965] 
 Graph name GraphBase_0 in Rank: 2, Device: 2 needs to allocate [ 11854.9 MiB ] device memory. 
   In general, Chunk id: 4  memory is [ 4512.24 MiB ] with mem_block_num = 1
        Unreused memory not eager var is  [ 12.606 MiB ] with mem_block_num = 9412
        Eager Variable Tensor total memory is [ 7330.05 MiB ] with mem_block_num = 2616
I20221022 11:07:22.063234 14175 plan_util.cpp:965] 
 Graph name GraphBase_0 in Rank: 3, Device: 3 needs to allocate [ 11852.7 MiB ] device memory. 
   In general, Chunk id: 1  memory is [ 4510.03 MiB ] with mem_block_num = 1
        Unreused memory not eager var is  [ 12.606 MiB ] with mem_block_num = 9412
        Eager Variable Tensor total memory is [ 7330.05 MiB ] with mem_block_num = 2616
I20221022 11:07:22.065021 14175 plan_util.cpp:965] 
 Graph name GraphBase_0 in Rank: 4, Device: 4 needs to allocate [ 11854.9 MiB ] device memory. 
   In general, Chunk id: 0  memory is [ 4512.24 MiB ] with mem_block_num = 1
        Unreused memory not eager var is  [ 12.6075 MiB ] with mem_block_num = 9419
        Eager Variable Tensor total memory is [ 7330.05 MiB ] with mem_block_num = 2616
I20221022 11:07:22.066795 14175 plan_util.cpp:965] 
 Graph name GraphBase_0 in Rank: 5, Device: 5 needs to allocate [ 11852.7 MiB ] device memory. 
   In general, Chunk id: 7  memory is [ 4510.03 MiB ] with mem_block_num = 1
        Unreused memory not eager var is  [ 12.6075 MiB ] with mem_block_num = 9419
        Eager Variable Tensor total memory is [ 7330.05 MiB ] with mem_block_num = 2616
I20221022 11:07:22.068562 14175 plan_util.cpp:965] 
 Graph name GraphBase_0 in Rank: 6, Device: 6 needs to allocate [ 11852.7 MiB ] device memory. 
   In general, Chunk id: 2  memory is [ 4510.03 MiB ] with mem_block_num = 1
        Unreused memory not eager var is  [ 12.6075 MiB ] with mem_block_num = 9419
        Eager Variable Tensor total memory is [ 7330.05 MiB ] with mem_block_num = 2616
I20221022 11:07:22.070333 14175 plan_util.cpp:965] 
 Graph name GraphBase_0 in Rank: 7, Device: 7 needs to allocate [ 11852.7 MiB ] device memory. 
   In general, Chunk id: 5  memory is [ 4510.03 MiB ] with mem_block_num = 1
        Unreused memory not eager var is  [ 12.6075 MiB ] with mem_block_num = 9419
        Eager Variable Tensor total memory is [ 7330.05 MiB ] with mem_block_num = 2616
I20221022 11:07:22.257691 14175 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 GenMemAndLightPlanLog","mem_rss":"5705.000000 MB","mem_vm":"62721.000000 MB","time_cost":"0 seconds"}
I20221022 11:07:22.258664 14175 time_util.h:95] [count log]{"loc":"[GraphCompile]GraphBase_0 CompilePlan","mem_rss":"5705.000000 MB","mem_vm":"62721.000000 MB","time_cost":"15 seconds"}
I20221022 11:07:24.616750 14175 time_util.h:95] [count log]{"loc":"[GraphCompile]GraphBase_0 SyncPlan","mem_rss":"5705.000000 MB","mem_vm":"62721.000000 MB","time_cost":"2 seconds"}
F20221022 11:07:29.913981 14175 memory_allocator.cpp:50] out of memory
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/memory/memory_allocator.cpp", line 50, in Allocate
    device->Alloc(options, &ptr, size)
Error Type: oneflow.ErrorProto.out_of_memory_error

@strint

xiezipeng-ML commented 2 years ago
195.47112141002435 compile graph time.......................................
version: 0.8.1+cu112.git.ff736fbee0
git_commit: ff736fbee0
cmake_build_type: Release
rdma: False
mlir: False

Log file created at: 2022/10/22 13:09:15
Running on machine: oneflow-28
Running duration (h:mm:ss): 0:00:00
Log line format: [IWEF]yyyymmdd hh:mm:ss.uuuuuu threadid file:line] msg
I20221022 13:09:15.897207 240087 env_global_objects_scope.cpp:167] Using rpc backend: gRPC
I20221022 13:09:15.897891 240087 ctrl_server.cpp:40] CtrlServer listening on 0.0.0.0:43899
I20221022 13:09:15.897990 240087 rank_info_bootstrap_server.cpp:47] RankInfoBootstrapServer listening on 0.0.0.0:12345
I20221022 13:09:16.427913 240087 epoll_comm_network.cpp:63] CommNet:Epoll listening on 0.0.0.0:41569
W20221022 13:09:26.467613 240087 env_global_objects_scope.cpp:285] Skip init RDMA because RDMA is not compiled!
I20221022 13:11:45.522619 240087 version.cpp:22] OneFlow git version: v0.8.1-377-gff736fbee0-snapshot
I20221022 13:11:45.522656 240087 cuda_device_manager_factory.cpp:63] CUDA runtime version: 11.2
I20221022 13:11:45.522679 240087 cuda_device_manager_factory.cpp:72] cuDNN version: 8.5.0
I20221022 13:11:45.522683 240087 cuda_device_manager_factory.cpp:85] NCCL version: 2.12.10
I20221022 13:12:03.942937 240087 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_0-InsertPinnedIdentityOpPass
I20221022 13:12:10.412919 240087 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_0-InsertPinnedIdentityOpPass
I20221022 13:12:10.413372 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 InsertPinnedIdentityOpPass","mem_rss":"4329.000000 MB","mem_vm":"56348.000000 MB","time_cost":"6470 milliseconds"}
I20221022 13:12:10.413403 240087 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_1-EliminateDeadNodesPass
I20221022 13:12:17.357093 240087 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_1-EliminateDeadNodesPass
I20221022 13:12:17.358330 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 EliminateDeadNodesPass","mem_rss":"4360.000000 MB","mem_vm":"56379.000000 MB","time_cost":"6943 milliseconds"}
I20221022 13:12:17.358357 240087 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_2-NormalizationExponentialAverageAutoTickPass
I20221022 13:12:17.358379 240087 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_2-NormalizationExponentialAverageAutoTickPass
I20221022 13:12:17.358420 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 NormalizationExponentialAverageAutoTickPass","mem_rss":"4360.000000 MB","mem_vm":"56379.000000 MB","time_cost":"0 milliseconds"}
I20221022 13:12:17.358429 240087 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_3-AutoMixedPrecision
I20221022 13:12:24.232499 240087 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_3-AutoMixedPrecision
I20221022 13:12:24.233603 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 AutoMixedPrecision","mem_rss":"4360.000000 MB","mem_vm":"56379.000000 MB","time_cost":"6874 milliseconds"}
I20221022 13:12:24.233630 240087 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_4-PruneAmpWhiteIdentityOpPass
I20221022 13:12:31.839972 240087 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_4-PruneAmpWhiteIdentityOpPass
I20221022 13:12:31.841379 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 PruneAmpWhiteIdentityOpPass","mem_rss":"4370.000000 MB","mem_vm":"56389.000000 MB","time_cost":"7606 milliseconds"}
I20221022 13:12:31.841421 240087 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_5-OptimizerPlacementOptimizationPass
I20221022 13:12:39.271486 240087 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_5-OptimizerPlacementOptimizationPass
I20221022 13:12:39.272691 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 OptimizerPlacementOptimizationPass","mem_rss":"4370.000000 MB","mem_vm":"56389.000000 MB","time_cost":"7430 milliseconds"}
I20221022 13:12:39.272716 240087 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_6-FuseAddToOutputPass
I20221022 13:12:46.663730 240087 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_6-FuseAddToOutputPass
I20221022 13:12:46.665120 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 FuseAddToOutputPass","mem_rss":"4388.000000 MB","mem_vm":"56407.000000 MB","time_cost":"7391 milliseconds"}
I20221022 13:12:46.665144 240087 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_7-DynamicLossScaleSchedulePass
I20221022 13:12:54.140133 240087 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_7-DynamicLossScaleSchedulePass
I20221022 13:12:54.141510 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 DynamicLossScaleSchedulePass","mem_rss":"4388.000000 MB","mem_vm":"56407.000000 MB","time_cost":"7475 milliseconds"}
I20221022 13:12:54.141535 240087 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_8-AutoTrainStep
I20221022 13:13:01.567118 240087 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_8-AutoTrainStep
I20221022 13:13:01.568488 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 AutoTrainStep","mem_rss":"4388.000000 MB","mem_vm":"56407.000000 MB","time_cost":"7425 milliseconds"}
I20221022 13:13:01.568511 240087 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_9-AutoLearningRate
I20221022 13:13:08.981353 240087 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_9-AutoLearningRate
I20221022 13:13:08.982764 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 AutoLearningRate","mem_rss":"4388.000000 MB","mem_vm":"56407.000000 MB","time_cost":"7412 milliseconds"}
I20221022 13:13:08.982787 240087 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_10-QuantAwareTraining
I20221022 13:13:08.982797 240087 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_10-QuantAwareTraining
I20221022 13:13:08.983955 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 QuantAwareTraining","mem_rss":"4388.000000 MB","mem_vm":"56407.000000 MB","time_cost":"0 milliseconds"}
I20221022 13:13:08.983969 240087 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_11-GenerateOptimizerOpConfs
I20221022 13:13:16.783442 240087 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_11-GenerateOptimizerOpConfs
I20221022 13:13:16.784693 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 GenerateOptimizerOpConfs","mem_rss":"4388.000000 MB","mem_vm":"56407.000000 MB","time_cost":"7799 milliseconds"}
I20221022 13:13:16.784718 240087 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_12-PrunePinnedIdentityOpPass
I20221022 13:13:29.295590 240087 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_12-PrunePinnedIdentityOpPass
I20221022 13:13:29.296972 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 PrunePinnedIdentityOpPass","mem_rss":"4486.000000 MB","mem_vm":"56505.000000 MB","time_cost":"12510 milliseconds"}
I20221022 13:13:29.296999 240087 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_13-ReplaceEmbeddingOps
I20221022 13:13:29.300660 240087 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_13-ReplaceEmbeddingOps
I20221022 13:13:29.301877 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 ReplaceEmbeddingOps","mem_rss":"4486.000000 MB","mem_vm":"56505.000000 MB","time_cost":"3 milliseconds"}
I20221022 13:13:29.301894 240087 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_14-SequentialOneEmbeddingOpsPass
I20221022 13:13:29.301905 240087 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_14-SequentialOneEmbeddingOpsPass
I20221022 13:13:29.303092 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 SequentialOneEmbeddingOpsPass","mem_rss":"4486.000000 MB","mem_vm":"56505.000000 MB","time_cost":"0 milliseconds"}
I20221022 13:13:29.303125 240087 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_15-FuseEmbeddingShuffleInteractionPass
I20221022 13:13:29.303133 240087 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_15-FuseEmbeddingShuffleInteractionPass
I20221022 13:13:29.304189 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 FuseEmbeddingShuffleInteractionPass","mem_rss":"4486.000000 MB","mem_vm":"56505.000000 MB","time_cost":"0 milliseconds"}
I20221022 13:13:29.304200 240087 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_16-FuseBCEReduceMeanFwBwPass
I20221022 13:13:29.304208 240087 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_16-FuseBCEReduceMeanFwBwPass
I20221022 13:13:29.305286 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 FuseBCEReduceMeanFwBwPass","mem_rss":"4486.000000 MB","mem_vm":"56505.000000 MB","time_cost":"0 milliseconds"}
I20221022 13:13:29.305299 240087 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_17-AddSspVariableProxy
I20221022 13:13:29.305318 240087 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_17-AddSspVariableProxy
I20221022 13:13:29.306389 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 AddSspVariableProxy","mem_rss":"4486.000000 MB","mem_vm":"56505.000000 MB","time_cost":"0 milliseconds"}
I20221022 13:13:29.306402 240087 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_18-CheckpointingPass
I20221022 13:13:41.173058 240087 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_18-CheckpointingPass
I20221022 13:13:41.174474 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 CheckpointingPass","mem_rss":"4486.000000 MB","mem_vm":"56505.000000 MB","time_cost":"11866 milliseconds"}
I20221022 13:13:41.174499 240087 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_19-CudnnFusedNormalizationAddReluPass
I20221022 13:13:41.179014 240087 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_19-CudnnFusedNormalizationAddReluPass
I20221022 13:13:41.180243 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 CudnnFusedNormalizationAddReluPass","mem_rss":"4486.000000 MB","mem_vm":"56505.000000 MB","time_cost":"4 milliseconds"}
I20221022 13:13:41.180259 240087 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_20-PruneCastToStaticShapeOpsPass
I20221022 13:13:53.324599 240087 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_20-PruneCastToStaticShapeOpsPass
I20221022 13:13:53.325943 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 PruneCastToStaticShapeOpsPass","mem_rss":"4486.000000 MB","mem_vm":"56505.000000 MB","time_cost":"12144 milliseconds"}
I20221022 13:13:53.325969 240087 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_21-FuseAddToOutputPass1
I20221022 13:14:05.108726 240087 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_21-FuseAddToOutputPass1
I20221022 13:14:05.110198 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 FuseAddToOutputPass","mem_rss":"4486.000000 MB","mem_vm":"56505.000000 MB","time_cost":"11782 milliseconds"}
I20221022 13:14:05.110222 240087 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_22-FuseConsecutiveAddPass
I20221022 13:14:05.111447 240087 time_util.h:97] [count log]{"loc":"start","mem_rss":"4486.000000 MB","mem_vm":"56505.000000 MB","time_cost":"0 milliseconds"}
I20221022 13:14:16.325372 240087 time_util.h:97] [count log]{"loc":"create op graph","mem_rss":"4486.000000 MB","mem_vm":"56505.000000 MB","time_cost":"11214 milliseconds"}
I20221022 13:14:16.354195 240087 time_util.h:97] [count log]{"loc":"create job builder","mem_rss":"4486.000000 MB","mem_vm":"56505.000000 MB","time_cost":"28 milliseconds"}
I20221022 13:14:16.613657 240087 time_util.h:97] [count log]{"loc":"fuse_add_apply","mem_rss":"4486.000000 MB","mem_vm":"56505.000000 MB","time_cost":"259 milliseconds"}
I20221022 13:14:16.887930 240087 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_22-FuseConsecutiveAddPass
I20221022 13:14:16.889369 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 FuseConsecutiveAddPass","mem_rss":"4486.000000 MB","mem_vm":"56505.000000 MB","time_cost":"11777 milliseconds"}
I20221022 13:14:16.889408 240087 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_23-IndexedSlicesOptimizerRewritePass
I20221022 13:14:16.889432 240087 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_23-IndexedSlicesOptimizerRewritePass
I20221022 13:14:16.890709 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 IndexedSlicesOptimizerRewritePass","mem_rss":"4486.000000 MB","mem_vm":"56505.000000 MB","time_cost":"0 milliseconds"}
I20221022 13:14:16.890723 240087 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_24-SplitSparseSoftmaxCrossEntropyOpPass
I20221022 13:14:28.530187 240087 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_24-SplitSparseSoftmaxCrossEntropyOpPass
I20221022 13:14:28.531649 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 SplitSparseSoftmaxCrossEntropyOpPass","mem_rss":"4486.000000 MB","mem_vm":"56505.000000 MB","time_cost":"11639 milliseconds"}
I20221022 13:14:28.531674 240087 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_25-DoParallelCastBeforeWideningTypeCast
I20221022 13:14:40.150116 240087 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_25-DoParallelCastBeforeWideningTypeCast
I20221022 13:14:40.151437 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 DoParallelCastBeforeWideningTypeCast","mem_rss":"4486.000000 MB","mem_vm":"56505.000000 MB","time_cost":"11618 milliseconds"}
I20221022 13:14:40.151458 240087 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_26-FuseCastScalePass
I20221022 13:14:51.797700 240087 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_26-FuseCastScalePass
I20221022 13:14:51.798893 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 FuseCastScalePass","mem_rss":"4486.000000 MB","mem_vm":"56505.000000 MB","time_cost":"11646 milliseconds"}
I20221022 13:14:51.798918 240087 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_27-PruneParallelCastOpsPass
I20221022 13:15:03.215318 240087 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_27-PruneParallelCastOpsPass
I20221022 13:15:03.216629 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 PruneParallelCastOpsPass","mem_rss":"4486.000000 MB","mem_vm":"56505.000000 MB","time_cost":"11416 milliseconds"}
I20221022 13:15:03.216652 240087 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_28-FuseUpdateOpsPass
I20221022 13:15:12.452742 240087 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_28-FuseUpdateOpsPass
I20221022 13:15:12.456284 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 FuseUpdateOpsPass","mem_rss":"4486.000000 MB","mem_vm":"56505.000000 MB","time_cost":"9236 milliseconds"}
I20221022 13:15:12.456310 240087 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_29-FuseModelUpdateCastOpsPass
I20221022 13:15:12.456326 240087 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_29-FuseModelUpdateCastOpsPass
I20221022 13:15:12.459229 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 FuseModelUpdateCastOpsPass","mem_rss":"4486.000000 MB","mem_vm":"56505.000000 MB","time_cost":"0 milliseconds"}
I20221022 13:15:12.459259 240087 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_30-MultiTensorModelUpdatePass
I20221022 13:15:12.459266 240087 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_30-MultiTensorModelUpdatePass
I20221022 13:15:12.461501 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 MultiTensorModelUpdatePass","mem_rss":"4486.000000 MB","mem_vm":"56505.000000 MB","time_cost":"0 milliseconds"}
I20221022 13:15:12.461516 240087 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_31-FixPipelineStageIdPass
I20221022 13:15:12.461524 240087 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_31-FixPipelineStageIdPass
I20221022 13:15:12.463618 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 FixPipelineStageIdPass","mem_rss":"4486.000000 MB","mem_vm":"56505.000000 MB","time_cost":"0 milliseconds"}
I20221022 13:15:12.463635 240087 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_32-PipelineBufferPass
I20221022 13:15:12.463645 240087 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_32-PipelineBufferPass
I20221022 13:15:12.465061 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 PipelineBufferPass","mem_rss":"4486.000000 MB","mem_vm":"56505.000000 MB","time_cost":"0 milliseconds"}
I20221022 13:15:12.465075 240087 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_33-AutoParallelPass
I20221022 13:15:12.465081 240087 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_33-AutoParallelPass
I20221022 13:15:12.466367 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 AutoParallelPass","mem_rss":"4486.000000 MB","mem_vm":"56505.000000 MB","time_cost":"0 milliseconds"}
I20221022 13:15:12.466380 240087 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_34-DumpVariableInfoPass
I20221022 13:15:12.466388 240087 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_34-DumpVariableInfoPass
I20221022 13:15:12.467504 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 DumpVariableInfoPass","mem_rss":"4486.000000 MB","mem_vm":"56505.000000 MB","time_cost":"0 milliseconds"}
I20221022 13:15:12.467515 240087 job_build_and_infer_ctx.cpp:944] GraphBase_0 start compiling with pass pass_cnt_35-DumpBlobParallelConfPass
I20221022 13:15:19.872992 240087 job_build_and_infer_ctx.cpp:958] GraphBase_0 finish compiling with pass pass_cnt_35-DumpBlobParallelConfPass
I20221022 13:15:19.874497 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 DumpBlobParallelConfPass","mem_rss":"4486.000000 MB","mem_vm":"56505.000000 MB","time_cost":"7405 milliseconds"}
I20221022 13:15:19.899092 240087 time_util.h:95] [count log]{"loc":"[GraphCompile]GraphBase_0 OptimizationLogicalGraph","mem_rss":"4486.000000 MB","mem_vm":"56505.000000 MB","time_cost":"195 seconds"}
I20221022 13:15:24.927664 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 GroupBoxingByDstParallel","mem_rss":"4567.000000 MB","mem_vm":"56586.000000 MB","time_cost":"1 seconds"}
I20221022 13:15:32.856587 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 BoxingWithMiddleNodes","mem_rss":"4567.000000 MB","mem_vm":"56586.000000 MB","time_cost":"7 seconds"}
I20221022 13:15:34.249616 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 SetCtrlInOpName4VariableOp","mem_rss":"4567.000000 MB","mem_vm":"56586.000000 MB","time_cost":"1 seconds"}
I20221022 13:15:35.775238 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 AutoPrependTick","mem_rss":"4567.000000 MB","mem_vm":"56586.000000 MB","time_cost":"1 seconds"}
I20221022 13:15:37.156741 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 AddTickForTimeShape","mem_rss":"4567.000000 MB","mem_vm":"56586.000000 MB","time_cost":"1 seconds"}
I20221022 13:15:38.828123 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 MultiClientAutoSourceAndSinkTick","mem_rss":"4586.000000 MB","mem_vm":"56604.000000 MB","time_cost":"1 seconds"}
I20221022 13:15:40.219985 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 MultiClientAutoInterfaceCriticalSectionTick","mem_rss":"4586.000000 MB","mem_vm":"56604.000000 MB","time_cost":"1 seconds"}
I20221022 13:15:40.221357 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 SystemOpFillJobNamePass","mem_rss":"4586.000000 MB","mem_vm":"56604.000000 MB","time_cost":"0 seconds"}
I20221022 13:15:41.672730 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 DumpBlobParallelConfPass","mem_rss":"4586.000000 MB","mem_vm":"56604.000000 MB","time_cost":"1 seconds"}
I20221022 13:15:44.102623 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 InsertNcclLogicalOpPass","mem_rss":"4635.000000 MB","mem_vm":"56654.000000 MB","time_cost":"2 seconds"}
I20221022 13:15:45.754177 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 DumpBlobParallelConfPass","mem_rss":"4635.000000 MB","mem_vm":"56654.000000 MB","time_cost":"1 seconds"}
I20221022 13:15:47.406502 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 CheckOpGraph","mem_rss":"4635.000000 MB","mem_vm":"56654.000000 MB","time_cost":"1 seconds"}
I20221022 13:15:47.407749 240087 time_util.h:95] [count log]{"loc":"[GraphCompile]GraphBase_0 CompleteJob","mem_rss":"4635.000000 MB","mem_vm":"56654.000000 MB","time_cost":"26 seconds"}
I20221022 13:15:48.659965 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 NewOpGraph","mem_rss":"4635.000000 MB","mem_vm":"56654.000000 MB","time_cost":"1 seconds"}
I20221022 13:15:55.909827 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 BuildTaskGraph","mem_rss":"5082.000000 MB","mem_vm":"57098.000000 MB","time_cost":"7 seconds"}
I20221022 13:16:00.287573 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 AddTaskToPlan","mem_rss":"5700.000000 MB","mem_vm":"61708.000000 MB","time_cost":"4 seconds"}
I20221022 13:16:01.043056 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 InferMemShare","mem_rss":"5712.000000 MB","mem_vm":"62252.000000 MB","time_cost":"0 seconds"}
I20221022 13:16:02.033635 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 GenMemBlockAndChunk","mem_rss":"5720.000000 MB","mem_vm":"61740.000000 MB","time_cost":"0 seconds"}
I20221022 13:16:02.034662 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 LogPlan","mem_rss":"5720.000000 MB","mem_vm":"61740.000000 MB","time_cost":"0 seconds"}
I20221022 13:16:02.113368 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 GenRegisterHint","mem_rss":"5720.000000 MB","mem_vm":"61740.000000 MB","time_cost":"0 seconds"}
I20221022 13:16:02.115805 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 GenCollectiveBoxingPlan","mem_rss":"5720.000000 MB","mem_vm":"61740.000000 MB","time_cost":"0 seconds"}
I20221022 13:16:02.185315 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 DumpCtrlRegstInfoToPlan","mem_rss":"5720.000000 MB","mem_vm":"61740.000000 MB","time_cost":"0 seconds"}
I20221022 13:16:02.321313 240087 plan_util.cpp:965] 
 Graph name GraphBase_0 in Rank: 0, Device: 0 needs to allocate [ 11854.9 MiB ] device memory. 
   In general, Chunk id: 0  memory is [ 4512.23 MiB ] with mem_block_num = 1
        Unreused memory not eager var is  [ 12.6049 MiB ] with mem_block_num = 9405
        Eager Variable Tensor total memory is [ 7330.05 MiB ] with mem_block_num = 2616
I20221022 13:16:02.322976 240087 plan_util.cpp:965] 
 Graph name GraphBase_0 in Rank: 1, Device: 1 needs to allocate [ 11854.9 MiB ] device memory. 
   In general, Chunk id: 7  memory is [ 4512.23 MiB ] with mem_block_num = 1
        Unreused memory not eager var is  [ 12.606 MiB ] with mem_block_num = 9411
        Eager Variable Tensor total memory is [ 7330.05 MiB ] with mem_block_num = 2616
I20221022 13:16:02.324580 240087 plan_util.cpp:965] 
 Graph name GraphBase_0 in Rank: 2, Device: 2 needs to allocate [ 11854.9 MiB ] device memory. 
   In general, Chunk id: 3  memory is [ 4512.23 MiB ] with mem_block_num = 1
        Unreused memory not eager var is  [ 12.606 MiB ] with mem_block_num = 9411
        Eager Variable Tensor total memory is [ 7330.05 MiB ] with mem_block_num = 2616
I20221022 13:16:02.326141 240087 plan_util.cpp:965] 
 Graph name GraphBase_0 in Rank: 3, Device: 3 needs to allocate [ 11854.9 MiB ] device memory. 
   In general, Chunk id: 1  memory is [ 4512.23 MiB ] with mem_block_num = 1
        Unreused memory not eager var is  [ 12.606 MiB ] with mem_block_num = 9411
        Eager Variable Tensor total memory is [ 7330.05 MiB ] with mem_block_num = 2616
I20221022 13:16:02.327700 240087 plan_util.cpp:965] 
 Graph name GraphBase_0 in Rank: 4, Device: 4 needs to allocate [ 11852.7 MiB ] device memory. 
   In general, Chunk id: 2  memory is [ 4510.03 MiB ] with mem_block_num = 1
        Unreused memory not eager var is  [ 12.6075 MiB ] with mem_block_num = 9418
        Eager Variable Tensor total memory is [ 7330.05 MiB ] with mem_block_num = 2616
I20221022 13:16:02.329223 240087 plan_util.cpp:965] 
 Graph name GraphBase_0 in Rank: 5, Device: 5 needs to allocate [ 11852.7 MiB ] device memory. 
   In general, Chunk id: 4  memory is [ 4510.03 MiB ] with mem_block_num = 1
        Unreused memory not eager var is  [ 12.6075 MiB ] with mem_block_num = 9418
        Eager Variable Tensor total memory is [ 7330.05 MiB ] with mem_block_num = 2616
I20221022 13:16:02.330744 240087 plan_util.cpp:965] 
 Graph name GraphBase_0 in Rank: 6, Device: 6 needs to allocate [ 11852.7 MiB ] device memory. 
   In general, Chunk id: 6  memory is [ 4510.03 MiB ] with mem_block_num = 1
        Unreused memory not eager var is  [ 12.6075 MiB ] with mem_block_num = 9418
        Eager Variable Tensor total memory is [ 7330.05 MiB ] with mem_block_num = 2616
I20221022 13:16:02.332269 240087 plan_util.cpp:965] 
 Graph name GraphBase_0 in Rank: 7, Device: 7 needs to allocate [ 11852.7 MiB ] device memory. 
   In general, Chunk id: 5  memory is [ 4510.03 MiB ] with mem_block_num = 1
        Unreused memory not eager var is  [ 12.6075 MiB ] with mem_block_num = 9418
        Eager Variable Tensor total memory is [ 7330.05 MiB ] with mem_block_num = 2616
I20221022 13:16:02.476676 240087 time_util.h:97] [count log]{"loc":"[GraphCompile]GraphBase_0 GenMemAndLightPlanLog","mem_rss":"5720.000000 MB","mem_vm":"61740.000000 MB","time_cost":"0 seconds"}
I20221022 13:16:02.477138 240087 time_util.h:95] [count log]{"loc":"[GraphCompile]GraphBase_0 CompilePlan","mem_rss":"5720.000000 MB","mem_vm":"61740.000000 MB","time_cost":"15 seconds"}
I20221022 13:16:04.704389 240087 time_util.h:95] [count log]{"loc":"[GraphCompile]GraphBase_0 SyncPlan","mem_rss":"5720.000000 MB","mem_vm":"61740.000000 MB","time_cost":"2 seconds"}
F20221022 13:16:09.865818 240087 memory_allocator.cpp:50] out of memory
  File "/home/xiezipeng/oneflow/oneflow/core/memory/memory_allocator.cpp", line 50, in Allocate
    device->Alloc(options, &ptr, size)
Error Type: oneflow.ErrorProto.out_of_memory_error

@strint 啸宇 这里重新编译测了一遍,应该正常了

xiezipeng-ML commented 2 years ago

看graph的编译时间是快了2.5倍 compile graph time:195.47112141002435 (https://github.com/Oneflow-Inc/oneflow/pull/9281 ) compile graph time 482.94916362699587 (nightly)

减少 59 %

strint commented 2 years ago

之前 pass 的编译时间:

graph 编译开销 15:14:04~15:33:37,总时间为 19.5

现在是 9.1 分钟。减少 53%,符合预期。

I20221028 15:12:11.334821 2360039 time_util.h:95] [count log]{"loc":"[GraphCompile]GraphBase_0 OptimizationLogicalGraph","mem_rss":"21991.000000 MB","mem_vm":"354360.000000 MB","time_cost":"546 seconds"}