flexflow / FlexFlow

FlexFlow Serve: Low-Latency, High-Performance LLM Serving
https://flexflow.readthedocs.io
Apache License 2.0

When I run speculative inference following the Quickstart in the README, this bug happens. I set num_gpus=2 and memory_per_gpu=40000. How can I solve the problem? #1181

Open kar9999 opened 1 year ago

kar9999 commented 1 year ago
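For context, the speculative-inference Quickstart in the README at the time looked roughly like the sketch below. The model names, zero-copy memory size, generation settings, and prompt are illustrative assumptions rather than values confirmed by this report (though the two-layer, 768-wide speculative graph in the log is consistent with a small draft model such as JackFram/llama-68m):

```python
import flexflow.serve as ff

# Runtime configuration; num_gpus and memory_per_gpu (MB) match the report.
# zero_copy_memory_per_node is an assumed value, not one from the report.
ff.init(
    num_gpus=2,
    memory_per_gpu=40000,
    zero_copy_memory_per_node=30000,
)

# Serve a large target model with a small speculative model (SSM) as draft.
# Both model names are assumptions for illustration.
llm = ff.LLM("meta-llama/Llama-2-7b-hf")
ssms = [ff.SSM("JackFram/llama-68m")]

generation_config = ff.GenerationConfig(
    do_sample=False, temperature=0.9, topp=0.8, topk=1
)

# Compile the draft model(s) first, then the target model; passing the
# ssms list is what enables speculative inference.
for ssm in ssms:
    ssm.compile(generation_config)
llm.compile(generation_config, ssms=ssms)

result = llm.generate("Here are some travel tips for Tokyo:\n")
```

Running a script along these lines with the reported settings produced the log below, which ends in a CUDA failure: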

```
[0 - 7f09abe0a4c0] 0.423606 {3}{Mapper}: Enabled Control Replication Optimizations.
[0 - 7f09abe0a4c0] 0.423645 {3}{Mapper}: Enabled Control Replication Optimizations.
[0 - 7f09abe0a4c0] 0.423656 {3}{Mapper}: Enabled Control Replication Optimizations.
[0 - 7f09abe0a4c0] 0.423664 {3}{Mapper}: Enabled Control Replication Optimizations.
[0 - 7f09abe0a4c0] 0.423672 {3}{Mapper}: Enabled Control Replication Optimizations.
[0 - 7f09abe0a4c0] 0.423679 {3}{Mapper}: Enabled Control Replication Optimizations.
workSpaceSize (1024 MB)
workSpaceSize (1024 MB)
spec create operator: layers_0_attention_1000003
spec create operator: layers_1_attention_1000010
num_nodes = 1 num_gpus_per_node = 2
optimal_views.size = 20
views.size() = 20
Deserialized Views...
node[5000020]: type(Input_5000020) view(1 1 0)
node[5000032]: type(Dense_5000032) view(1 1 0) inEdge(node(5000031) idx(1))
node[5000021]: type(Embedding_5000021) view(1 1 0) inEdge(node(5000020) idx(0))
node[5000034]: type(SigmoidSiluMulti_5000034) view(1 1 0) inEdge(node(5000032) idx(0)) inEdge(node(5000033) idx(0))
node[5000022]: type(RMSNorm_5000022) view(1 1 0) inEdge(node(5000021) idx(0))
node[5000035]: type(Dense_5000035) view(1 1 0) inEdge(node(5000034) idx(0))
node[5000023]: type(SpecIncMultiHeadSelfAttention_5000023) view(1 1 0) inEdge(node(5000022) idx(0))
node[5000033]: type(Dense_5000033) view(1 1 0) inEdge(node(5000031) idx(1))
node[5000031]: type(ResidualRMSNorm_5000031) view(1 1 0) inEdge(node(5000030) idx(0)) inEdge(node(5000029) idx(0))
node[5000030]: type(SpecIncMultiHeadSelfAttention_5000030) view(1 1 0) inEdge(node(5000029) idx(1))
node[5000029]: type(ResidualRMSNorm_5000029) view(1 1 0) inEdge(node(5000028) idx(0)) inEdge(node(5000024) idx(0))
node[5000028]: type(Dense_5000028) view(1 1 0) inEdge(node(5000027) idx(0))
node[5000027]: type(SigmoidSiluMulti_5000027) view(1 1 0) inEdge(node(5000025) idx(0)) inEdge(node(5000026) idx(0))
node[5000025]: type(Dense_5000025) view(1 1 0) inEdge(node(5000024) idx(1))
node[5000038]: type(Softmax_5000038) view(1 1 0) inEdge(node(5000037) idx(0))
node[5000039]: type(ArgMax_5000039) view(1 1 0) inEdge(node(5000038) idx(0))
node[5000026]: type(Dense_5000026) view(1 1 0) inEdge(node(5000024) idx(1))
node[5000037]: type(Dense_5000037) view(1 1 0) inEdge(node(5000036) idx(1))
node[5000024]: type(ResidualRMSNorm_5000024) view(1 1 0) inEdge(node(5000023) idx(0)) inEdge(node(5000021) idx(0))
node[5000036]: type(ResidualRMSNorm_5000036) view(1 1 0) inEdge(node(5000035) idx(0)) inEdge(node(5000031) idx(0))
digraph taskgraph {
node0 [label="{ Dense_5000032 | { 3072/1 | 1/1 | 64/1 | 1/1 } }",shape=record];
node1 -> node0;
node1 [label="{ ResidualRMSNorm_5000031 }",shape=record];
node2 -> node1;
node3 -> node1;
node3 [label="{ SpecIncMultiHeadSelfAttention_5000030 | { 768/1 | 1/1 | 64/1 | 1/1 } }",shape=record];
node2 -> node3;
node2 [label="{ ResidualRMSNorm_5000029 }",shape=record];
node4 -> node2;
node5 -> node2;
node5 [label="{ Dense_5000028 | { 768/1 | 1/1 | 64/1 | 1/1 } }",shape=record];
node6 -> node5;
node6 [label="{ SigmoidSiluMulti_5000027 | { 3072/1 | 1/1 | 64/1 | 1/1 } }",shape=record];
node7 -> node6;
node8 -> node6;
node7 [label="{ Dense_5000026 | { 3072/1 | 1/1 | 64/1 | 1/1 } }",shape=record];
node4 -> node7;
node9 [label="{ ArgMax_5000039 }",shape=record];
node10 -> node9;
node10 [label="{ Softmax_5000038 | { 32000/1 | 1/1 | 64/1 | 1/1 } }",shape=record];
node11 -> node10;
node8 [label="{ Dense_5000025 | { 3072/1 | 1/1 | 64/1 | 1/1 } }",shape=record];
node4 -> node8;
node11 [label="{ Dense_5000037 | { 32000/1 | 1/1 | 64/1 | 1/1 } }",shape=record];
node12 -> node11;
node4 [label="{ ResidualRMSNorm_5000024 }",shape=record];
node13 -> node4;
node14 -> node4;
node12 [label="{ ResidualRMSNorm_5000036 }",shape=record];
node1 -> node12;
node15 -> node12;
node14 [label="{ SpecIncMultiHeadSelfAttention_5000023 | { 768/1 | 1/1 | 64/1 | 1/1 } }",shape=record];
node16 -> node14;
node15 [label="{ Dense_5000035 | { 768/1 | 1/1 | 64/1 | 1/1 } }",shape=record];
node17 -> node15;
node16 [label="{ RMSNorm_5000022 | { 768/1 | 1/1 | 64/1 | 1/1 } }",shape=record];
node13 -> node16;
node17 [label="{ SigmoidSiluMulti_5000034 | { 3072/1 | 1/1 | 64/1 | 1/1 } }",shape=record];
node18 -> node17;
node0 -> node17;
node13 [label="{ Embedding_5000021 | { 768/1 | 1/1 | 64/1 | 1/1 } }",shape=record];
node19 -> node13;
node18 [label="{ Dense_5000033 | { 3072/1 | 1/1 | 64/1 | 1/1 } }",shape=record];
node1 -> node18;
node19 [label="{ Input_5000020 | { shape([ 1/1 64/1 1/1 ]) } }",shape=record];
}
Applying fusion optimizations during compilation...
35 operators before fusion...
18 operators after fusion...
2023-10-07 10:34:25.366530: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2023-10-07 10:34:25.422504: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-10-07 10:34:26.265941: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
ndim(1) dims[1 0 0 0]
operator[0]: type(Input) guid(2000040) outputs[0] region(6,1,1)
operator[1]: type(Weight) guid(2000042) outputs[0] region(8,2,2)
operator[2]: type(FusedOp) guid(2000075) inputs[0] region(6,1,1) outputs[0] region(10,3,3) outputs[1] region(14,5,5) outputs[2] region(18,7,7) outputs[3] region(22,9,9) outputs[4] region(24,10,10) outputs[5] region(28,12,12) outputs[6] region(32,14,14) outputs[7] region(34,15,15) outputs[8] region(38,17,17) outputs[9] region(42,19,19) outputs[10] region(44,20,20) outputs[11] region(48,22,22) outputs[12] region(52,24,24) outputs[13] region(54,25,25) outputs[14] region(58,27,27) outputs[15] region(62,29,29) outputs[16] region(64,30,30) outputs[17] region(68,32,32) outputs[18] region(72,34,34) outputs[19] region(74,35,35) outputs[20] region(78,37,37) outputs[21] region(80,38,38) weights[0] region(8,2,2) weights[1] region(12,4,4) weights[2] region(16,6,6) weights[3] region(20,8,8) weights[4] region(26,11,11) weights[5] region(30,13,13) weights[6] region(36,16,16) weights[7] region(40,18,18) weights[8] region(46,21,21) weights[9] region(50,23,23) weights[10] region(56,26,26) weights[11] region(60,28,28) weights[12] region(66,31,31) weights[13] region(70,33,33) weights[14] region(76,36,36)
operator[3]: type(Weight) guid(2000044) outputs[0] region(12,4,4)
operator[4]: type(Weight) guid(2000046) outputs[0] region(16,6,6)
operator[5]: type(Weight) guid(2000048) outputs[0] region(20,8,8)
operator[6]: type(Weight) guid(2000050) outputs[0] region(26,11,11)
operator[7]: type(Weight) guid(2000052) outputs[0] region(30,13,13)
operator[8]: type(Weight) guid(2000055) outputs[0] region(36,16,16)
operator[9]: type(Weight) guid(2000057) outputs[0] region(40,18,18)
operator[10]: type(Weight) guid(2000059) outputs[0] region(46,21,21)
operator[11]: type(Weight) guid(2000061) outputs[0] region(50,23,23)
operator[12]: type(Weight) guid(2000063) outputs[0] region(56,26,26)
operator[13]: type(Weight) guid(2000065) outputs[0] region(60,28,28)
operator[14]: type(Weight) guid(2000068) outputs[0] region(66,31,31)
operator[15]: type(Weight) guid(2000070) outputs[0] region(70,33,33)
operator[16]: type(Weight) guid(2000072) outputs[0] region(76,36,36)
operator[17]: type(ArgMax) guid(2000074) inputs[0] region(80,38,38) outputs[0] region(82,39,39) outputs[1] region(84,40,40)
operator[0]: type(0) outputs[0] region(6,1,1)
operator[1]: type(1) outputs[0] region(8,2,2)
operator[2]: type(78) inputs[0] region(6,1,1) outputs[0] region(10,3,3) outputs[1] region(14,5,5) outputs[2] region(18,7,7) outputs[3] region(22,9,9) outputs[4] region(24,10,10) outputs[5] region(28,12,12) outputs[6] region(32,14,14) outputs[7] region(34,15,15) outputs[8] region(38,17,17) outputs[9] region(42,19,19) outputs[10] region(44,20,20) outputs[11] region(48,22,22) outputs[12] region(52,24,24) outputs[13] region(54,25,25) outputs[14] region(58,27,27) outputs[15] region(62,29,29) outputs[16] region(64,30,30) outputs[17] region(68,32,32) outputs[18] region(72,34,34) outputs[19] region(74,35,35) outputs[20] region(78,37,37) outputs[21] region(80,38,38)
operator[3]: type(1) outputs[0] region(12,4,4)
operator[4]: type(1) outputs[0] region(16,6,6)
operator[5]: type(1) outputs[0] region(20,8,8)
operator[6]: type(1) outputs[0] region(26,11,11)
operator[7]: type(1) outputs[0] region(30,13,13)
operator[8]: type(1) outputs[0] region(36,16,16)
operator[9]: type(1) outputs[0] region(40,18,18)
operator[10]: type(1) outputs[0] region(46,21,21)
operator[11]: type(1) outputs[0] region(50,23,23)
operator[12]: type(1) outputs[0] region(56,26,26)
operator[13]: type(1) outputs[0] region(60,28,28)
operator[14]: type(1) outputs[0] region(66,31,31)
operator[15]: type(1) outputs[0] region(70,33,33)
operator[16]: type(1) outputs[0] region(76,36,36)
operator[17]: type(91) inputs[0] region(80,38,38) outputs[0] region(82,39,39) outputs[1] region(84,40,40)
Loading weight file tok_embeddings_weight
Loading weight file layers_0_attention_norm_weight
Loading weight file layers_0_attention_wq_weight
Loading weight file layers_0_attention_wk_weight
Loading weight file layers_0_attention_wv_weight
Loading weight file layers_0_attention_wo_weight
Loading weight file layers_0_ffn_norm_weight
Loading weight file layers_0_feed_forward_w1_weight
Loading weight file layers_0_feed_forward_w3_weight
Loading weight file layers_0_feed_forward_w2_weight
Loading weight file layers_1_attention_norm_weight
Loading weight file layers_1_attention_wq_weight
Loading weight file layers_1_attention_wk_weight
Loading weight file layers_1_attention_wv_weight
Loading weight file layers_1_attention_wo_weight
Loading weight file layers_1_ffn_norm_weight
Loading weight file layers_1_feed_forward_w1_weight
Loading weight file layers_1_feed_forward_w3_weight
Loading weight file layers_1_feed_forward_w2_weight
Loading weight file norm_weight
Loading weight file output_weight
Cuda failure: 2 /usr/FlexFlow/src/runtime/initializer_kernel.cu:310
Aborting...
python: /usr/FlexFlow/src/runtime/initializer_kernel.cu:310: static void FlexFlow::ConstantInitializer::init_task(const Legion::Task*, const std::vector<Legion::PhysicalRegion>&, Legion::Context, Legion::Runtime*): Assertion `false' failed.
```
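For reference, CUDA runtime error code 2 is cudaErrorMemoryAllocation ("out of memory"), which suggests a device allocation issued around ConstantInitializer::init_task failed during weight initialization (though CUDA errors can surface at a later call than the one that caused them). A minimal sketch for decoding the numeric code, assuming libcudart.so is on the loader path (the exact library name may differ per install):

```python
import ctypes

# Load the CUDA runtime and decode error code 2 ("Cuda failure: 2" above).
# Assumes the CUDA runtime shared library is discoverable as libcudart.so.
cudart = ctypes.CDLL("libcudart.so")
cudart.cudaGetErrorString.restype = ctypes.c_char_p
cudart.cudaGetErrorString.argtypes = [ctypes.c_int]

print(cudart.cudaGetErrorString(2).decode())  # -> "out of memory"
```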

lockshaw commented 1 year ago

@goliaro @jiazhihao Any ideas?