xanderdunn opened this issue 4 months ago
Graph: rust_hlo_last_pipedepth_forward_backward_18167585414663658369_rank_26.pb.zip
Compiler crash:

```
Script '"neuronx-cc" "compile" "/mnt/drive1/tmp/rust_hlo_last_pipedepth_forward_backward_18167585414663658369_rank_26.pb" "--framework=XLA" "--target=trn1" "--model-type=transformer" "--internal-hlo2tensorizer-options=--verify-hlo" "--auto-cast=none" "--output" "/mnt/drive1/tmp/last_pipedepth_forward_backward_18167585414663658369_rank_26_12925239939861740968pb_8460407994547285865.neff"' failed: 2024-02-13T01:15:56Z [TEN404] (_attn0.subtract.114) Internal tensorizer error - Please open a support ticket at https://github.com/aws-neuron/aws-neuron-sdk/issues/new , stdout:
2024-02-13T01:15:38Z Compilation is optimized for best performance and compilation time. For faster compilation time please use -O1
2024-02-13T01:15:39Z Running DoNothing
2024-02-13T01:15:39Z DoNothing finished after 0.000 seconds
2024-02-13T01:15:39Z Running AliasDependencyInduction
2024-02-13T01:15:39Z AliasDependencyInduction finished after 0.001 seconds
2024-02-13T01:15:39Z Running CanonicalizeIR
2024-02-13T01:15:39Z CanonicalizeIR finished after 0.005 seconds
2024-02-13T01:15:39Z Running LegalizeCCOpLayout
2024-02-13T01:15:39Z LegalizeCCOpLayout finished after 0.006 seconds
2024-02-13T01:15:39Z Running ExpandBatchNorm
2024-02-13T01:15:39Z ExpandBatchNorm finished after 0.011 seconds
2024-02-13T01:15:39Z Running ResolveComplicatePredicates
2024-02-13T01:15:39Z ResolveComplicatePredicates finished after 0.008 seconds
2024-02-13T01:15:39Z Running AffinePredicateResolution
2024-02-13T01:15:39Z AffinePredicateResolution finished after 0.008 seconds
2024-02-13T01:15:39Z Running EliminateDivs
2024-02-13T01:15:39Z EliminateDivs finished after 0.006 seconds
2024-02-13T01:15:39Z Running PerfectLoopNest
2024-02-13T01:15:39Z PerfectLoopNest finished after 0.009 seconds
2024-02-13T01:15:39Z Running Simplifier
2024-02-13T01:15:39Z Simplifier finished after 0.067 seconds
2024-02-13T01:15:39Z Running GenericAccessSimplifier
2024-02-13T01:15:39Z GenericAccessSimplifier finished after 0.005 seconds
2024-02-13T01:15:39Z Running TCTransform
2024-02-13T01:15:39Z TCTransform finished after 0.010 seconds
2024-02-13T01:15:39Z Running CommuteConcat
2024-02-13T01:15:39Z CommuteConcat finished after 0.005 seconds
2024-02-13T01:15:39Z Running LowerTensorOp
2024-02-13T01:15:39Z LowerTensorOp finished after 0.075 seconds
2024-02-13T01:15:39Z Running TCTransform
2024-02-13T01:15:39Z TCTransform finished after 0.015 seconds
2024-02-13T01:15:39Z Running CanonicalizeIR
2024-02-13T01:15:39Z CanonicalizeIR finished after 0.013 seconds
2024-02-13T01:15:39Z Running TensorOpFusion
2024-02-13T01:15:39Z TensorOpFusion finished after 0.014 seconds
2024-02-13T01:15:39Z Running TensorOpTransform
2024-02-13T01:15:39Z TensorOpTransform finished after 0.049 seconds
2024-02-13T01:15:39Z Running LateLowerTensorOp
2024-02-13T01:15:39Z LateLowerTensorOp finished after 0.015 seconds
2024-02-13T01:15:39Z Running MemcpyElimination
2024-02-13T01:15:40Z MemcpyElimination finished after 0.327 seconds
2024-02-13T01:15:40Z Running LoopFusion
2024-02-13T01:15:40Z LoopFusion finished after 0.317 seconds
2024-02-13T01:15:40Z Running Simplifier
2024-02-13T01:15:40Z Simplifier finished after 0.058 seconds
2024-02-13T01:15:40Z Running Delinearization
2024-02-13T01:15:40Z Delinearization finished after 0.156 seconds
2024-02-13T01:15:40Z Running AliasDependencyElimination
2024-02-13T01:15:40Z AliasDependencyElimination finished after 0.007 seconds
2024-02-13T01:15:40Z Running DeadStoreElimination
2024-02-13T01:15:41Z DeadStoreElimination finished after 0.714 seconds
2024-02-13T01:15:41Z Running AliasDependencyInduction
2024-02-13T01:15:41Z AliasDependencyInduction finished after 0.001 seconds
2024-02-13T01:15:41Z Running Simplifier
2024-02-13T01:15:41Z Simplifier finished after 0.041 seconds
2024-02-13T01:15:41Z Running LICM
2024-02-13T01:15:41Z LICM finished after 0.016 seconds
2024-02-13T01:15:41Z Running Delinearization
2024-02-13T01:15:41Z Delinearization finished after 0.022 seconds
2024-02-13T01:15:41Z Running LoopFusion
2024-02-13T01:15:41Z LoopFusion finished after 0.159 seconds
2024-02-13T01:15:41Z Running SimplifySlice
2024-02-13T01:15:41Z SimplifySlice finished after 0.007 seconds
2024-02-13T01:15:41Z Running LICM
2024-02-13T01:15:41Z LICM finished after 0.011 seconds
2024-02-13T01:15:41Z Running Simplifier
2024-02-13T01:15:41Z Simplifier finished after 0.041 seconds
2024-02-13T01:15:41Z Running ValueNumbering
2024-02-13T01:15:41Z ValueNumbering finished after 0.020 seconds
2024-02-13T01:15:41Z Running LICM
2024-02-13T01:15:41Z LICM finished after 0.011 seconds
2024-02-13T01:15:41Z Running PadElimination
2024-02-13T01:15:41Z PadElimination finished after 0.000 seconds
2024-02-13T01:15:41Z Running Delinearization
2024-02-13T01:15:41Z Delinearization finished after 0.021 seconds
2024-02-13T01:15:41Z Running LoopFusion
2024-02-13T01:15:42Z LoopFusion finished after 0.150 seconds
2024-02-13T01:15:42Z Running GenericAccessSimplifier
2024-02-13T01:15:42Z GenericAccessSimplifier finished after 0.007 seconds
2024-02-13T01:15:42Z Running Simplifier
2024-02-13T01:15:42Z Simplifier finished after 0.020 seconds
2024-02-13T01:15:42Z Running LICM
2024-02-13T01:15:42Z LICM finished after 0.011 seconds
2024-02-13T01:15:42Z Running ValueNumbering
2024-02-13T01:15:42Z ValueNumbering finished after 0.016 seconds
2024-02-13T01:15:42Z Running TCTransform
2024-02-13T01:15:42Z TCTransform finished after 0.009 seconds
2024-02-13T01:15:42Z Running CommuteConcat
2024-02-13T01:15:42Z CommuteConcat finished after 0.008 seconds
2024-02-13T01:15:42Z Running RecognizeOpIdiom
2024-02-13T01:15:42Z RecognizeOpIdiom finished after 0.031 seconds
2024-02-13T01:15:42Z Running MaskPropagation
2024-02-13T01:15:42Z MaskPropagation finished after 0.030 seconds
2024-02-13T01:15:42Z Running DeadStoreElimination
2024-02-13T01:15:42Z DeadStoreElimination finished after 0.211 seconds
2024-02-13T01:15:42Z Running Recompute
2024-02-13T01:15:42Z Recompute finished after 0.001 seconds
2024-02-13T01:15:42Z Running DeadCodeElimination
2024-02-13T01:15:42Z DeadCodeElimination finished after 0.007 seconds
2024-02-13T01:15:42Z Running DoNothing
2024-02-13T01:15:42Z DoNothing finished after 0.000 seconds
2024-02-13T01:15:42Z Running MutateDataType
2024-02-13T01:15:42Z MutateDataType finished after 0.006 seconds
2024-02-13T01:15:42Z Running AutoCastTCInputs
2024-02-13T01:15:42Z AutoCastTCInputs finished after 0.010 seconds
2024-02-13T01:15:42Z Running GenericAccessSimplifier
2024-02-13T01:15:42Z GenericAccessSimplifier finished after 0.007 seconds
2024-02-13T01:15:42Z Running Simplifier
2024-02-13T01:15:42Z Simplifier finished after 0.020 seconds
2024-02-13T01:15:42Z Running AliasDependencyElimination
2024-02-13T01:15:42Z AliasDependencyElimination finished after 0.007 seconds
2024-02-13T01:15:42Z Running DelinearIndices
2024-02-13T01:15:42Z DelinearIndices finished after 0.342 seconds
2024-02-13T01:15:42Z Running Delinearization
2024-02-13T01:15:42Z Delinearization finished after 0.022 seconds
2024-02-13T01:15:42Z Running DelinearIndices
2024-02-13T01:15:42Z DelinearIndices finished after 0.062 seconds
2024-02-13T01:15:42Z Running DeadCodeElimination
2024-02-13T01:15:42Z DeadCodeElimination finished after 0.008 seconds
2024-02-13T01:15:42Z Running InferIntrinsicOnCC
2024-02-13T01:15:43Z InferIntrinsicOnCC finished after 0.093 seconds
2024-02-13T01:15:43Z Running ResolveAccessConflict
2024-02-13T01:15:43Z ResolveAccessConflict finished after 0.043 seconds
2024-02-13T01:15:43Z Running LICM
2024-02-13T01:15:43Z LICM finished after 0.014 seconds
2024-02-13T01:15:43Z Running LocalLayoutOpt
2024-02-13T01:15:43Z LocalLayoutOpt finished after 0.111 seconds
2024-02-13T01:15:43Z Running DelinearIndices
2024-02-13T01:15:43Z DelinearIndices finished after 0.066 seconds
2024-02-13T01:15:43Z Running PGLayoutTilingPipeline
2024-02-13T01:15:43Z Running PAGLayoutOpt
2024-02-13T01:15:43Z Running Delinearization
2024-02-13T01:15:43Z Delinearization finished after 0.023 seconds
2024-02-13T01:15:43Z PAGLayoutOpt finished after 0.440 seconds
2024-02-13T01:15:43Z Running MaskPropagation
2024-02-13T01:15:43Z MaskPropagation finished after 0.051 seconds
2024-02-13T01:15:43Z Running CanonicalizeDAGForPGTiling
2024-02-13T01:15:43Z CanonicalizeDAGForPGTiling finished after 0.034 seconds
2024-02-13T01:15:43Z Running PGTiling
2024-02-13T01:15:43Z Running AGOrderingAnalysisPass
2024-02-13T01:15:44Z AGOrderingAnalysisPass finished after 0.235 seconds
2024-02-13T01:15:44Z Running CuttingAndMacroGeneration
2024-02-13T01:15:44Z CuttingAndMacroGeneration finished after 0.563 seconds
2024-02-13T01:15:44Z PGTiling finished after 0.808 seconds
2024-02-13T01:15:44Z Running InsertIOTransposes
2024-02-13T01:15:44Z InsertIOTransposes finished after 0.089 seconds
2024-02-13T01:15:44Z PGLayoutTilingPipeline finished after 1.448 seconds
2024-02-13T01:15:44Z Running TilingProfiler
2024-02-13T01:15:44Z TilingProfiler finished after 0.087 seconds
2024-02-13T01:15:44Z Running FlattenMacroLoop
2024-02-13T01:15:45Z FlattenMacroLoop finished after 0.183 seconds
2024-02-13T01:15:45Z Running InferTongaTensor
2024-02-13T01:15:45Z InferTongaTensor finished after 0.319 seconds
2024-02-13T01:15:45Z Running TongaSimplifier
2024-02-13T01:15:45Z TongaSimplifier finished after 0.228 seconds
2024-02-13T01:15:45Z Running LICM
2024-02-13T01:15:45Z LICM finished after 0.022 seconds
2024-02-13T01:15:45Z Running RewriteReplicationMatmul
2024-02-13T01:15:45Z RewriteReplicationMatmul finished after 0.014 seconds
2024-02-13T01:15:45Z Running FlattenMacroLoop
2024-02-13T01:15:45Z FlattenMacroLoop finished after 0.048 seconds
2024-02-13T01:15:45Z Running SimplifyMacroPredicates
2024-02-13T01:15:45Z SimplifyMacroPredicates finished after 0.120 seconds
2024-02-13T01:15:45Z Running DataLocalityOpt
2024-02-13T01:15:46Z DataLocalityOpt finished after 0.502 seconds
2024-02-13T01:15:46Z Running TongaSimplifier
2024-02-13T01:15:46Z TongaSimplifier finished after 0.084 seconds
2024-02-13T01:15:46Z Running LegalizeSundaMacro
2024-02-13T01:15:46Z LegalizeSundaMacro finished after 0.042 seconds
2024-02-13T01:15:46Z Running TongaSimplifier
2024-02-13T01:15:46Z TongaSimplifier finished after 0.085 seconds
2024-02-13T01:15:46Z Running PerfectLoopNest
2024-02-13T01:15:46Z PerfectLoopNest finished after 0.016 seconds
2024-02-13T01:15:46Z Running FlattenMacroLoop
2024-02-13T01:15:46Z FlattenMacroLoop finished after 0.045 seconds
2024-02-13T01:15:46Z Running RewriteWeights
2024-02-13T01:15:46Z RewriteWeights finished after 0.019 seconds
2024-02-13T01:15:46Z Running ReshapeWeights
2024-02-13T01:15:46Z ReshapeWeights finished after 0.003 seconds
2024-02-13T01:15:46Z Running FlattenMacroLoop
2024-02-13T01:15:46Z FlattenMacroLoop finished after 0.018 seconds
2024-02-13T01:15:46Z Running SimplifyMacroPredicates
2024-02-13T01:15:46Z SimplifyMacroPredicates finished after 0.150 seconds
2024-02-13T01:15:46Z Running InferInitValue
2024-02-13T01:15:47Z InferInitValue finished after 0.777 seconds
2024-02-13T01:15:47Z Running TongaSimplifier
2024-02-13T01:15:47Z TongaSimplifier finished after 0.081 seconds
2024-02-13T01:15:47Z Running SimplifyTensor
2024-02-13T01:15:47Z SimplifyTensor finished after 0.069 seconds
2024-02-13T01:15:47Z Running LICM
2024-02-13T01:15:47Z LICM finished after 0.024 seconds
2024-02-13T01:15:47Z Running SundaISel
2024-02-13T01:15:48Z SundaISel finished after 0.289 seconds
2024-02-13T01:15:48Z Running LowerThorKernels
2024-02-13T01:15:48Z LowerThorKernels finished after 0.009 seconds
2024-02-13T01:15:48Z Running TongaLoopInterchange
2024-02-13T01:15:48Z TongaLoopInterchange finished after 0.009 seconds
2024-02-13T01:15:48Z Running TongaSimplifyPredicates
2024-02-13T01:15:48Z TongaSimplifyPredicates finished after 0.009 seconds
2024-02-13T01:15:48Z Running TongaLoopFusion
2024-02-13T01:15:48Z TongaLoopFusion finished after 0.282 seconds
2024-02-13T01:15:48Z Running TongaLoopInterchange
2024-02-13T01:15:48Z TongaLoopInterchange finished after 0.008 seconds
2024-02-13T01:15:48Z Running TongaLICM
2024-02-13T01:15:48Z TongaLICM finished after 0.036 seconds
2024-02-13T01:15:48Z Running FactorizeBlkDims
2024-02-13T01:15:48Z FactorizeBlkDims finished after 0.058 seconds
2024-02-13T01:15:48Z Running TongaInstComb
2024-02-13T01:15:50Z TongaInstComb finished after 2.147 seconds
2024-02-13T01:15:50Z Running TongaValueNumbering
2024-02-13T01:15:50Z TongaValueNumbering finished after 0.024 seconds
2024-02-13T01:15:50Z Running TongaInstComb
2024-02-13T01:15:52Z TongaInstComb finished after 1.929 seconds
2024-02-13T01:15:52Z Running VectorizeDMA
2024-02-13T01:15:52Z VectorizeDMA finished after 0.029 seconds
2024-02-13T01:15:52Z Running TongaSimplifyPredicates
2024-02-13T01:15:52Z TongaSimplifyPredicates finished after 0.007 seconds
2024-02-13T01:15:52Z Running LegalizePartitionReduce
2024-02-13T01:15:52Z LegalizePartitionReduce finished after 0.014 seconds
2024-02-13T01:15:52Z Running DeConcat
2024-02-13T01:15:52Z DeConcat finished after 0.012 seconds
2024-02-13T01:15:52Z Running PartialSimdFusion
2024-02-13T01:15:52Z PartialSimdFusion finished after 0.120 seconds
2024-02-13T01:15:52Z Running TritiumFusion
2024-02-13T01:15:52Z TritiumFusion finished after 0.027 seconds
2024-02-13T01:15:52Z Running CCOpFusion
2024-02-13T01:15:53Z CCOpFusion finished after 0.070 seconds
2024-02-13T01:15:53Z Running VectorizeMatMult
2024-02-13T01:15:53Z VectorizeMatMult finished after 0.003 seconds
2024-02-13T01:15:53Z Running PartialLoopFusion
2024-02-13T01:15:53Z PartialLoopFusion finished after 0.109 seconds
2024-02-13T01:15:53Z Running TongaLICM
2024-02-13T01:15:53Z TongaLICM finished after 0.031 seconds
2024-02-13T01:15:53Z Running LowerTranspose
2024-02-13T01:15:53Z LowerTranspose finished after 0.126 seconds
2024-02-13T01:15:53Z Running LateTongaInstComb
2024-02-13T01:15:53Z LateTongaInstComb finished after 0.143 seconds
2024-02-13T01:15:53Z Running SplitAccGrp
2024-02-13T01:15:53Z SplitAccGrp finished after 0.007 seconds
2024-02-13T01:15:53Z Running SpillPSum
2024-02-13T01:15:53Z SpillPSum finished after 0.136 seconds
2024-02-13T01:15:53Z Running LowerIntrinsics
2024-02-13T01:15:53Z LowerIntrinsics finished after 0.019 seconds
2024-02-13T01:15:53Z Running LegalizeType
2024-02-13T01:15:53Z LegalizeType finished after 0.018 seconds
2024-02-13T01:15:53Z Running TongaLICM
2024-02-13T01:15:53Z TongaLICM finished after 0.035 seconds
2024-02-13T01:15:53Z Running InferPSumTensor
2024-02-13T01:15:53Z InferPSumTensor finished after 0.236 seconds
2024-02-13T01:15:53Z Running WeightCoalescing
2024-02-13T01:15:54Z WeightCoalescing finished after 0.009 seconds
2024-02-13T01:15:54Z Running LegalizeSundaAccess
2024-02-13T01:15:54Z LegalizeSundaAccess finished after 0.074 seconds
2024-02-13T01:15:54Z Running TernaryFission
2024-02-13T01:15:54Z TernaryFission finished after 0.438 seconds
2024-02-13T01:15:54Z Running RelaxPredicates
2024-02-13T01:15:54Z RelaxPredicates finished after 0.017 seconds
2024-02-13T01:15:54Z Running TensorInitialization
2024-02-13T01:15:54Z TensorInitialization finished after 0.165 seconds
2024-02-13T01:15:54Z Running TongaSimplifyPredicates
2024-02-13T01:15:54Z TongaSimplifyPredicates finished after 0.031 seconds
2024-02-13T01:15:54Z Running ExpandISAMacro
2024-02-13T01:15:54Z ExpandISAMacro finished after 0.024 seconds
2024-02-13T01:15:54Z Running SimplifyTongaTensor
2024-02-13T01:15:54Z SimplifyTongaTensor finished after 0.062 seconds
2024-02-13T01:15:54Z Running DMALocalityOpt
2024-02-13T01:15:54Z DMALocalityOpt finished after 0.006 seconds
2024-02-13T01:15:54Z Running DataStreaming
2024-02-13T01:15:54Z DataStreaming finished after 0.030 seconds
2024-02-13T01:15:54Z Running SFKVectorizer
2024-02-13T01:15:56Z SFKVectorizer finished after 1.307 seconds
2024-02-13T01:15:56Z Running LateLegalizeInst
2024-02-13T01:15:56Z LateLegalizeInst finished after 0.014 seconds
2024-02-13T01:15:56Z Running CoalesceCCOp
2024-02-13T01:15:56Z CoalesceCCOp finished after 0.012 seconds
2024-02-13T01:15:56Z Running SimpleAllReduceTiling
2024-02-13T01:15:56Z SimpleAllReduceTiling finished after 0.011 seconds
2024-02-13T01:15:56Z Running StaticProfiler
2024-02-13T01:15:56Z StaticProfiler finished after 0.030 seconds
2024-02-13T01:15:56Z Running SplitAPUnionSets
2024-02-13T01:15:56Z SplitAPUnionSets finished after 0.091 seconds
2024-02-13T01:15:56Z Running DumpGraphAndMetadata
2024-02-13T01:15:56Z DumpGraphAndMetadata finished after 0.025 seconds
2024-02-13T01:15:56Z Running BirCodeGenLoop
2024-02-13T01:15:56Z BirCodeGenLoop finished after 0.020 seconds
root = /usr/lib/python3.8/multiprocessing/process.py
root = /usr/lib/python3.8/multiprocessing
root = /usr/lib/python3.8
root = /usr/lib
root = /usr
2024-02-13T01:15:56Z
2024-02-13T01:15:56Z Diagnostic information:
2024-02-13T01:15:56Z NeuronX Compiler version 2.12.68.0+4480452af
2024-02-13T01:15:56Z
2024-02-13T01:15:56Z Python version 3.8.10
2024-02-13T01:15:56Z HWM version 2.12.0.0-422c9037c
2024-02-13T01:15:56Z NumPy version 1.24.4
2024-02-13T01:15:56Z
2024-02-13T01:15:56Z Running on AMI ami-01257e71ecb2f431c
2024-02-13T01:15:56Z Running in region usw2-az4
2024-02-13T01:15:56Z
2024-02-13T01:15:56Z Diagnostic logs stored in /home/ubuntu/dev/Kholinar/xla/log-neuron-cc.txt
```
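Because the crash report arrives as one long stdout stream, it can help to pull out just the error code, the HLO instruction named in the error, and the last pass that reported completion. The sketch below assumes only the line shapes visible in the log above (`<timestamp> <Pass> finished after <n> seconds` and `[CODE] (<instruction>)`); `summarize` is a hypothetical helper, not part of the Neuron SDK, and "last pass that finished" is not necessarily the pass that raised the error.

```python
import re
from typing import Optional, Tuple

# Matches "<timestamp> <PassName> finished after <seconds>" entries in the stdout log.
FINISHED = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z (\w+) finished after ([\d.]+)")
# Matches the error code and instruction name, e.g. "[TEN404] (_attn0.subtract.114)".
ERROR = re.compile(r"\[([A-Z]+\d+)\] \(([^)]+)\)")


def summarize(log: str) -> dict:
    """Summarize a flattened neuronx-cc crash message (hypothetical helper)."""
    passes = [(name, float(secs)) for name, secs in FINISHED.findall(log)]
    err = ERROR.search(log)
    slowest: Optional[Tuple[str, float]] = (
        max(passes, key=lambda p: p[1]) if passes else None
    )
    return {
        "error_code": err.group(1) if err else None,
        "instruction": err.group(2) if err else None,
        "last_finished": passes[-1][0] if passes else None,
        "slowest": slowest,
    }
```

Applied to the crash message above, this should report `TEN404`, `_attn0.subtract.114`, and `BirCodeGenLoop` as the last pass to print a completion line.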
For comparison, the same graph for a neighboring rank compiles successfully: rust_hlo_last_pipedepth_forward_backward_15243717295037509163_rank_25.pb.zip
The only difference is the slice window. Working (rank 25):

```
bwd_attn0.slice.550 = f32[40,384]{1,0} slice(bwd_attn0.transpose.549), slice={[40:80], [0:384]}
```

Not working (rank 26):

```
bwd_attn0.slice.550 = f32[40,384]{1,0} slice(bwd_attn0.transpose.549), slice={[80:120], [0:384]}
```
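In HLO text, each `[start:limit]` entry of a slice window is a half-open range with an implicit stride of 1, so both windows select 40 rows and all 384 columns: the result type `f32[40,384]` is identical and only the row offset differs. The sketch below computes the result shape from the window text. `slice_result_shape` is a hypothetical helper, and the operand shape `(160, 384)` is an assumption for illustration only, since the shape of `bwd_attn0.transpose.549` is not shown in the excerpt.

```python
import re
from typing import Sequence, Tuple

# One "[start:limit]" or "[start:limit:stride]" entry per dimension of an HLO slice window.
DIM = re.compile(r"\[(\d+):(\d+)(?::(\d+))?\]")


def slice_result_shape(operand_shape: Sequence[int], window: str) -> Tuple[int, ...]:
    """Result shape of an HLO slice given its window text (hypothetical helper)."""
    dims = DIM.findall(window)
    if len(dims) != len(operand_shape):
        raise ValueError("window rank does not match operand rank")
    result = []
    for size, (start, limit, stride) in zip(operand_shape, dims):
        lo, hi = int(start), int(limit)
        step = int(stride) if stride else 1  # stride defaults to 1 when omitted
        if not 0 <= lo <= hi <= size:
            raise ValueError(f"[{lo}:{hi}] is out of bounds for a dimension of size {size}")
        result.append((hi - lo + step - 1) // step)  # ceil((limit - start) / stride)
    return tuple(result)


# ASSUMED operand shape; the real shape of bwd_attn0.transpose.549 is in the attached graphs.
ASSUMED_OPERAND = (160, 384)
print(slice_result_shape(ASSUMED_OPERAND, "{[40:80], [0:384]}"))   # rank 25 -> (40, 384)
print(slice_result_shape(ASSUMED_OPERAND, "{[80:120], [0:384]}"))  # rank 26 -> (40, 384)
```

Since the shapes match, the shape check alone cannot distinguish the two ranks; whatever differs lies in how the compiler handles the offset.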
```
$ dpkg-query -W -f='${binary:Package} ${Version}\n' | grep '^aws-neuron'
aws-neuronx-collectives 2.19.7.0-530fb3064
aws-neuronx-dkms 2.15.9.0
aws-neuronx-gpsimd-customop-lib 0.9.1.0
aws-neuronx-gpsimd-tools 0.9.0.0-e7d693355
aws-neuronx-oci-hook 2.2.45.0
aws-neuronx-runtime-lib 2.19.5.0-97e2d271b
aws-neuronx-tools 2.16.1.0
ubuntu@trn2:~/dev/Kholinar/xla$ neuronx-cc --version
NeuronX Compiler version 2.12.68.0+4480452af
Python version 3.8.10
HWM version 2.12.0.0-422c9037c
NumPy version 1.24.4
Running on AMI ami-01257e71ecb2f431c
Running in region usw2-az4
```
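To rerun the rank 25 versus rank 26 comparison with identical settings, a small driver can call `neuronx-cc compile` on each graph using the exact flags from the failing command in this report. This is a sketch under stated assumptions: both `.pb` files are assumed to be unzipped into the current directory, and the `neff_out` directory and `.neff` file names are hypothetical choices, not the paths used in the original run.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List

# Flags copied verbatim from the failing neuronx-cc invocation in this report.
FLAGS = [
    "--framework=XLA",
    "--target=trn1",
    "--model-type=transformer",
    "--internal-hlo2tensorizer-options=--verify-hlo",
    "--auto-cast=none",
]

# ASSUMED locations of the two attached graphs after unzipping.
GRAPHS = [
    Path("rust_hlo_last_pipedepth_forward_backward_15243717295037509163_rank_25.pb"),
    Path("rust_hlo_last_pipedepth_forward_backward_18167585414663658369_rank_26.pb"),
]


def build_command(graph: Path, out_dir: Path) -> List[str]:
    """Build the neuronx-cc argv for one graph; the .neff name is a hypothetical choice."""
    neff = out_dir / (graph.stem + ".neff")
    return ["neuronx-cc", "compile", str(graph), *FLAGS, "--output", str(neff)]


def main() -> None:
    if shutil.which("neuronx-cc") is None:
        print("neuronx-cc not found on PATH; skipping compilation")
        return
    out_dir = Path("neff_out")
    out_dir.mkdir(exist_ok=True)
    for graph in GRAPHS:
        if not graph.exists():
            print(f"{graph}: not found, skipping")
            continue
        result = subprocess.run(build_command(graph, out_dir), capture_output=True, text=True)
        status = "ok" if result.returncode == 0 else f"failed (exit {result.returncode})"
        print(f"{graph.name}: {status}")


if __name__ == "__main__":
    main()
```

Keeping the flags in one list ensures the two ranks differ only in the input graph, which is the condition the comparison relies on.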
We were able to reproduce the error on our end and are now working on the fix.