aws-neuron / aws-neuron-sdk


Compiler Crash on Differences in Slice Op #831

Open xanderdunn opened 4 months ago

xanderdunn commented 4 months ago

Graph: rust_hlo_last_pipedepth_forward_backward_18167585414663658369_rank_26.pb.zip

Compiler crash:

Script '"neuronx-cc" "compile" "/mnt/drive1/tmp/rust_hlo_last_pipedepth_forward_backward_18167585414663658369_rank_26.pb" "--framework=XLA" "--target=trn1" "--model-type=transformer" "--internal-hlo2tensorizer-options=--verify-hlo" "--auto-cast=none" "--output" "/mnt/drive1/tmp/last_pipedepth_forward_backward_18167585414663658369_rank_26_12925239939861740968pb_8460407994547285865.neff"' failed:
2024-02-13T01:15:56Z [TEN404] (_attn0.subtract.114) Internal tensorizer error - Please open a support ticket at https://github.com/aws-neuron/aws-neuron-sdk/issues/new
, stdout:
2024-02-13T01:15:38Z Compilation is optimized for best performance and compilation time. For faster compilation time please use -O1
2024-02-13T01:15:39Z Running DoNothing
2024-02-13T01:15:39Z DoNothing finished after 0.000 seconds
2024-02-13T01:15:39Z Running AliasDependencyInduction
2024-02-13T01:15:39Z AliasDependencyInduction finished after 0.001 seconds
2024-02-13T01:15:39Z Running CanonicalizeIR
2024-02-13T01:15:39Z CanonicalizeIR finished after 0.005 seconds
2024-02-13T01:15:39Z Running LegalizeCCOpLayout
2024-02-13T01:15:39Z LegalizeCCOpLayout finished after 0.006 seconds
2024-02-13T01:15:39Z Running ExpandBatchNorm
2024-02-13T01:15:39Z ExpandBatchNorm finished after 0.011 seconds
2024-02-13T01:15:39Z Running ResolveComplicatePredicates
2024-02-13T01:15:39Z ResolveComplicatePredicates finished after 0.008 seconds
2024-02-13T01:15:39Z Running AffinePredicateResolution
2024-02-13T01:15:39Z AffinePredicateResolution finished after 0.008 seconds
2024-02-13T01:15:39Z Running EliminateDivs
2024-02-13T01:15:39Z EliminateDivs finished after 0.006 seconds
2024-02-13T01:15:39Z Running PerfectLoopNest
2024-02-13T01:15:39Z PerfectLoopNest finished after 0.009 seconds
2024-02-13T01:15:39Z Running Simplifier
2024-02-13T01:15:39Z Simplifier finished after 0.067 seconds
2024-02-13T01:15:39Z Running GenericAccessSimplifier
2024-02-13T01:15:39Z GenericAccessSimplifier finished after 0.005 seconds
2024-02-13T01:15:39Z Running TCTransform
2024-02-13T01:15:39Z TCTransform finished after 0.010 seconds
2024-02-13T01:15:39Z Running CommuteConcat
2024-02-13T01:15:39Z CommuteConcat finished after 0.005 seconds
2024-02-13T01:15:39Z Running LowerTensorOp
2024-02-13T01:15:39Z LowerTensorOp finished after 0.075 seconds
2024-02-13T01:15:39Z Running TCTransform
2024-02-13T01:15:39Z TCTransform finished after 0.015 seconds
2024-02-13T01:15:39Z Running CanonicalizeIR
2024-02-13T01:15:39Z CanonicalizeIR finished after 0.013 seconds
2024-02-13T01:15:39Z Running TensorOpFusion
2024-02-13T01:15:39Z TensorOpFusion finished after 0.014 seconds
2024-02-13T01:15:39Z Running TensorOpTransform
2024-02-13T01:15:39Z TensorOpTransform finished after 0.049 seconds
2024-02-13T01:15:39Z Running LateLowerTensorOp
2024-02-13T01:15:39Z LateLowerTensorOp finished after 0.015 seconds
2024-02-13T01:15:39Z Running MemcpyElimination
2024-02-13T01:15:40Z MemcpyElimination finished after 0.327 seconds
2024-02-13T01:15:40Z Running LoopFusion
2024-02-13T01:15:40Z LoopFusion finished after 0.317 seconds
2024-02-13T01:15:40Z Running Simplifier
2024-02-13T01:15:40Z Simplifier finished after 0.058 seconds
2024-02-13T01:15:40Z Running Delinearization
2024-02-13T01:15:40Z Delinearization finished after 0.156 seconds
2024-02-13T01:15:40Z Running AliasDependencyElimination
2024-02-13T01:15:40Z AliasDependencyElimination finished after 0.007 seconds
2024-02-13T01:15:40Z Running DeadStoreElimination
2024-02-13T01:15:41Z DeadStoreElimination finished after 0.714 seconds
2024-02-13T01:15:41Z Running AliasDependencyInduction
2024-02-13T01:15:41Z AliasDependencyInduction finished after 0.001 seconds
2024-02-13T01:15:41Z Running Simplifier
2024-02-13T01:15:41Z Simplifier finished after 0.041 seconds
2024-02-13T01:15:41Z Running LICM
2024-02-13T01:15:41Z LICM finished after 0.016 seconds
2024-02-13T01:15:41Z Running Delinearization
2024-02-13T01:15:41Z Delinearization finished after 0.022 seconds
2024-02-13T01:15:41Z Running LoopFusion
2024-02-13T01:15:41Z LoopFusion finished after 0.159 seconds
2024-02-13T01:15:41Z Running SimplifySlice
2024-02-13T01:15:41Z SimplifySlice finished after 0.007 seconds
2024-02-13T01:15:41Z Running LICM
2024-02-13T01:15:41Z LICM finished after 0.011 seconds
2024-02-13T01:15:41Z Running Simplifier
2024-02-13T01:15:41Z Simplifier finished after 0.041 seconds
2024-02-13T01:15:41Z Running ValueNumbering
2024-02-13T01:15:41Z ValueNumbering finished after 0.020 seconds
2024-02-13T01:15:41Z Running LICM
2024-02-13T01:15:41Z LICM finished after 0.011 seconds
2024-02-13T01:15:41Z Running PadElimination
2024-02-13T01:15:41Z PadElimination finished after 0.000 seconds
2024-02-13T01:15:41Z Running Delinearization
2024-02-13T01:15:41Z Delinearization finished after 0.021 seconds
2024-02-13T01:15:41Z Running LoopFusion
2024-02-13T01:15:42Z LoopFusion finished after 0.150 seconds
2024-02-13T01:15:42Z Running GenericAccessSimplifier
2024-02-13T01:15:42Z GenericAccessSimplifier finished after 0.007 seconds
2024-02-13T01:15:42Z Running Simplifier
2024-02-13T01:15:42Z Simplifier finished after 0.020 seconds
2024-02-13T01:15:42Z Running LICM
2024-02-13T01:15:42Z LICM finished after 0.011 seconds
2024-02-13T01:15:42Z Running ValueNumbering
2024-02-13T01:15:42Z ValueNumbering finished after 0.016 seconds
2024-02-13T01:15:42Z Running TCTransform
2024-02-13T01:15:42Z TCTransform finished after 0.009 seconds
2024-02-13T01:15:42Z Running CommuteConcat
2024-02-13T01:15:42Z CommuteConcat finished after 0.008 seconds
2024-02-13T01:15:42Z Running RecognizeOpIdiom
2024-02-13T01:15:42Z RecognizeOpIdiom finished after 0.031 seconds
2024-02-13T01:15:42Z Running MaskPropagation
2024-02-13T01:15:42Z MaskPropagation finished after 0.030 seconds
2024-02-13T01:15:42Z Running DeadStoreElimination
2024-02-13T01:15:42Z DeadStoreElimination finished after 0.211 seconds
2024-02-13T01:15:42Z Running Recompute
2024-02-13T01:15:42Z Recompute finished after 0.001 seconds
2024-02-13T01:15:42Z Running DeadCodeElimination
2024-02-13T01:15:42Z DeadCodeElimination finished after 0.007 seconds
2024-02-13T01:15:42Z Running DoNothing
2024-02-13T01:15:42Z DoNothing finished after 0.000 seconds
2024-02-13T01:15:42Z Running MutateDataType
2024-02-13T01:15:42Z MutateDataType finished after 0.006 seconds
2024-02-13T01:15:42Z Running AutoCastTCInputs
2024-02-13T01:15:42Z AutoCastTCInputs finished after 0.010 seconds
2024-02-13T01:15:42Z Running GenericAccessSimplifier
2024-02-13T01:15:42Z GenericAccessSimplifier finished after 0.007 seconds
2024-02-13T01:15:42Z Running Simplifier
2024-02-13T01:15:42Z Simplifier finished after 0.020 seconds
2024-02-13T01:15:42Z Running AliasDependencyElimination
2024-02-13T01:15:42Z AliasDependencyElimination finished after 0.007 seconds
2024-02-13T01:15:42Z Running DelinearIndices
2024-02-13T01:15:42Z DelinearIndices finished after 0.342 seconds
2024-02-13T01:15:42Z Running Delinearization
2024-02-13T01:15:42Z Delinearization finished after 0.022 seconds
2024-02-13T01:15:42Z Running DelinearIndices
2024-02-13T01:15:42Z DelinearIndices finished after 0.062 seconds
2024-02-13T01:15:42Z Running DeadCodeElimination
2024-02-13T01:15:42Z DeadCodeElimination finished after 0.008 seconds
2024-02-13T01:15:42Z Running InferIntrinsicOnCC
2024-02-13T01:15:43Z InferIntrinsicOnCC finished after 0.093 seconds
2024-02-13T01:15:43Z Running ResolveAccessConflict
2024-02-13T01:15:43Z ResolveAccessConflict finished after 0.043 seconds
2024-02-13T01:15:43Z Running LICM
2024-02-13T01:15:43Z LICM finished after 0.014 seconds
2024-02-13T01:15:43Z Running LocalLayoutOpt
2024-02-13T01:15:43Z LocalLayoutOpt finished after 0.111 seconds
2024-02-13T01:15:43Z Running DelinearIndices
2024-02-13T01:15:43Z DelinearIndices finished after 0.066 seconds
2024-02-13T01:15:43Z Running PGLayoutTilingPipeline
2024-02-13T01:15:43Z Running PAGLayoutOpt
2024-02-13T01:15:43Z Running Delinearization
2024-02-13T01:15:43Z Delinearization finished after 0.023 seconds
2024-02-13T01:15:43Z PAGLayoutOpt finished after 0.440 seconds
2024-02-13T01:15:43Z Running MaskPropagation
2024-02-13T01:15:43Z MaskPropagation finished after 0.051 seconds
2024-02-13T01:15:43Z Running CanonicalizeDAGForPGTiling
2024-02-13T01:15:43Z CanonicalizeDAGForPGTiling finished after 0.034 seconds
2024-02-13T01:15:43Z Running PGTiling
2024-02-13T01:15:43Z Running AGOrderingAnalysisPass
2024-02-13T01:15:44Z AGOrderingAnalysisPass finished after 0.235 seconds
2024-02-13T01:15:44Z Running CuttingAndMacroGeneration
2024-02-13T01:15:44Z CuttingAndMacroGeneration finished after 0.563 seconds
2024-02-13T01:15:44Z PGTiling finished after 0.808 seconds
2024-02-13T01:15:44Z Running InsertIOTransposes
2024-02-13T01:15:44Z InsertIOTransposes finished after 0.089 seconds
2024-02-13T01:15:44Z PGLayoutTilingPipeline finished after 1.448 seconds
2024-02-13T01:15:44Z Running TilingProfiler
2024-02-13T01:15:44Z TilingProfiler finished after 0.087 seconds
2024-02-13T01:15:44Z Running FlattenMacroLoop
2024-02-13T01:15:45Z FlattenMacroLoop finished after 0.183 seconds
2024-02-13T01:15:45Z Running InferTongaTensor
2024-02-13T01:15:45Z InferTongaTensor finished after 0.319 seconds
2024-02-13T01:15:45Z Running TongaSimplifier
2024-02-13T01:15:45Z TongaSimplifier finished after 0.228 seconds
2024-02-13T01:15:45Z Running LICM
2024-02-13T01:15:45Z LICM finished after 0.022 seconds
2024-02-13T01:15:45Z Running RewriteReplicationMatmul
2024-02-13T01:15:45Z RewriteReplicationMatmul finished after 0.014 seconds
2024-02-13T01:15:45Z Running FlattenMacroLoop
2024-02-13T01:15:45Z FlattenMacroLoop finished after 0.048 seconds
2024-02-13T01:15:45Z Running SimplifyMacroPredicates
2024-02-13T01:15:45Z SimplifyMacroPredicates finished after 0.120 seconds
2024-02-13T01:15:45Z Running DataLocalityOpt
2024-02-13T01:15:46Z DataLocalityOpt finished after 0.502 seconds
2024-02-13T01:15:46Z Running TongaSimplifier
2024-02-13T01:15:46Z TongaSimplifier finished after 0.084 seconds
2024-02-13T01:15:46Z Running LegalizeSundaMacro
2024-02-13T01:15:46Z LegalizeSundaMacro finished after 0.042 seconds
2024-02-13T01:15:46Z Running TongaSimplifier
2024-02-13T01:15:46Z TongaSimplifier finished after 0.085 seconds
2024-02-13T01:15:46Z Running PerfectLoopNest
2024-02-13T01:15:46Z PerfectLoopNest finished after 0.016 seconds
2024-02-13T01:15:46Z Running FlattenMacroLoop
2024-02-13T01:15:46Z FlattenMacroLoop finished after 0.045 seconds
2024-02-13T01:15:46Z Running RewriteWeights
2024-02-13T01:15:46Z RewriteWeights finished after 0.019 seconds
2024-02-13T01:15:46Z Running ReshapeWeights
2024-02-13T01:15:46Z ReshapeWeights finished after 0.003 seconds
2024-02-13T01:15:46Z Running FlattenMacroLoop
2024-02-13T01:15:46Z FlattenMacroLoop finished after 0.018 seconds
2024-02-13T01:15:46Z Running SimplifyMacroPredicates
2024-02-13T01:15:46Z SimplifyMacroPredicates finished after 0.150 seconds
2024-02-13T01:15:46Z Running InferInitValue
2024-02-13T01:15:47Z InferInitValue finished after 0.777 seconds
2024-02-13T01:15:47Z Running TongaSimplifier
2024-02-13T01:15:47Z TongaSimplifier finished after 0.081 seconds
2024-02-13T01:15:47Z Running SimplifyTensor
2024-02-13T01:15:47Z SimplifyTensor finished after 0.069 seconds
2024-02-13T01:15:47Z Running LICM
2024-02-13T01:15:47Z LICM finished after 0.024 seconds
2024-02-13T01:15:47Z Running SundaISel
2024-02-13T01:15:48Z SundaISel finished after 0.289 seconds
2024-02-13T01:15:48Z Running LowerThorKernels
2024-02-13T01:15:48Z LowerThorKernels finished after 0.009 seconds
2024-02-13T01:15:48Z Running TongaLoopInterchange
2024-02-13T01:15:48Z TongaLoopInterchange finished after 0.009 seconds
2024-02-13T01:15:48Z Running TongaSimplifyPredicates
2024-02-13T01:15:48Z TongaSimplifyPredicates finished after 0.009 seconds
2024-02-13T01:15:48Z Running TongaLoopFusion
2024-02-13T01:15:48Z TongaLoopFusion finished after 0.282 seconds
2024-02-13T01:15:48Z Running TongaLoopInterchange
2024-02-13T01:15:48Z TongaLoopInterchange finished after 0.008 seconds
2024-02-13T01:15:48Z Running TongaLICM
2024-02-13T01:15:48Z TongaLICM finished after 0.036 seconds
2024-02-13T01:15:48Z Running FactorizeBlkDims
2024-02-13T01:15:48Z FactorizeBlkDims finished after 0.058 seconds
2024-02-13T01:15:48Z Running TongaInstComb
2024-02-13T01:15:50Z TongaInstComb finished after 2.147 seconds
2024-02-13T01:15:50Z Running TongaValueNumbering
2024-02-13T01:15:50Z TongaValueNumbering finished after 0.024 seconds
2024-02-13T01:15:50Z Running TongaInstComb
2024-02-13T01:15:52Z TongaInstComb finished after 1.929 seconds
2024-02-13T01:15:52Z Running VectorizeDMA
2024-02-13T01:15:52Z VectorizeDMA finished after 0.029 seconds
2024-02-13T01:15:52Z Running TongaSimplifyPredicates
2024-02-13T01:15:52Z TongaSimplifyPredicates finished after 0.007 seconds
2024-02-13T01:15:52Z Running LegalizePartitionReduce
2024-02-13T01:15:52Z LegalizePartitionReduce finished after 0.014 seconds
2024-02-13T01:15:52Z Running DeConcat
2024-02-13T01:15:52Z DeConcat finished after 0.012 seconds
2024-02-13T01:15:52Z Running PartialSimdFusion
2024-02-13T01:15:52Z PartialSimdFusion finished after 0.120 seconds
2024-02-13T01:15:52Z Running TritiumFusion
2024-02-13T01:15:52Z TritiumFusion finished after 0.027 seconds
2024-02-13T01:15:52Z Running CCOpFusion
2024-02-13T01:15:53Z CCOpFusion finished after 0.070 seconds
2024-02-13T01:15:53Z Running VectorizeMatMult
2024-02-13T01:15:53Z VectorizeMatMult finished after 0.003 seconds
2024-02-13T01:15:53Z Running PartialLoopFusion
2024-02-13T01:15:53Z PartialLoopFusion finished after 0.109 seconds
2024-02-13T01:15:53Z Running TongaLICM
2024-02-13T01:15:53Z TongaLICM finished after 0.031 seconds
2024-02-13T01:15:53Z Running LowerTranspose
2024-02-13T01:15:53Z LowerTranspose finished after 0.126 seconds
2024-02-13T01:15:53Z Running LateTongaInstComb
2024-02-13T01:15:53Z LateTongaInstComb finished after 0.143 seconds
2024-02-13T01:15:53Z Running SplitAccGrp
2024-02-13T01:15:53Z SplitAccGrp finished after 0.007 seconds
2024-02-13T01:15:53Z Running SpillPSum
2024-02-13T01:15:53Z SpillPSum finished after 0.136 seconds
2024-02-13T01:15:53Z Running LowerIntrinsics
2024-02-13T01:15:53Z LowerIntrinsics finished after 0.019 seconds
2024-02-13T01:15:53Z Running LegalizeType
2024-02-13T01:15:53Z LegalizeType finished after 0.018 seconds
2024-02-13T01:15:53Z Running TongaLICM
2024-02-13T01:15:53Z TongaLICM finished after 0.035 seconds
2024-02-13T01:15:53Z Running InferPSumTensor
2024-02-13T01:15:53Z InferPSumTensor finished after 0.236 seconds
2024-02-13T01:15:53Z Running WeightCoalescing
2024-02-13T01:15:54Z WeightCoalescing finished after 0.009 seconds
2024-02-13T01:15:54Z Running LegalizeSundaAccess
2024-02-13T01:15:54Z LegalizeSundaAccess finished after 0.074 seconds
2024-02-13T01:15:54Z Running TernaryFission
2024-02-13T01:15:54Z TernaryFission finished after 0.438 seconds
2024-02-13T01:15:54Z Running RelaxPredicates
2024-02-13T01:15:54Z RelaxPredicates finished after 0.017 seconds
2024-02-13T01:15:54Z Running TensorInitialization
2024-02-13T01:15:54Z TensorInitialization finished after 0.165 seconds
2024-02-13T01:15:54Z Running TongaSimplifyPredicates
2024-02-13T01:15:54Z TongaSimplifyPredicates finished after 0.031 seconds
2024-02-13T01:15:54Z Running ExpandISAMacro
2024-02-13T01:15:54Z ExpandISAMacro finished after 0.024 seconds
2024-02-13T01:15:54Z Running SimplifyTongaTensor
2024-02-13T01:15:54Z SimplifyTongaTensor finished after 0.062 seconds
2024-02-13T01:15:54Z Running DMALocalityOpt
2024-02-13T01:15:54Z DMALocalityOpt finished after 0.006 seconds
2024-02-13T01:15:54Z Running DataStreaming
2024-02-13T01:15:54Z DataStreaming finished after 0.030 seconds
2024-02-13T01:15:54Z Running SFKVectorizer
2024-02-13T01:15:56Z SFKVectorizer finished after 1.307 seconds
2024-02-13T01:15:56Z Running LateLegalizeInst
2024-02-13T01:15:56Z LateLegalizeInst finished after 0.014 seconds
2024-02-13T01:15:56Z Running CoalesceCCOp
2024-02-13T01:15:56Z CoalesceCCOp finished after 0.012 seconds
2024-02-13T01:15:56Z Running SimpleAllReduceTiling
2024-02-13T01:15:56Z SimpleAllReduceTiling finished after 0.011 seconds
2024-02-13T01:15:56Z Running StaticProfiler
2024-02-13T01:15:56Z StaticProfiler finished after 0.030 seconds
2024-02-13T01:15:56Z Running SplitAPUnionSets
2024-02-13T01:15:56Z SplitAPUnionSets finished after 0.091 seconds
2024-02-13T01:15:56Z Running DumpGraphAndMetadata
2024-02-13T01:15:56Z DumpGraphAndMetadata finished after 0.025 seconds
2024-02-13T01:15:56Z Running BirCodeGenLoop
2024-02-13T01:15:56Z BirCodeGenLoop finished after 0.020 seconds
root = /usr/lib/python3.8/multiprocessing/process.py
root = /usr/lib/python3.8/multiprocessing
root = /usr/lib/python3.8
root = /usr/lib
root = /usr
2024-02-13T01:15:56Z
2024-02-13T01:15:56Z Diagnostic information:
2024-02-13T01:15:56Z   NeuronX Compiler version 2.12.68.0+4480452af
2024-02-13T01:15:56Z
2024-02-13T01:15:56Z   Python version 3.8.10
2024-02-13T01:15:56Z   HWM version 2.12.0.0-422c9037c
2024-02-13T01:15:56Z   NumPy version 1.24.4
2024-02-13T01:15:56Z
2024-02-13T01:15:56Z   Running on AMI ami-01257e71ecb2f431c
2024-02-13T01:15:56Z   Running in region usw2-az4
2024-02-13T01:15:56Z
2024-02-13T01:15:56Z Diagnostic logs stored in /home/ubuntu/dev/Kholinar/xla/log-neuron-cc.txt
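
The pass log above is long, and the lines that matter are easy to miss. As a rough sketch (not part of neuronx-cc or of the original report), the Python helper below assumes the log keeps the `<timestamp> <Pass> finished after <seconds> seconds` and `[CODE] (instruction) message` line shapes seen here. The function name `summarize` and its return keys are hypothetical.

```python
import re

# Line shapes assumed from the log in this issue; other compiler versions may differ.
FINISHED = re.compile(r"^\S+ (\w+) finished after (\d+(?:\.\d+)?) seconds$")
ERROR = re.compile(r"\[([A-Z]+\d+)\] \(([^)]+)\) (.+)")


def summarize(log_text, top=3):
    """Collect finished passes and error lines from a neuronx-cc style log."""
    passes = []  # (pass name, seconds), in the order they finished
    errors = []  # (error code, instruction name)
    for line in log_text.splitlines():
        line = line.strip()
        finished = FINISHED.match(line)
        if finished:
            passes.append((finished.group(1), float(finished.group(2))))
            continue
        error = ERROR.search(line)
        if error:
            errors.append((error.group(1), error.group(2)))
    slowest = sorted(passes, key=lambda p: p[1], reverse=True)[:top]
    last_finished = passes[-1][0] if passes else None
    return {"last_finished": last_finished, "slowest": slowest, "errors": errors}


# A few lines copied from the log above (error message shortened).
sample = """\
2024-02-13T01:15:56Z [TEN404] (_attn0.subtract.114) Internal tensorizer error
2024-02-13T01:15:48Z Running TongaInstComb
2024-02-13T01:15:50Z TongaInstComb finished after 2.147 seconds
2024-02-13T01:15:56Z Running BirCodeGenLoop
2024-02-13T01:15:56Z BirCodeGenLoop finished after 0.020 seconds
"""

summary = summarize(sample)
print(summary)
```

The last pass reported as finished only bounds where the failure happened; it does not by itself identify the failing pass. The instruction named in the error line is usually the more useful starting point.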

For comparison, the equivalent graph for a neighboring rank compiles successfully: rust_hlo_last_pipedepth_forward_backward_15243717295037509163_rank_25.pb.zip

The only difference between the two graphs is the slice window. Working:

  bwd_attn0.slice.550 = f32[40,384]{1,0} slice(bwd_attn0.transpose.549), slice={[40:80], [0:384]}

Not working:

  bwd_attn0.slice.550 = f32[40,384]{1,0} slice(bwd_attn0.transpose.549), slice={[80:120], [0:384]}
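
To make the difference between the two instructions explicit, here is a small Python sketch that parses both lines and compares their windows. It assumes the HLO text form shown above, `slice={[start:limit], ...}` with an optional `:stride` suffix; `parse_slice` and `window_extent` are hypothetical helpers, not part of any Neuron or XLA tooling.

```python
import re

# Matches an HLO text slice instruction of the shape quoted in this issue.
SLICE = re.compile(
    r"(?P<name>[\w.]+) = (?P<dtype>\w+)\[(?P<shape>[\d,]*)\]\S* "
    r"slice\((?P<operand>[\w.]+)\), slice=\{(?P<windows>[^}]*)\}"
)
WINDOW = re.compile(r"\[(\d+):(\d+)(?::(\d+))?\]")


def parse_slice(line):
    """Return (output shape, [(start, limit, stride), ...]) for one slice line."""
    match = SLICE.search(line)
    if match is None:
        raise ValueError(f"not a slice instruction: {line!r}")
    shape = tuple(int(d) for d in match.group("shape").split(",") if d)
    windows = [
        (int(start), int(limit), int(stride) if stride else 1)
        for start, limit, stride in WINDOW.findall(match.group("windows"))
    ]
    return shape, windows


def window_extent(window):
    """Number of elements a (start, limit, stride) window selects."""
    start, limit, stride = window
    return len(range(start, limit, stride))


working = ("bwd_attn0.slice.550 = f32[40,384]{1,0} "
           "slice(bwd_attn0.transpose.549), slice={[40:80], [0:384]}")
failing = ("bwd_attn0.slice.550 = f32[40,384]{1,0} "
           "slice(bwd_attn0.transpose.549), slice={[80:120], [0:384]}")

w_shape, w_windows = parse_slice(working)
f_shape, f_windows = parse_slice(failing)

# Dimensions whose window differs between the two ranks.
differing = [
    (dim, w, f)
    for dim, (w, f) in enumerate(zip(w_windows, f_windows))
    if w != f
]
print("output shapes:", w_shape, f_shape)
print("differing windows:", differing)
```

Both windows select 40 rows and 384 columns, so the declared output shape `f32[40,384]` is consistent in each case; only the start offset along dimension 0 changes (40 versus 80).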

Installed Neuron packages and compiler version:

$ dpkg-query -W -f='${binary:Package} ${Version}\n' | grep '^aws-neuron'
aws-neuronx-collectives 2.19.7.0-530fb3064
aws-neuronx-dkms 2.15.9.0
aws-neuronx-gpsimd-customop-lib 0.9.1.0
aws-neuronx-gpsimd-tools 0.9.0.0-e7d693355
aws-neuronx-oci-hook 2.2.45.0
aws-neuronx-runtime-lib 2.19.5.0-97e2d271b
aws-neuronx-tools 2.16.1.0
ubuntu@trn2:~/dev/Kholinar/xla$ neuronx-cc --version
NeuronX Compiler version 2.12.68.0+4480452af

Python version 3.8.10
HWM version 2.12.0.0-422c9037c
NumPy version 1.24.4

Running on AMI ami-01257e71ecb2f431c
Running in region usw2-az4

aws-rhsoln commented 4 months ago

We were able to reproduce the error on our end and are now working on a fix.