aws-neuron / aws-neuron-sdk

Powering AWS purpose-built machine learning chips. Blazing fast and cost effective, natively integrated into PyTorch and TensorFlow and integrated with your favorite AWS services
https://aws.amazon.com/machine-learning/neuron/

Model compilation stuck #730

Open andreadasilvabaudet opened 1 year ago

andreadasilvabaudet commented 1 year ago

Hi team,

I'm trying to compile a model on an inf1.6xlarge instance, but the compilation hangs and never finishes. These are the last lines of the compilation log:

08/21/2023 07:03:05 PM INFO 25700 [job.HHChecker.0]: Job finished
08/21/2023 07:03:05 PM INFO 25700 [pipeline.compile.0]: Finished job job.HHChecker.0 with state 0
08/21/2023 07:03:05 PM INFO 25700 [pipeline.compile.0]: Starting job job.WalrusDriver.3 state state 0
08/21/2023 07:03:05 PM INFO 25700 [job.WalrusDriver.3]: Replay this job by calling: /home/ec2-user/aws_neuron_venv_pytorch_inf1/bin/neuron-cc compile --framework TENSORFLOW --state '{"model": ["/home/ec2-user/Wav2Lip/neuron-compilation/0/graph_def.pb"], "tensormap": "tensor_map.json", "bir": "bir.json", "state_dir": "/home/ec2-user/Wav2Lip/neuron-compilation/0/sg00", "state_id": "sg00"}' --pipeline WalrusDriver
08/21/2023 07:03:05 PM INFO 25700 [job.WalrusDriver.3]: /home/ec2-user/aws_neuron_venv_pytorch_inf1/lib64/python3.7/site-packages/neuroncc/starfish/bin/walrus_driver --state-numerical-id=0 --optlevel 2 --allocator coloring --verbose 20 -o walrus_bir.out.json -i bir.json --min_split_size 10240 --skip_split_vns  --no_split_dram --max-partitions 16 --policy 2 --auxflag 0 --interleave none --internal-hyper-parameters /opt/ml/input/config/hyperparameters.json --tuning 2 --numcores 16 --enable_partitioner --unified-walrus-and-stargazer --tensor-map tensor_map.json --act-root-json /home/ec2-user/aws_neuron_venv_pytorch_inf1/lib64/python3.7/site-packages/neuroncc/pwp/pwp_bin_with_ln/act_info.json
08/21/2023 07:03:05 PM INFO [WalrusDriver.0]: max_allowed_parallelism=24
08/21/2023 07:03:05 PM INFO [WalrusDriver.0]: Running walrus pass: unroll
08/21/2023 07:03:05 PM INFO [WalrusDriver.0]: Input to unroll: modules=1 functions=1 allocs=423 blocks=1 instructions=99
08/21/2023 07:03:05 PM INFO [WalrusDriver.0]: INFO (Unroll) Start unrolling at Mon Aug 21 19:03:05 2023
08/21/2023 07:03:26 PM INFO [WalrusDriver.0]: INFO (Unroll) DONE unrolling Mon Aug 21 19:03:05 2023
08/21/2023 07:03:26 PM INFO [WalrusDriver.0]: Instruction count after Unroll: 
08/21/2023 07:03:26 PM INFO [WalrusDriver.0]: Total count: 1662675
08/21/2023 07:03:26 PM INFO [WalrusDriver.0]: Matmult: 1364342
08/21/2023 07:03:26 PM INFO [WalrusDriver.0]: TensorScalarPtr: 112740
08/21/2023 07:03:26 PM INFO [WalrusDriver.0]: Activation: 58930
08/21/2023 07:03:26 PM INFO [WalrusDriver.0]: Load: 40646
08/21/2023 07:03:26 PM INFO [WalrusDriver.0]: TensorCopy: 39676
08/21/2023 07:03:26 PM INFO [WalrusDriver.0]: TensorTensor: 32286
08/21/2023 07:03:26 PM INFO [WalrusDriver.0]: Memset: 9472
08/21/2023 07:03:26 PM INFO [WalrusDriver.0]: Save: 4583
08/21/2023 07:03:26 PM INFO [WalrusDriver.0]: ru_maxrss:  7888mb (delta=2619mb)
08/21/2023 07:03:26 PM INFO [WalrusDriver.0]: Walrus pass: unroll succeeded!
08/21/2023 07:03:26 PM INFO [WalrusDriver.0]: Output has 1 module(s), 1 function(s), 257136 memory location(s), 1 block(s), and 1662675 instruction(s).
08/21/2023 07:03:26 PM INFO [WalrusDriver.0]: Running walrus pass: birverifier
08/21/2023 07:03:26 PM INFO [WalrusDriver.0]: Input to birverifier: modules=1 functions=1 allocs=257136 blocks=1 instructions=1662675
08/21/2023 07:03:29 PM INFO [WalrusDriver.0]: ru_maxrss:  8461mb (delta=573mb)
08/21/2023 07:03:29 PM INFO [WalrusDriver.0]: Walrus pass: birverifier succeeded!
08/21/2023 07:03:29 PM INFO [WalrusDriver.0]: Output has 1 module(s), 1 function(s), 257136 memory location(s), 1 block(s), and 1662675 instruction(s).
08/21/2023 07:03:29 PM INFO [WalrusDriver.0]: Running walrus pass: vn_splitter
08/21/2023 07:03:29 PM INFO [WalrusDriver.0]: Input to vn_splitter: modules=1 functions=1 allocs=257136 blocks=1 instructions=1662675
08/21/2023 07:03:29 PM INFO [WalrusDriver.0]: INFO (VNSplitter) Collected all the internal vnodes: size = 3
08/21/2023 07:03:29 PM INFO [WalrusDriver.0]: INFO (VNSplitter::analyze) analyze vn = Sequential_50/Conv2d_5/Sequential_2/Conv2d_2/aten__convolution/transpose:0, (max_duplication_factor = 2, max_unsplit_size = 98304, min_split_size = 10240)
08/21/2023 07:03:29 PM INFO [WalrusDriver.0]: INFO (VNSplitter): Not splitting VN Sequential_50/Conv2d_5/Sequential_2/Conv2d_2/aten__convolution/transpose:0 because it's DRAM only.
08/21/2023 07:03:29 PM INFO [WalrusDriver.0]: INFO (VNSplitter::analyze) analyze vn = Sequential_54/Conv2dTranspose_3/Sequential_2/ConvTranspose2d_2/aten__convolution/transpose:0, (max_duplication_factor = 2, max_unsplit_size = 98304, min_split_size = 10240)
08/21/2023 07:03:29 PM INFO [WalrusDriver.0]: INFO (VNSplitter): Not splitting VN Sequential_54/Conv2dTranspose_3/Sequential_2/ConvTranspose2d_2/aten__convolution/transpose:0 because it's DRAM only.
08/21/2023 07:03:29 PM INFO [WalrusDriver.0]: INFO (VNSplitter::analyze) analyze vn = Sequential_66/Conv2d_3/Sequential_2/Conv2d_2/aten__convolution/transpose:0, (max_duplication_factor = 2, max_unsplit_size = 98304, min_split_size = 10240)
08/21/2023 07:03:29 PM INFO [WalrusDriver.0]: INFO (VNSplitter): Not splitting VN Sequential_66/Conv2d_3/Sequential_2/Conv2d_2/aten__convolution/transpose:0 because it's DRAM only.
08/21/2023 07:03:29 PM INFO [WalrusDriver.0]: INFO (VNSplitter) Done with analyze and splitting: total dead nodes = 0
08/21/2023 07:03:36 PM INFO [WalrusDriver.0]: number of penguin non-local-tensor caused reload left 7346
08/21/2023 07:03:36 PM INFO [WalrusDriver.0]: number of penguin non-local-tensor caused spill left 0
08/21/2023 07:03:36 PM INFO [WalrusDriver.0]: INFO (VNSplitter) Time: 0.019 seconds
08/21/2023 07:03:36 PM INFO [WalrusDriver.0]: INFO (VerticalFusion) Time: 3.498 seconds
08/21/2023 07:03:36 PM INFO [WalrusDriver.0]: INFO (ShrinkDN) Time: 3.095 seconds
08/21/2023 07:03:36 PM INFO [WalrusDriver.0]: INFO (LowerAC) Time: -0.001 seconds
08/21/2023 07:03:36 PM INFO [WalrusDriver.0]: ru_maxrss:  8488mb (delta=27mb)
08/21/2023 07:03:36 PM INFO [WalrusDriver.0]: Walrus pass: vn_splitter succeeded!
08/21/2023 07:03:37 PM INFO [WalrusDriver.0]: Output has 1 module(s), 1 function(s), 257136 memory location(s), 1 block(s), and 1662675 instruction(s).
08/21/2023 07:03:37 PM INFO [WalrusDriver.0]: Running walrus pass: lower_ac
08/21/2023 07:03:37 PM INFO [WalrusDriver.0]: Input to lower_ac: modules=1 functions=1 allocs=257136 blocks=1 instructions=1662675
08/21/2023 07:03:37 PM INFO [WalrusDriver.0]: INFO (LowerAC) Lowered 40646 loads, 4583 saves, 0 copies.
08/21/2023 07:03:37 PM INFO [WalrusDriver.0]: ru_maxrss:  8551mb (delta=63mb)
08/21/2023 07:03:37 PM INFO [WalrusDriver.0]: Walrus pass: lower_ac succeeded!
08/21/2023 07:03:37 PM INFO [WalrusDriver.0]: Output has 1 module(s), 1 function(s), 257136 memory location(s), 1 block(s), and 1662675 instruction(s).
08/21/2023 07:03:37 PM INFO [WalrusDriver.0]: Running walrus pass: pre_sched
08/21/2023 07:03:37 PM INFO [WalrusDriver.0]: Input to pre_sched: modules=1 functions=1 allocs=257136 blocks=1 instructions=1662675
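
For a long log like this one, a small script can report which walrus pass started without ever logging success, plus the peak `ru_maxrss` seen so far. The following is a minimal sketch that assumes only the line formats visible in the log above ("Running walrus pass: X", "Walrus pass: X succeeded!", "ru_maxrss: Nmb"); the function name `summarize_walrus_log` is made up for illustration and is not part of neuron-cc.

```python
import re

# Patterns are based only on the log lines quoted in this issue.
START = re.compile(r"Running walrus pass: (\w+)")
DONE = re.compile(r"Walrus pass: (\w+) succeeded!")
RSS = re.compile(r"ru_maxrss:\s+(\d+)mb")


def summarize_walrus_log(lines):
    """Return (finished_passes, unfinished_pass_or_None, peak_rss_mb)."""
    started, finished, peak = [], [], 0
    for line in lines:
        m = START.search(line)
        if m:
            started.append(m.group(1))
            continue
        m = DONE.search(line)
        if m:
            finished.append(m.group(1))
            continue
        m = RSS.search(line)
        if m:
            peak = max(peak, int(m.group(1)))
    # A pass that started but never reported success is where the log stops.
    pending = [p for p in started if p not in finished]
    return finished, (pending[-1] if pending else None), peak


# Last lines of the log above, with timestamps trimmed.
sample = [
    "INFO [WalrusDriver.0]: Running walrus pass: lower_ac",
    "INFO [WalrusDriver.0]: ru_maxrss:  8551mb (delta=63mb)",
    "INFO [WalrusDriver.0]: Walrus pass: lower_ac succeeded!",
    "INFO [WalrusDriver.0]: Running walrus pass: pre_sched",
]
finished, unfinished, peak = summarize_walrus_log(sample)
print(finished, unfinished, peak)
```

Applied to the full log, this would point at `pre_sched` as the pass that never completes, which matches where the output stops.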

I'm compiling using the following script:

import torch
import torch_neuron  # importing this registers torch.neuron

# `model` is the PyTorch module to compile; its definition and loading are not shown.
image = (torch.zeros([128, 1, 80, 16], dtype=torch.float32), torch.zeros([128, 6, 96, 96], dtype=torch.float32))

model_neuron = torch.neuron.trace(
    model, example_inputs=image, compiler_workdir='./neuron-compilation', verbose=1,
    dynamic_batch_size=True, compiler_args=['--neuroncore-pipeline-cores', '16'],
)

model_neuron.save("model.pt")

Thank you in advance.

jluntamazon commented 1 year ago

Hello,

Can you give us more information about the failing model?

If this is something we can attempt to reproduce on our end we will see if we can provide a fix. Ideally you can provide an open source model (or inline code) where this behavior is reproducible.

shreyazomato commented 1 year ago

I'm facing the same issue when trying to compile the VAE encoder of the SD 2.1 inpainting model. I can compile all the other parts of the model, but not the VAE encoder. I'm using an inf2.8xlarge.

[Two screenshots attached, taken 2023-09-05 at 5:08 PM]

recog-arch commented 1 year ago

Same problem here. Is there any solution to this so far?

recog-arch commented 1 year ago

@jluntamazon I can provide a minimal example if that helps

aws-donkrets commented 1 year ago

Hi recog-arch, the Stable Diffusion model is not supported on our inf1 architecture. However, we have done some work to get it running on our inf2 architecture. See https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/benchmarks/inf2/inf2-performance.html?highlight=stable%20diff. As a next step, I would suggest compiling your model on an inf2 instance.

mrnikwaws commented 1 year ago

We are still investigating this one on inf2, using the following test code:

import torch_neuronx
import torch

from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting",
    torch_dtype=torch.float32,
)

decoder = pipe.vae.decoder
decoder_in = torch.randn([1,4,64,96])

decoder_neuron = torch_neuronx.trace(
    decoder, decoder_in, compiler_workdir="decoder_compile", compiler_args="--verbose info"
)

Please let us know if this does not represent the problem use case.
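
When reproducing a hang like this, it can help to run the compile script in a child process with a time limit so that a stuck compilation is reported instead of blocking the shell indefinitely. The sketch below uses only the Python standard library; the helper name `run_with_timeout` and the script name `compile_decoder.py` are hypothetical, chosen for illustration.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return its exit code, or None if it exceeded timeout_s seconds."""
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child before raising this exception.
        return None


# Hypothetical usage: save the test code above as compile_decoder.py, then
#   code = run_with_timeout([sys.executable, "compile_decoder.py"], 3600)
#   print("timed out" if code is None else f"exit code {code}")
```

Only the direct child is killed on timeout. If the compiler spawns further processes of its own, those may need to be cleaned up separately.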