aws-neuron / aws-neuron-samples

Example code for AWS Neuron SDK developers building inference and training applications

Can't compile SD2.1 VAE with Batch Input #11

Open furkancoskun opened 1 year ago

furkancoskun commented 1 year ago

I changed the batch sizes of the trace tensor inputs in the hf_pretrained_sd2_512_inference.ipynb notebook. Although the text encoder, UNet, and vae_post_quant_conv compiled successfully, the VAE decoder did not.

batch=2

import torch_neuronx
from diffusers import StableDiffusionPipeline
import torch
import os, copy

COMPILER_WORKDIR_ROOT = 'sd2_compile_dir_512_batch2'
model_id = "stabilityai/stable-diffusion-2-1-base"

# Load the pipeline, keep a copy of the VAE decoder, and free the rest
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float32)
decoder = copy.deepcopy(pipe.vae.decoder)
del pipe

# Example input with batch size 2 (the notebook's default is batch size 1)
decoder_in = torch.randn([2, 4, 64, 64])
decoder_neuron = torch_neuronx.trace(
    decoder,
    decoder_in,
    compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, 'vae_decoder'),
)

# Save the compiled decoder
decoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'vae_decoder/model.pt')
torch.jit.save(decoder_neuron, decoder_filename)

del decoder
del decoder_neuron

I get this error message:

Fetching 13 files: 100%|██████████| 13/13 [00:00<00:00, 234016.96it/s]
[compiler progress output ("Selecting N allocations ...") omitted]
2023-05-22T06:51:17Z WARNING 28201 [SB_Allocator]: couldn't allocate every tensor in SB
2023-05-22T06:51:17Z WARNING 28201 [SB_Allocator]: disabling special handling of accumulation groups
2023-05-22T06:51:39Z FATAL 28201 [SB_Allocator]: couldn't allocate every tensor in SB and spilling can't help
2023-05-22T06:51:39Z FATAL 28201 [SB_Allocator]: 10 biggest memlocs:
2023-05-22T06:51:39Z FATAL 28201 [SB_Allocator]: mhlo_add_312_pftranspose_5198_i6_ReloadStore32338_ReloadStore166495 65536
2023-05-22T06:51:39Z FATAL 28201 [SB_Allocator]: mhlo_add_312_pftranspose_5198_i0_ReloadStore32560_Remat_166496 65536
2023-05-22T06:51:39Z FATAL 28201 [SB_Allocator]: mhlo_add_294_i5_ReloadStore32107_Remat_166430 65536
2023-05-22T06:51:39Z FATAL 28201 [SB_Allocator]: mhlo_add_259_i0 65536
2023-05-22T06:51:39Z FATAL 28201 [SB_Allocator]: mhlo_add_294_i7 65536
2023-05-22T06:51:39Z FATAL 28201 [SB_Allocator]: mhlo_add_294_i1 65536
2023-05-22T06:51:39Z FATAL 28201 [SB_Allocator]: mhlo_add_294_i6 65536
2023-05-22T06:51:39Z FATAL 28201 [SB_Allocator]: mhlo_add_294_i5_ReloadStore32107_Remat_166431 65536
2023-05-22T06:51:39Z FATAL 28201 [SB_Allocator]: mhlo_add_294_i7_ReloadStore32024_Remat_121920_Remat_166327 65536

I used an inf2.8xlarge instance and set up 100 GB of swap space. Any ideas about this batch-input compilation problem?

jyang-aws commented 1 year ago

Hi furkancoskun, thanks for reporting the issue. We'll try to reproduce it and look into it. Just to confirm: does the issue show up with the latest Neuron SDK, 2.10?

furkancoskun commented 1 year ago

Yes, the issue shows up in 2.10.

aws-mvaria commented 1 year ago

Hi @furkancoskun, we have reproduced the issue and are looking into a fix for a future release. In the meantime, you can continue to use batch=1.

If you are looking to use higher batch sizes to improve performance, note that our batch=1 configuration is expected to be performant. We will continue to improve batch=1 performance and will support larger batch sizes in future releases.
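
For reference, a minimal sketch of the batch=1 workaround, assuming the decoder has been re-traced and saved with a batch=1 example input (torch.randn([1, 4, 64, 64]), as in the original notebook). The workdir path and the decode_batched helper below are illustrative, not part of the sample:

import os
import torch
import torch_neuronx  # importing this registers the Neuron ops needed to load the traced model

COMPILER_WORKDIR_ROOT = 'sd2_compile_dir_512'  # assumed batch=1 workdir
decoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'vae_decoder/model.pt')
decoder_neuron = torch.jit.load(decoder_filename)

def decode_batched(latents):
    # Split the [N, 4, 64, 64] latent batch into N batch=1 tensors,
    # decode each with the batch=1 compiled decoder, and re-stack.
    outputs = [decoder_neuron(sample) for sample in latents.split(1, dim=0)]
    return torch.cat(outputs, dim=0)

latents = torch.randn([2, 4, 64, 64])
images = decode_batched(latents)  # shape [2, 3, 512, 512]

Decoding sample by sample trades some latency for a model that compiles today; throughput-oriented setups can additionally spread the per-sample calls across NeuronCores.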