aws-neuron / aws-neuron-samples

Example code for AWS Neuron SDK developers building inference and training applications

Can't compile Stable Diffusion 2.1 512x512 for inference #10

Closed luiscape closed 1 year ago

luiscape commented 1 year ago

I am following the example notebook Stable Diffusion 2.1 512x512 but can't compile the model on an inf2.xlarge instance.

After a number of steps that compile correctly and report:

Compiler status PASS

I get an error message:

2023-05-04 19:31:29.000758: INFO ||NCC_WRAPPER||: Exiting with a successfully compiled graph
Traceback (most recent call last):
  File "/pkg/modal/_container_entrypoint.py", line 329, in handle_input_exception
    yield
  File "/pkg/modal/_container_entrypoint.py", line 402, in call_function_sync
    res = fun(*args, **kwargs)
  File "/root/sd_2_1_inf.py", line 152, in compile_model
    decoder_neuron = torch_neuronx.trace(
  File "/usr/local/lib/python3.9/site-packages/torch_neuronx/xla_impl/trace.py", line 309, in trace
    neff_filename = hlo_compile(model_dir, compiler_workdir, compiler_args)
  File "/usr/local/lib/python3.9/site-packages/torch_neuronx/xla_impl/trace.py", line 232, in hlo_compile
    raise RuntimeError(f'neuronx-cc failed with {status}')
RuntimeError: neuronx-cc failed with -9

Is this a known issue? What's the recommended setup in terms of library versions and instance types to be able to compile Stable Diffusion 2.1?
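
For reference, the step that raises the error is the VAE decoder trace. Below is a minimal sketch of what I'm running, assuming the standard diffusers and torch_neuronx APIs; the model id, latent shape, and compiler workdir are illustrative rather than copied from the notebook.

# Minimal sketch of the failing step (model id, shapes, and workdir are assumptions).
import torch
import torch_neuronx
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-base"  # 512x512 variant of SD 2.1
)

# For 512x512 images the VAE decoder consumes latents of shape (1, 4, 64, 64).
decoder = pipe.vae.decoder
decoder_input = torch.randn(1, 4, 64, 64)

# torch_neuronx.trace hands the lowered graph to the external neuronx-cc
# compiler; this is the call that exits with status -9 on inf2.xlarge.
decoder_neuron = torch_neuronx.trace(
    decoder,
    decoder_input,
    compiler_workdir="./vae_decoder_compile_dir",
)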

awsilya commented 1 year ago

@luiscape

This is likely caused by the instance running out of memory. I recommend trying an inf2.8xlarge instance or larger, since compiling this model consumes a large amount of memory. An inf2.xlarge may be sufficient to run inference with the model, but we have found that compilation requires more CPU memory.
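
For context, neuronx-cc runs as a separate process, and a status of -9 means it was terminated by signal 9 (SIGKILL), which on Linux typically comes from the out-of-memory killer. A rough pre-flight check of available host memory, as a sketch that assumes psutil is installed (the 32 GiB threshold is only a rule of thumb, not an official requirement):

# Sketch of a pre-flight host-memory check before attempting compilation.
import psutil

available_gib = psutil.virtual_memory().available / 1024**3
print(f"Available host memory: {available_gib:.1f} GiB")

# The 32 GiB cutoff is a rough assumption, not an official figure.
if available_gib < 32:
    print("Compiling the SD 2.1 UNet/VAE decoder may be OOM-killed on this host.")

An inf2.xlarge provides 16 GiB of host memory, while an inf2.8xlarge provides 128 GiB, which is why moving up an instance size avoids the kill during compilation.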

luiscape commented 1 year ago

@awsilya thank you! I'll give it a try.

aws-mvaria commented 1 year ago

Hi @luiscape, I am closing this issue, but feel free to reopen it if you require further support.