Hi @JingyaHuang,
When weights are not inlined, there are some effects that can reduce performance:

- The `--auto-cast` compiler options no longer apply to the weights, because the compiler can no longer assume their data type. If you do not explicitly downcast the model weights yourself, the underlying model may consume an fp32 weight and then have to downcast it at runtime to fp16/bf16 for the subsequent auto-casted compute (see the sketch below).
- Weight separation applies to `nn.Parameter`s only, so masking tensors and scalars stay inlined, which may improve performance.

We can look into this specific model and see which of the above effects is causing poor performance.
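For illustration only, here is a minimal sketch of what "explicitly downcast the weights before compiling with non-inlined weights" could look like. It assumes a recent torch-neuronx where `torch_neuronx.trace` accepts `inline_weights_to_neff` and the `--auto-cast` compiler flag; the toy module, input shape, and file name are placeholders rather than the actual Stable Diffusion reproduction.

```python
import torch
import torch_neuronx

# Toy stand-in for the real SD submodule (e.g. the UNet); the point is the
# explicit bf16 downcast of the nn.Parameters before tracing.
model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.GELU()).eval()
model = model.to(torch.bfloat16)  # downcast weights ahead of compilation

example_inputs = torch.randn(1, 64, dtype=torch.bfloat16)

traced = torch_neuronx.trace(
    model,
    example_inputs,
    compiler_args="--auto-cast none",  # weights are already bf16, so no runtime fp32->bf16 casts
    inline_weights_to_neff=False,      # keep weights separated from the NEFF
)
torch.jit.save(traced, "model_non_inlined.pt")
```

If the weights are left in fp32, the runtime has to cast them on every invocation of the auto-casted compute, which is one of the effects listed above.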
Hi team,
The Optimum Neuron team observed quite a large difference in latency when models are compiled with non-inlined weights/NEFF.
TL;DR
The latency of non-inlined SD models is almost 3X that of inlined models.
Reproduction
Compilation
Inference
Results
We already place the weights on Neuron devices manually via this PR: https://github.com/huggingface/optimum-neuron/pull/584. Is there anything else we could or should do to improve latency while the weights/NEFF are not inlined? The current performance of non-inlined models is not encouraging.
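For context, this is roughly what manual weight placement looks like for a weight-separated trace. It is a sketch only: it assumes `torch_neuronx.move_trace_to_device(trace, device_id)` is the API used for this (treat the exact name and signature as an assumption), and it reuses the placeholder artifact and input shape from the sketch above.

```python
import torch
import torch_neuronx

# Load a weight-separated (non-inlined) artifact and move its weights onto a
# NeuronCore up front, instead of uploading them lazily at first inference.
traced = torch.jit.load("model_non_inlined.pt")
torch_neuronx.move_trace_to_device(traced, 0)  # 0 = NeuronCore index (assumed signature)

sample = torch.randn(1, 64, dtype=torch.bfloat16)  # illustrative input
output = traced(sample)
```

With placement done ahead of time, the remaining gap between inlined and non-inlined latency would point to the other effects mentioned above (weight layout/rearrangement at runtime, or runtime casting of fp32 weights).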