aws-neuron / aws-neuron-sdk

Powering AWS purpose-built machine learning chips. Blazing fast and cost-effective, natively integrated into PyTorch and TensorFlow, and integrated with your favorite AWS services
https://aws.amazon.com/machine-learning/neuron/

Significantly increased latency with weights/NEFF separated #905

Open JingyaHuang opened 2 weeks ago

JingyaHuang commented 2 weeks ago

Hi team,

The Optimum Neuron team observed significantly higher latency when the model is compiled with non-inlined weights/NEFF.

TL;DR

The latency of non-inlined SD models is almost 3x that of inlined models.

Reproduction

Compilation

from optimum.neuron import NeuronStableDiffusionPipeline

model_id = "stabilityai/stable-diffusion-2-1-base"
compiler_args = {"auto_cast": "matmul", "auto_cast_type": "bf16"}
input_shapes = {"batch_size": 1, "height": 512, "width": 512}

# Export and compile the pipeline with the weights kept separate from the NEFF
stable_diffusion = NeuronStableDiffusionPipeline.from_pretrained(
    model_id,
    export=True,
    disable_neuron_cache=True,
    inline_weights_to_neff=False,
    # inline_weights_to_neff=True,
    # compiler_workdir="sd_intermediate",
    **compiler_args,
    **input_shapes,
)
save_directory = "sd21_neuron_matmul_bf16_non_inlined/"
stable_diffusion.save_pretrained(save_directory)
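For the inlined baseline used in the comparison, the same export can be re-run with the flag flipped (a minimal sketch; the output directory name here is just an example):

stable_diffusion_inlined = NeuronStableDiffusionPipeline.from_pretrained(
    model_id,
    export=True,
    disable_neuron_cache=True,
    inline_weights_to_neff=True,  # bake the weights into the NEFF
    **compiler_args,
    **input_shapes,
)
stable_diffusion_inlined.save_pretrained("sd21_neuron_matmul_bf16_inlined/")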

Inference

import time

import torch


def example_prompts():
    prompts = [
        "a photo of an astronaut riding a horse on mars",
        "cute grey cat with blue eyes, wearing a bowtie, acrylic painting",
        "a cute magical flying dog, fantasy art drawn by disney concept artists, highly detailed, digital painting",
        "an illustration of a house with large barn with many cute flower pots and beautiful blue sky scenery",
        "one apple sitting on a table, still life, reflective, full color photograph, centered, close-up product",
        "background texture of stones, masterpiece, artistic, stunning photo, award winner photo",
        "new international organic style house, tropical surroundings, architecture, 8k, hdr",
        "beautiful Renaissance Revival Estate, Hobbit-House, detailed painting, warm colors, 8k, trending on Artstation",
        "blue owl, big green eyes, portrait, intricate metal design, unreal engine, octane render, realistic",
        "delicate elvish moonstone necklace on a velvet background, symmetrical intricate motifs, leaves, flowers, 8k",
    ]

    negative_prompt = "bad composition, ugly, abnormal, malformed"

    return prompts, negative_prompt

prompts, negative_prompt = example_prompts()

def inference(pipe):
    NUM_IMAGES_PER_PROMPT = 1
    steps = 30
    seed = 100
    latency_list = []

    for i, prompt in enumerate(prompts):
        inference_start = time.perf_counter()
        g = torch.Generator().manual_seed(seed)
        PIPELINE_GENERATION_CONFIG = {
            "prompt": prompt,
            "negative_prompt": negative_prompt,
            "num_inference_steps": steps,
            "num_images_per_prompt": NUM_IMAGES_PER_PROMPT,
            "guidance_scale": 7.5,
            "output_type": "pil",
            "generator": g,
        }
        image = pipe(**PIPELINE_GENERATION_CONFIG).images[0]
        inference_end = time.perf_counter()
        latency = inference_end - inference_start
        latency_list.append(latency)
        print(f"{latency:.3f} seconds")

    return latency_list

stable_diffusion = NeuronStableDiffusionPipeline.from_pretrained("sd21_neuron_matmul_bf16_non_inlined")
latencies = inference(stable_diffusion)
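If a single number per export is easier to compare than the per-prompt prints, the returned list can be summarized, for example:

import statistics

print(f"mean latency: {statistics.mean(latencies):.3f} s")
print(f"median latency: {statistics.median(latencies):.3f} s")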

Results

We already place the weights on Neuron devices manually through this PR: https://github.com/huggingface/optimum-neuron/pull/584. Is there anything else that we could or should do to improve the latency while the weights/NEFF are not inlined? The current performance of non-inlined models is not encouraging.
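For context, the weight placement mentioned above boils down to something like the following (a hedged sketch of what the linked PR does rather than its exact code; the submodel path is illustrative, and torch_neuronx.move_trace_to_device is the call used to pin the separated weights onto the device once instead of re-uploading them on every forward pass):

import torch
import torch_neuronx

# Illustrative path: the traced UNet saved by the non-inlined export above.
traced_unet = torch.jit.load("sd21_neuron_matmul_bf16_non_inlined/unet/model.neuron")

# Move the traced module (and its separated weights) onto Neuron core 0 up front,
# so the weights are not transferred from host memory at each call.
torch_neuronx.move_trace_to_device(traced_unet, 0)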

jluntamazon commented 2 weeks ago

Hi @JingyaHuang,

When weights are not inlined, there are several effects that can reduce performance.

We can look into this specific model and see which of these effects is causing the poor performance.
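One way to narrow this down is to time a single traced submodel in isolation for both exports; a rough sketch below, assuming the pipeline exposes its traced UNet as pipe.unet and using the usual SD 2.1 UNet input shapes for a 512x512, batch-size-1 call with classifier-free guidance (batch 2 after duplication, 64x64 latents, 1024-dim text embeddings, sequence length 77):

import time

import torch

def time_unet(pipe, iterations=10):
    # Dummy inputs matching the shapes the UNet is assumed to have been traced with.
    sample = torch.randn(2, 4, 64, 64)
    timestep = torch.tensor(999)
    encoder_hidden_states = torch.randn(2, 77, 1024)

    pipe.unet(sample, timestep, encoder_hidden_states)  # warm-up call
    start = time.perf_counter()
    for _ in range(iterations):
        pipe.unet(sample, timestep, encoder_hidden_states)
    return (time.perf_counter() - start) / iterations

print(f"UNet-only latency: {time_unet(stable_diffusion):.3f} s")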