Huage001 / LinFusion

Official PyTorch and Diffusers Implementation of "LinFusion: 1 GPU, 1 Minute, 16K Image"
Apache License 2.0

Comparison of time cost with baseline model #3

Open · PanXiebit opened this issue 1 month ago

PanXiebit commented 1 month ago

Thank you for the great work!

When I tested the time cost of LinFusion against the dreamshaper-8 baseline, LinFusion took 1.8 s per image on average on an A100, while the baseline averaged 1.5 s. Is this reasonable? Here is the LinFusion script:

from diffusers import AutoPipelineForText2Image
import torch

from src.linfusion import LinFusion
from src.tools import seed_everything
from tqdm import tqdm
import os
import time

sd_repo = "dreamshaper-8"

pipeline = AutoPipelineForText2Image.from_pretrained(
    sd_repo, torch_dtype=torch.float16, variant="fp16"
).to(torch.device("cuda"))

linfusion = LinFusion.construct_for(pipeline, pretrained_model_name_or_path="LinFusion-1-5")

seed_everything(123)

with open("prompts/inference_nvidia.txt", "r") as f:
    prompts = f.readlines()

os.makedirs("outputs/linfusion", exist_ok=True)

for prompt in tqdm(prompts):
    prompt = prompt.strip()
    start_time = time.time()
    image = pipeline(prompt, num_inference_steps=50).images[0]
    print("time: ", time.time() - start_time)
    image.save(f'outputs/linfusion/{prompt}.png')

And the baseline dreamshaper-8 script for comparison:

from diffusers import AutoPipelineForText2Image, DEISMultistepScheduler
import torch, os
from src.tools import seed_everything
from tqdm import tqdm
import time

pipe = AutoPipelineForText2Image.from_pretrained('dreamshaper-8', torch_dtype=torch.float16, variant="fp16")
pipe.scheduler = DEISMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")

prompt = "A girl is smiling!"

seed_everything(123)
generator = torch.manual_seed(123)

with open("prompts/inference_nvidia.txt", "r") as f:
    prompts = f.readlines()

os.makedirs("outputs/dreamshape", exist_ok=True)
for prompt in tqdm(prompts):
    prompt = prompt.strip()
    start_time = time.time()
    image = pipe(prompt, generator=generator, num_inference_steps=50).images[0]  
    print("time: ", time.time() - start_time)
    image.save(f"outputs/dreamshape/{prompt}.png")
PanXiebit commented 1 month ago

Moreover, when tested on the A100, the GPU memory usage of LinFusion is 4675 MB, while that of dreamshaper-8 is 4345 MB. This doesn't seem reasonable either.

Huage001 commented 1 month ago

Dear PanXiebit,

Thanks for raising this! We investigated this issue recently and found that the PyTorch version makes the difference. PyTorch 2 optimizes the attention computation with CUDA kernels, so at low resolutions like 512 it can run faster than linear attention. The efficiency advantage of LinFusion becomes more pronounced at higher resolutions.

Thank you again for the question; we will specify our test environment clearly in the next version of the paper.
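
For reference, a minimal benchmarking sketch (assuming the same local dreamshaper-8 checkpoint and the src.linfusion package used in the scripts above, and that LinFusion.construct_for patches the pipeline in place) that compares wall-clock time per image across resolutions, where the linear-attention advantage should show up:

import time
import torch
from diffusers import AutoPipelineForText2Image
from src.linfusion import LinFusion

def average_seconds(pipe, resolution, n_runs=3):
    # Warm-up run so CUDA kernels and memory pools are initialized before timing.
    pipe("a photo of a cat", num_inference_steps=10, height=resolution, width=resolution)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_runs):
        pipe("a photo of a cat", num_inference_steps=50, height=resolution, width=resolution)
    torch.cuda.synchronize()
    return (time.time() - start) / n_runs

# Assumes the same local "dreamshaper-8" checkpoint as in the scripts above.
pipe = AutoPipelineForText2Image.from_pretrained(
    "dreamshaper-8", torch_dtype=torch.float16, variant="fp16"
).to("cuda")

resolutions = (512, 1024, 2048)
baseline_times = {r: average_seconds(pipe, r) for r in resolutions}

# Patch the same pipeline with LinFusion and time it again.
linfusion = LinFusion.construct_for(pipe, pretrained_model_name_or_path="LinFusion-1-5")
linfusion_times = {r: average_seconds(pipe, r) for r in resolutions}

for r in resolutions:
    print(f"{r}x{r}: baseline {baseline_times[r]:.2f}s, LinFusion {linfusion_times[r]:.2f}s")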

FunnyClown commented 1 month ago

Thank you for your response! I have tried 2160 x 3840 resolution and even higher, but at every resolution the performance seems similar to the original SD1.5. The VAE decoding step is the most memory-consuming part, so I printed the maximum memory usage before the VAE step to make sure the VAE's memory cost is excluded. Even so, the memory usage of LinFusion and SD1.5 is nearly the same.
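
For reference, a minimal sketch of that measurement (assuming output_type="latent" skips VAE decoding on this pipeline, as it does for the standard StableDiffusionPipeline, and reusing a pipe object built as in the scripts above):

import torch

torch.cuda.empty_cache()
torch.cuda.reset_peak_memory_stats()

# Run denoising only; stop before the VAE decode so its memory is excluded.
# Assumes `pipe` was built (and optionally patched with LinFusion) as in the scripts above.
latents = pipe(
    "a city street at night, ultra detailed",
    height=2160, width=3840,
    num_inference_steps=50,
    output_type="latent",
).images

print(f"peak memory before VAE: {torch.cuda.max_memory_allocated() / 2**20:.0f} MiB")

# Decode separately; enable_tiling() keeps the VAE's own memory bounded.
pipe.vae.enable_tiling()
with torch.no_grad():
    image = pipe.vae.decode(latents / pipe.vae.config.scaling_factor).sample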

Huage001 commented 1 month ago

Thanks for the question! Indeed, under PyTorch 2 attention is no longer the memory bottleneck, because it is implemented with a block-wise (memory-efficient) strategy. In this setting, the strength of LinFusion lies in time efficiency at high resolutions: it allows the whole image to be processed at once, without patch-wise treatment. We will discuss these benefits in detail in the next version of our paper!
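
As a back-of-the-envelope illustration of why the gap grows with resolution (the 320-channel width used below for the highest-resolution attention stage of the SD1.5 UNet is an assumption for scale only): softmax attention scales quadratically in the number of latent tokens, while linear attention scales roughly linearly.

d = 320  # assumed channel width of the highest-resolution attention stage in the SD1.5 UNet

for h, w in [(512, 512), (1024, 1024), (2160, 3840)]:
    n = (h // 8) * (w // 8)      # latent tokens at that stage
    ratio = (n * n) / (n * d)    # pairwise score entries vs. linear-attention feature products
    print(f"{h}x{w}: N = {n:,} tokens, quadratic/linear cost ratio ~ {ratio:.0f}x")

Under these assumptions the ratio is only about 13x at 512 x 512, where PyTorch 2's fused attention kernels can close the gap, but it grows past 400x at 2160 x 3840.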

FunnyClown commented 1 month ago

Thank you! I have a question regarding memory usage during training, which I haven't had the chance to test yet. Would there be any benefit to training on high-resolution images with LinFusion? I understand that it might be faster, but I'm curious about the memory implications. Have you conducted any tests on this?
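
For anyone who wants to measure this, a minimal sketch (not part of the repo; it assumes a single forward/backward pass of the UNet on random tensors is a reasonable proxy for a training step, and that LinFusion.construct_for patches the UNet in place):

import torch
from diffusers import AutoPipelineForText2Image
from src.linfusion import LinFusion

# Assumes the same local "dreamshaper-8" checkpoint as in the scripts above.
pipe = AutoPipelineForText2Image.from_pretrained(
    "dreamshaper-8", torch_dtype=torch.float16, variant="fp16"
).to("cuda")

def peak_step_memory_mib(unet, height, width):
    # One forward + backward pass on random inputs shaped like SD1.5 training data.
    latents = torch.randn(1, 4, height // 8, width // 8, device="cuda", dtype=torch.float16)
    timestep = torch.randint(0, 1000, (1,), device="cuda")
    text_emb = torch.randn(1, 77, 768, device="cuda", dtype=torch.float16)  # CLIP text embedding shape for SD1.5
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    noise_pred = unet(latents, timestep, encoder_hidden_states=text_emb).sample
    noise_pred.float().pow(2).mean().backward()
    unet.zero_grad(set_to_none=True)
    return torch.cuda.max_memory_allocated() / 2**20

print("baseline UNet:", peak_step_memory_mib(pipe.unet, 1024, 1024), "MiB")
linfusion = LinFusion.construct_for(pipe, pretrained_model_name_or_path="LinFusion-1-5")
print("LinFusion UNet:", peak_step_memory_mib(pipe.unet, 1024, 1024), "MiB")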