Alpha-VLLM / Lumina-T2X

Lumina-T2X is a unified framework for Text to Any Modality Generation
MIT License
1.82k stars · 74 forks

Why doesn't inference time increase with pixel area? #65

Closed · jiashenggu closed this issue 1 week ago

jiashenggu commented 2 weeks ago

When I generate 1024x1024 images on one A800, inference time is 15s/image. However, when I generate 1664x1664 images, inference time is 55s/image.

zhuole1025 commented 2 weeks ago

I am not sure about your question. The inference time increases with the growth of resolution due to the squared complexity in attention blocks. However, we adopt Flash Attention to speed up inference.

jiashenggu commented 2 weeks ago

> I am not sure about your question. The inference time increases with the growth of resolution due to the squared complexity in attention blocks. However, we adopt Flash Attention to speed up inference.

Sorry for the typo: when I generate 1664x1664 images, the inference time is ~55s/image. I think the inference time should be ~39s (~2.6x).

gaopengpjlab commented 2 weeks ago

The time cost of attention is quadratic with respect to sequence length.

jiashenggu commented 2 weeks ago

> The time cost of attention is quadratic with respect to sequence length.

I understand. However, the inference time for Lumina-T2I increases at a rate faster than quadratic. When I increase the sequence length to 1.5 times, I expect the time cost to be ~2.6 times higher, but in reality it increases by ~3.6 times.

ChrisLiu6 commented 1 week ago

> The time cost of attention is quadratic with respect to sequence length.
>
> I understand. However, the inference time for Lumina-T2I increases at a rate faster than quadratic. When I increase the sequence length to 1.5 times, I expect the time cost to be ~2.6 times higher, but in reality it increases by ~3.6 times.

When increasing resolution from 1024 -> 1664, the area is ~2.6 times larger rather than 1.5 times, and so is the number of patches.

jiashenggu commented 1 week ago

> The time cost of attention is quadratic with respect to sequence length.
>
> I understand. However, the inference time for Lumina-T2I increases at a rate faster than quadratic. When I increase the sequence length to 1.5 times, I expect the time cost to be ~2.6 times higher, but in reality it increases by ~3.6 times.
>
> When increasing resolution from 1024 -> 1664, the area is ~2.6 times larger rather than 1.5 times, and so is the number of patches.

Man, I said "I expect the time cost to be ~2.6 times"; I know how to calculate area. The problem is that the increase is greater than 2.6x.

ChrisLiu6 commented 1 week ago

I mean, when increasing resolution from 1024 -> 1664, the sequence length already grows to ~2.6 times (rather than 1.625), because you need to consider the area rather than the edge. Then, due to the quadratic complexity of attention, the time cost can be up to 2.6² ≈ 7 times the original.
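To make the arithmetic concrete, here is a quick sanity check (a sketch that assumes a DiT-style setup with an 8x-downsampling VAE and patch size 2; these are assumptions, and the constants cancel out of the ratios anyway):

```python
# Back-of-the-envelope token counts for the two resolutions discussed above.
# Assumed (not confirmed from the repo): 8x latent downsampling, patch size 2.

def num_patches(edge, downsample=8, patch=2):
    """Number of transformer tokens for a square image with the given edge."""
    latent = edge // downsample
    return (latent // patch) ** 2

n_1024 = num_patches(1024)   # 64 * 64  = 4096 tokens
n_1664 = num_patches(1664)   # 104 * 104 = 10816 tokens

seq_ratio = n_1664 / n_1024  # sequence length grows ~2.64x (area, not edge)
attn_ratio = seq_ratio ** 2  # attention cost bound: ~7x (quadratic in tokens)
print(seq_ratio, attn_ratio)
```

So the observed ~3.6x slowdown is well within the ~7x worst-case bound for the attention term.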

jiashenggu commented 1 week ago

> I mean, when increasing resolution from 1024 -> 1664, the sequence length already grows to ~2.6 times (rather than 1.625), because you need to consider the area rather than the edge. Then, due to the quadratic complexity of attention, the time cost can be up to 2.6² ≈ 7 times the original.

Thank you for your response! I finally get it. Unlike a UNet, the inference time scales as $n^4$ in the edge length $n$: the token count grows as $n^2$, and attention is quadratic in the token count.
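A toy cost model also shows why the measured slowdown (~3.6x) falls between the two extremes: linear layers (MLPs, projections) scale with the token count (~2.6x here), while attention scales with its square (~7x). The split below is made up for illustration, not Lumina's measured profile:

```python
# Toy cost model: total time = linear part + quadratic (attention) part.
# quad_fraction = share of baseline time spent in attention (a made-up knob).

def time_ratio(tokens_ratio, quad_fraction):
    """Overall slowdown when the token count grows by tokens_ratio."""
    linear = (1 - quad_fraction) * tokens_ratio
    quad = quad_fraction * tokens_ratio ** 2
    return linear + quad

r = 2.640625  # token ratio for 1024 -> 1664 (see the thread above)
# With ~22% of baseline time in attention (hypothetical), we get ~3.6x:
print(time_ratio(r, 0.22))
```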