Closed jiashenggu closed 1 week ago
I am not sure about your question. The inference time increases with resolution due to the quadratic complexity of the attention blocks. However, we adopt Flash Attention to speed up inference.
Sorry for the typo: when I generate 1664x1664 images, the inference time is ~55s/image. I think it should be ~39s (~2.6x).
the time cost of attention is quadratic with respect to sequence length.
I understand. However, the inference time for the Lumina T2I increases at a rate faster than quadratic. When I increase the sequence length to 1.5 times, I expect the time cost to be ~2.6 times higher, but in reality, the time cost increases by ~3.6 times.
When increasing the resolution from 1024 -> 1664, the area is ~2.6 times larger rather than 1.5 times, and so is the number of patches.
Man, I said "I expect the time cost to be ~2.6 times" — I know how to calculate area. The problem is that the increase is greater than 2.6x.
I mean, when increasing the resolution from 1024 -> 1664, the sequence length already grows to ~2.6 times (not 1.625), because you need to consider the area rather than the edge. Then, due to the quadratic complexity of attention, the time cost can be up to 2.6^2 ≈ 7 times the original.
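The arithmetic above can be checked in a few lines (a minimal sketch; it assumes the patch size is fixed, so the patch count scales directly with image area):

```python
# Scaling arithmetic for 1024x1024 -> 1664x1664, assuming a fixed patch size
# so the token count is proportional to image area.
edge_ratio = 1664 / 1024        # 1.625x per side
seq_ratio = edge_ratio ** 2     # tokens scale with area: ~2.64x
attn_ratio = seq_ratio ** 2     # attention is quadratic in tokens: ~6.97x

print(f"edge: {edge_ratio:.3f}x, tokens: {seq_ratio:.3f}x, attention: {attn_ratio:.2f}x")
```

So the worst-case attention cost grows by nearly 7x, which is why an observed ~3.6x slowdown is still well within the quadratic bound.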
Thank you for your response! I finally get it. It is different from a UNet: here the attention time scales as $n^4$ (with $n$ the image edge length).
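One way to see why the measured slowdown (~3.6x) lands below the pure-attention bound (~7x): a hypothetical two-term cost model (my assumption, not stated in the thread) where part of the per-image time is linear in the token count (MLPs, projections) and part is quadratic (attention):

```python
# Hypothetical cost model: time = c_lin * T + c_attn * T**2, with T the token count.
# Any linear-in-token work pulls the overall slowdown below the pure-attention bound.
T_ratio = (1664 / 1024) ** 2    # ~2.64x more tokens
observed = 55 / 15              # ~3.67x measured slowdown from the thread

# The observed ratio sits between the all-linear (T_ratio) and
# all-attention (T_ratio**2) extremes.
print(T_ratio < observed < T_ratio ** 2)
```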
When I generate 1024x1024 images on one A800, the inference time is 15s/image. However, when I generate 1664x1664 images, the inference time is 55s/image.