Open PanXiebit opened 1 month ago

Thank you for the great work! When I tested the time cost of LinFusion and the dreamshape-v8-base model, I found that LinFusion took an average of 1.8 s on an A100, while the baseline model averaged 1.5 s. Is this reasonable?
Moreover, when tested on an A100, the GPU memory usage of LinFusion is 4675 MB while that of dreamshape-v8 is 4345 MB. This doesn't seem reasonable either?
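For reference, a minimal sketch of this kind of measurement (the model ID and prompt are placeholders; LinFusion would be applied to `pipe` following the repository's README before re-running the same loop to compare):

```python
# Hypothetical benchmark sketch: average latency and peak GPU memory for a
# diffusers pipeline. Repeat the measurement after applying LinFusion to `pipe`.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "Lykon/dreamshaper-8", torch_dtype=torch.float16  # placeholder checkpoint ID
).to("cuda")
prompt = "a photo of an astronaut riding a horse"

torch.cuda.reset_peak_memory_stats()
# Warm-up run so kernel compilation/autotuning does not skew the timing.
pipe(prompt, num_inference_steps=25, height=512, width=512)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(5):
    pipe(prompt, num_inference_steps=25, height=512, width=512)
end.record()
torch.cuda.synchronize()

print(f"avg latency: {start.elapsed_time(end) / 5 / 1000:.2f} s")
print(f"peak memory: {torch.cuda.max_memory_allocated() / 2**20:.0f} MiB")
```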
Dear PanXiebit,

Thanks for raising this! We have investigated this issue recently and found that the difference comes from the PyTorch version. Since PyTorch 2 optimizes the attention computation with CUDA kernels, at low resolutions such as 512 it can run faster than linear attention. The efficiency advantage of LinFusion is more pronounced at higher resolutions.

Thank you again for the question; we will specify our test environment clearly in the paper's next version.
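One way to see the effect of PyTorch 2's fused attention is a rough micro-benchmark (a sketch, not our paper's protocol): compare `scaled_dot_product_attention` with a generic linear-attention formulation at the token count of a 512x512 SD1.5 latent (64 x 64 = 4096 tokens). The head dimension below is illustrative, and the linear branch is a generic (elu+1) kernel, not LinFusion's exact module:

```python
import torch
import torch.nn.functional as F

def timed(fn, iters=50):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    fn()  # warm-up
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # ms per call

B, H, N, D = 2, 8, 64 * 64, 40  # batch, heads, tokens, head dim (SD1.5-like)
q, k, v = (torch.randn(B, H, N, D, device="cuda", dtype=torch.float16) for _ in range(3))

def sdpa():
    # PyTorch 2 dispatches to fused (flash / memory-efficient) kernels here.
    return F.scaled_dot_product_attention(q, k, v)

def linear_attn():
    # Generic linear attention, O(N * D^2) compute, O(N) memory.
    qp, kp = F.elu(q) + 1, F.elu(k) + 1
    kv = torch.einsum("bhnd,bhne->bhde", kp, v)
    z = 1 / (torch.einsum("bhnd,bhd->bhn", qp, kp.sum(dim=2)) + 1e-6)
    return torch.einsum("bhnd,bhde,bhn->bhne", qp, kv, z)

print(f"SDPA:   {timed(sdpa):.2f} ms")
print(f"linear: {timed(linear_attn):.2f} ms")
```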
Thank you for your response! I have tried 2160 x 3840 resolution and even higher, but at all resolutions the behavior is similar to the original SD1.5. The VAE decoding step is the most memory-consuming, so I printed the max memory usage before the VAE step to make sure the VAE memory cost is excluded. Even so, the memory usage of LinFusion and SD1.5 is nearly the same.
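For concreteness, one way to exclude the VAE decode from the measurement (a sketch reusing the `pipe` and `prompt` placeholders from the snippet above) is to ask the pipeline for latents only and read the peak before decoding:

```python
import torch

torch.cuda.reset_peak_memory_stats()
# output_type="latent" stops the pipeline before the VAE decode.
latents = pipe(prompt, height=2160, width=3840, output_type="latent").images
print(f"peak before VAE decode: {torch.cuda.max_memory_allocated() / 2**20:.0f} MiB")

# Decode separately; tiling may be needed at this size to avoid OOM.
# pipe.vae.enable_tiling()
with torch.no_grad():
    image = pipe.vae.decode(latents / pipe.vae.config.scaling_factor).sample
```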
Thanks for the question! Indeed, under PyTorch 2 attention is no longer the memory bottleneck, because it is implemented with a block-wise strategy. In this setting, the strength of LinFusion lies in time efficiency at high resolution, since it supports taking the whole image for computation without any patch-wise treatment. We will discuss the benefits of doing so in detail in the next version of our paper!
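As a back-of-the-envelope illustration (a sketch, not numbers from the paper): with block-wise attention the activation memory grows roughly linearly with the token count for both variants, while the compute of softmax attention still grows quadratically, so the gap shows up in time rather than memory:

```python
# Rough token/FLOP comparison between softmax and linear attention.
def tokens(h, w):
    return (h // 8) * (w // 8)  # SD1.5 latents are 1/8 of the image resolution

d = 40  # illustrative per-head dimension
for h, w in [(512, 512), (2160, 3840)]:
    n = tokens(h, w)
    softmax_flops = n * n * d  # O(N^2 * d), just the QK^T term
    linear_flops = n * d * d   # O(N * d^2), the K^T V path
    print(f"{h}x{w}: N={n:,}, softmax~{softmax_flops:.2e}, linear~{linear_flops:.2e}")
```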
Thank you! I have a question regarding memory usage during training, which I haven't had a chance to test yet. Would there be any benefit to training on high-resolution images with LinFusion? I understand that it might be faster, but I'm curious about the memory implications. Have you conducted any tests on this?
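In case it helps, a minimal sketch of the probe I have in mind (hypothetical setup, reusing the `pipe` placeholder from above): run one forward/backward pass through the UNet at the target resolution and read the peak allocation, once for the baseline and once with LinFusion applied:

```python
import torch

unet = pipe.unet.train().requires_grad_(True)
h, w = 1024, 1024  # target training resolution
latents = torch.randn(1, 4, h // 8, w // 8, device="cuda", dtype=torch.float16)
timesteps = torch.randint(0, 1000, (1,), device="cuda")
encoder_hidden_states = torch.randn(1, 77, 768, device="cuda", dtype=torch.float16)

torch.cuda.reset_peak_memory_stats()
noise_pred = unet(latents, timesteps, encoder_hidden_states).sample
noise_pred.float().pow(2).mean().backward()  # dummy loss, only to trigger backward
print(f"peak train-step memory: {torch.cuda.max_memory_allocated() / 2**20:.0f} MiB")
```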