ChristophReich1996 / Swin-Transformer-V2

PyTorch reimplementation of the paper "Swin Transformer V2: Scaling Up Capacity and Resolution" [CVPR 2022].
https://arxiv.org/abs/2111.09883
MIT License

Problems encountered #3

Closed Breeze-Zero closed 2 years ago

Breeze-Zero commented 2 years ago

Hello, I have run into a few small problems while using SwinV2 recently and would like to ask about them here.

  1. When my input size is small, such as 96×96 with window_size=8, the following error is raised at https://github.com/ChristophReich1996/Swin-Transformer-V2/blob/cff3824f5d75dfd93553867efcf53562c24dd555/swin_transformer_v2/model_parts.py#L289: RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead. (A rough sketch of the configuration is at the end of this comment.)

  2. The paper uses sequential self-attention computation to save GPU memory, but for large input images setting sequential_self_attention=True results in OOM, while sequential_self_attention=False does not.

  3. When I update the code at https://github.com/ChristophReich1996/Swin-Transformer-V2/blob/cff3824f5d75dfd93553867efcf53562c24dd555/swin_transformer_v2/model_parts.py#L207, GPU memory usage increases considerably as well.

I am currently experimenting with applying SwinV2, as an efficient and memory-saving network, to 3D data, so I pay particular attention to GPU memory usage.
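
For reference, here is a rough sketch of the setup I mean for points 1 and 2; the factory function swin_transformer_v2_t and its argument names are taken from my reading of the README and may not match the exact signature:

```python
import torch

# Assumption: the tiny-model factory and argument names from the repository's README.
from swin_transformer_v2 import swin_transformer_v2_t

model = swin_transformer_v2_t(
    input_resolution=(96, 96),        # small input size from point 1
    window_size=8,
    sequential_self_attention=True,   # point 2: True runs out of memory, False does not
)
features = model(torch.randn(1, 3, 96, 96))
```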

ChristophReich1996 commented 2 years ago

Hi @834799106,

Thanks for your interest in my reimplementation, and thanks for pointing out these issues. Feel free to leave a star there if you liked the code :)

  1. I have been able to reproduce the bug. It appears that for some (but not all) model configurations and input shapes the input tensor to the window attention is not contiguous in memory, as the error message states. Why this happens only for some shapes is currently not clear to me, but simply replacing the view with a reshape solved the issue (see the first sketch below).
  2. As mentioned in the README, this is a very experimental implementation, and so is the sequential self-attention. The paper only states that a sequential self-attention computation is employed, nothing more (see section 3.4, paragraph "sequential self-attention computation", of the paper). One could imagine other possible implementations, for example, iterating sequentially over the attention heads; to me, the most obvious one was to iterate over the query tokens. I don't think the sequential self-attention itself saves memory; it needs to be combined with clever checkpointing, since in the current sequential self-attention all intermediate tensors are probably still stored for the backward pass. I haven't had time to look deeper into it yet. Maybe you can find a way to make the sequential implementation more memory efficient, probably in combination with checkpointing (see the second sketch below). Feel free to open a pull request if you find a solution!
  3. This change refers to issue #1. As mentioned there, every entry of the attention matrix now gets normalized, resulting in a bigger intermediate tensor and therefore a bigger memory footprint.
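
Regarding point 1, here is a generic PyTorch illustration (not the code of this repository) of why .view fails on a non-contiguous tensor while .reshape does not:

```python
import torch

x = torch.randn(2, 4, 8, 8).permute(0, 2, 3, 1)  # permute breaks contiguity
print(x.is_contiguous())  # False

try:
    x.view(2, -1)  # raises the same RuntimeError as in model_parts.py
except RuntimeError as error:
    print(error)

y = x.reshape(2, -1)  # works: reshape copies the data when needed
# An equivalent explicit fix would be x.contiguous().view(2, -1)
```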
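
Regarding point 2, here is a minimal sketch (not the implementation of this repository) of how iterating over chunks of query tokens could be combined with torch.utils.checkpoint, so that the per-chunk attention matrix is recomputed in the backward pass instead of being stored:

```python
import torch
from torch.utils.checkpoint import checkpoint


def sequential_attention(q, k, v, chunk_size=64):
    """Hypothetical sketch: q, k, v have the shape [batch * heads, tokens, head_dim]."""
    scale = q.shape[-1] ** -0.5

    def attend(q_chunk, k, v):
        attention = (q_chunk * scale) @ k.transpose(-2, -1)  # [B * H, chunk, tokens]
        attention = attention.softmax(dim=-1)
        return attention @ v  # [B * H, chunk, head_dim]

    outputs = []
    for start in range(0, q.shape[1], chunk_size):
        q_chunk = q[:, start:start + chunk_size]
        if torch.is_grad_enabled() and q.requires_grad:
            # Checkpointing drops the [chunk, tokens] attention matrix after the forward
            # pass and recomputes it during backward, trading compute for memory.
            outputs.append(checkpoint(attend, q_chunk, k, v))
        else:
            outputs.append(attend(q_chunk, k, v))
    return torch.cat(outputs, dim=1)
```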

Cheers, Christoph