Closed: shkarupa-alex closed this issue 2 years ago
In the original Swin implementation, the last BasicLayer (with 2 SwinTransformerBlocks) does not use an attention mask:
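For reference, here is a minimal standalone sketch of that logic, paraphrased from the upstream PyTorch `SwinTransformerBlock` (the `compute_attn_mask` helper is hypothetical, introduced here only to make the guard runnable in isolation):

```python
import torch

def compute_attn_mask(input_resolution, window_size, shift_size):
    """Sketch of the mask logic in the official SwinTransformerBlock.

    Mirrors the upstream guard: when the window covers the whole feature
    map, the cyclic shift is disabled and no attention mask is built.
    """
    H, W = input_resolution
    if min(input_resolution) <= window_size:
        # Window >= input resolution (the last-stage case): disable the
        # shift and clamp the window. This is the condition in question.
        shift_size = 0
        window_size = min(input_resolution)
    if shift_size == 0:
        # Non-shifted W-MSA blocks need no mask.
        return None
    # Shifted SW-MSA: label the nine regions created by the cyclic shift
    # and forbid attention between tokens from different regions.
    img_mask = torch.zeros((1, H, W, 1))
    slices = (slice(0, -window_size),
              slice(-window_size, -shift_size),
              slice(-shift_size, None))
    cnt = 0
    for h in slices:
        for w in slices:
            img_mask[:, h, w, :] = cnt
            cnt += 1
    # Partition the label map into (num_windows, window_size * window_size).
    mask_windows = img_mask.view(1, H // window_size, window_size,
                                 W // window_size, window_size, 1)
    mask_windows = mask_windows.permute(0, 1, 3, 2, 4, 5).reshape(
        -1, window_size * window_size)
    # Pairs with differing labels get a large negative bias before softmax.
    attn_mask = mask_windows.unsqueeze(1) - mask_windows.unsqueeze(2)
    return attn_mask.masked_fill(attn_mask != 0, -100.0)
```

With this sketch, `compute_attn_mask((7, 7), 7, 3)` returns `None` (the last stage, where the window covers the whole map), while `compute_attn_mask((14, 14), 7, 3)` returns a real mask.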
But your SwinTransformerBlock implementation does not include such a condition, so the first SwinTransformerBlock will be computed WITH an attention mask.
Is this an error, or did you do it on purpose? Will it harm performance or boost it?
The Swin Transformer backbone is taken from the official implementation for semantic segmentation without any modification: https://github.com/SwinTransformer/Swin-Transformer-Semantic-Segmentation
I would suggest redirecting your question to the original Swin Transformer authors.