Open tsw123tsw opened 11 months ago
Hi,
Sorry for the late reply; I was busy with other things this quarter. You ask a good question: whether cyclic window shifting causes overlap between pieces, or wrong information or information loss in the features, during the downsampling process.
My answer is no. It is not a problem because we downsample by 2 each time, three times in total, for an overall rate of 2 x 2 x 2 = 8. In this case, the features "at the edge" of each (256, 64) piece will not share wrong information with another piece, because a downsample rate of 8 is not big enough to compress (256, 64) into (1, 1). I considered this when I did the project, which is why I think it is not a problem.
But if you increase the downsample rate, it would become a problem. In that case, I think giving up the conversion from (1024, 64) to (256, 256) and processing the (1024, 64) shape directly would be an option. Overall, the reason I do the conversion is that we need to use the Swin Transformer checkpoint.
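The arithmetic in this reply can be checked with a minimal sketch. It only tracks how the shape of one (256, 64) piece shrinks under repeated halving; `downsample` is a hypothetical helper written for illustration, not a function from the repository, and it does not model any patch embedding that may come before the downsampling stages.

```python
def downsample(shape, stages, factor=2):
    """Shape of a 2-D feature map after `stages` downsampling steps.

    Hypothetical helper: each step divides both dimensions by `factor`,
    clamping at 1 so a dimension never drops to 0.
    """
    h, w = shape
    for _ in range(stages):
        h, w = max(1, h // factor), max(1, w // factor)
    return (h, w)


piece = (256, 64)

# Three stages of 2x downsampling, overall rate 2 * 2 * 2 = 8.
print(downsample(piece, 3))

# Number of 2x stages needed before the whole piece collapses to (1, 1).
stages = 0
while downsample(piece, stages) != (1, 1):
    stages += 1
print(stages)
```

Under these assumptions the piece is still (32, 8) after three stages, and it takes eight halvings (an overall rate of 256) before it collapses to (1, 1).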
Hi, awesome repo. I have a question regarding the architecture and token interaction. Don't you think the way HTSAT creates the (256, 256) tensor from the (1024, 64) spectrogram causes problematic token interaction during cyclic window shifting? What I understood is that you cut the (1024, 64) spectrogram into 4 PIECES along dim=0, (256, 64) each. These 4 are then concatenated along dim=1, resulting in a final tensor of shape (256, 256). When you apply the cyclic window shift to this tensor, a window ends up comprising tokens from two different PIECES, that is, some from the low-frequency region of PIECE 1 and some from the high-frequency region of PIECE 2.
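The concern in this question can be illustrated with a minimal sketch along the concatenated axis (dim=1). It follows the layout described in the question, not the repository's actual code. The patch size (4), window size (8), and shift (4) are assumptions chosen for illustration, and `pieces_per_window` is a hypothetical helper.

```python
def pieces_per_window(n_pieces=4, piece_width=64, patch=4, window=8, shift=0):
    """For each attention window along dim=1, list which PIECES its tokens come from.

    Assumed values (not taken from the thread): patch size 4, window size 8.
    """
    tokens_per_piece = piece_width // patch      # 64 / 4 = 16 tokens per piece
    n_tokens = n_pieces * tokens_per_piece       # 64 tokens along dim=1
    # Label every token with the index of the piece it came from.
    ids = [t // tokens_per_piece for t in range(n_tokens)]
    # Cyclic shift: move the first `shift` tokens to the end,
    # the same direction as torch.roll(x, shifts=-shift).
    rolled = ids[shift:] + ids[:shift]
    # Partition into non-overlapping windows and collect the piece labels in each.
    return [sorted(set(rolled[i:i + window])) for i in range(0, n_tokens, window)]


regular = pieces_per_window(shift=0)
shifted = pieces_per_window(shift=4)
print(regular)
print(shifted)
```

With these assumed sizes, every regular window stays inside one piece, while every second shifted window straddles a piece boundary, which is the mixing the question describes. The last shifted window also wraps around from the final piece to the first; Swin Transformer applies an attention mask to windows that wrap around in this way, but the sketch does not model that mask.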