cyclic window shifting in the (256,256) tensor

RetroCirce / HTS-Audio-Transformer

The official code repo of "HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection"

MIT License


Hi,

Sorry for the late reply; I was busy with other things this quarter. You ask a good question: whether the cyclic window shifting will cause feature overlap or incorrect information loss during the downsampling process.

My answer is no. The reason is that we downsample by 2 each time, three times, for a total factor of 2 x 2 x 2 = 8. In this case, the features "at the edge" of each (256, 64) piece will not pass wrong information into the other pieces, because a downsample rate of 8 is not large enough to compress (256, 64) into (1, 1). I considered this when I did the project, which is why I do not think it is a problem.
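To make the arithmetic concrete, here is a minimal sketch that tracks the size of one (256, 64) piece through three downsampling stages of factor 2. The helper name `downsample_shape` is made up for illustration and is not part of the repository; it only does the shape bookkeeping, not the actual patch merging.

```python
def downsample_shape(shape, factor=2, stages=3):
    """Return the list of shapes seen before and after each downsampling stage."""
    shapes = [shape]
    for _ in range(stages):
        t, f = shapes[-1]
        # Each stage divides both axes by the same integer factor.
        shapes.append((t // factor, f // factor))
    return shapes


piece = (256, 64)  # one piece of the (256, 256) tensor
shapes = downsample_shape(piece)
for stage, s in enumerate(shapes):
    print(f"after {stage} stage(s): {s}")

final_shape = shapes[-1]  # (32, 8): still far from (1, 1)
```

With a total rate of 8, each piece is still 32 x 8 at the end, so it has not been collapsed to a single position.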

But if you increase the downsample rate, it would become a problem. In that case, I think one option is to give up the conversion from (1024, 64) to (256, 256) and process the (1024, 64) shape directly. Overall, the reason I do the conversion is that we need to use the Swin Transformer checkpoint.
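For readers who want to see where the piece edges end up, below is a rough sketch of one way to fold a (1024, 64) input into (256, 256) by placing four (256, 64) pieces side by side. The function `fold_time_into_freq` and the piece ordering are assumptions for illustration only; the actual reshape in the repository may order or interleave the pieces differently.

```python
def fold_time_into_freq(spec, n_pieces=4):
    """Split `spec` (T rows, F columns) into n_pieces chunks along T
    and lay the chunks side by side along the second axis."""
    total_t = len(spec)
    assert total_t % n_pieces == 0, "T must be divisible by n_pieces"
    piece_len = total_t // n_pieces
    folded = []
    for t in range(piece_len):
        row = []
        for p in range(n_pieces):
            # Row t of piece p comes from original row p * piece_len + t.
            row.extend(spec[p * piece_len + t])
        folded.append(row)
    return folded


# Label every cell with its original (time, freq) index so edges are easy to spot.
spec = [[(t, f) for f in range(64)] for t in range(1024)]
img = fold_time_into_freq(spec)

print(len(img), len(img[0]))   # shape of the folded tensor
print(img[0][63], img[0][64])  # neighbouring columns across a piece edge
```

Under this assumed layout, columns 63 and 64 of the folded tensor are neighbours even though they come from different pieces, which is exactly the "edge" situation discussed above.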

cyclic window shifting in the (256,256) tensor #47