RetroCirce / HTS-Audio-Transformer

The official code repo of "HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection"
https://arxiv.org/abs/2202.00874
MIT License

cyclic window shifting in the (256,256) tensor #47

Open tsw123tsw opened 11 months ago

tsw123tsw commented 11 months ago

Hi, awesome repo. I have a question about the architecture and token interaction. Doesn't the way HTS-AT builds the (256, 256) tensor from the (1024, 64) spectrogram cause problematic token interaction under cyclic window shifting? As I understand it, you cut the (1024, 64) spectrogram into 4 PIECES along dim=0, each of shape (256, 64), and then concatenate these 4 pieces along dim=1 to get the final (256, 256) tensor. When you apply a cyclic window shift to this tensor, a window can contain tokens from two different PIECES, that is, some from the low-frequency region of PIECE 1 and some from the high-frequency region of PIECE 2.
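A minimal sketch of the concern, using only index arithmetic along dim=1 of the (256, 256) tensor. It follows the layout as described in this question, not the repo's actual code. The patch size of 4 and window size of 8 are assumptions (they are not stated in this thread), and `piece_of`, `windows`, and `mixed_windows` are hypothetical helper names.

```python
# Layout as described in the question: 4 pieces of shape (256, 64),
# placed side by side along dim=1 of the (256, 256) tensor.
PIECES = 4
PIECE_WIDTH = 64
PATCH = 4            # ASSUMPTION: patch size, not stated in the thread
WINDOW = 8           # ASSUMPTION: attention window size in tokens
SHIFT = WINDOW // 2  # Swin-style cyclic shift of half a window

TOKENS_PER_ROW = PIECES * PIECE_WIDTH // PATCH  # token columns in one row
TOKENS_PER_PIECE = PIECE_WIDTH // PATCH         # token columns per piece


def piece_of(col):
    """Index of the piece that a token column belongs to."""
    return col // TOKENS_PER_PIECE


def windows(shift):
    """Original column indices in each window after rolling by -shift."""
    # After a roll by -shift, position i holds original column (i + shift) % n.
    cols = [(i + shift) % TOKENS_PER_ROW for i in range(TOKENS_PER_ROW)]
    return [cols[s:s + WINDOW] for s in range(0, TOKENS_PER_ROW, WINDOW)]


def mixed_windows(shift):
    """Windows whose columns come from more than one piece."""
    return [w for w in windows(shift) if len({piece_of(c) for c in w}) > 1]


print("unshifted windows spanning two pieces:", len(mixed_windows(0)))
print("shifted windows spanning two pieces:", len(mixed_windows(SHIFT)))
```

Under these assumptions, the unshifted windows align with the piece boundaries, while the shifted windows straddle them. One of the straddling windows is the wrap-around window, which Swin-style attention normally masks; the others sit on interior piece boundaries.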

RetroCirce commented 9 months ago

Hi,

Sorry for the late reply; I was busy with other things this quarter. That is a good question: does cyclic window shifting cause the features of different pieces to overlap, or mix wrong information into them, during the downsampling process?

My answer is no. The reason is that we downsample by 2 each time, three times in total, for an overall rate of 2 x 2 x 2 = 8. At this rate, the features "at the edge" of each (256, 64) piece will not pass wrong information into another piece, because a downsample rate of 8 is not large enough to compress (256, 64) into (1, 1). I considered this when I did the project, which is why I do not think it is a problem.
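The arithmetic in this reply can be sketched as follows. The snippet only counts the three 2x downsampling steps mentioned here and ignores any other size reduction in the model (for example, a patch embedding); `piece_shape_after` is a hypothetical helper name.

```python
def piece_shape_after(num_downsamples, shape=(256, 64), rate=2):
    """Shape of one piece after repeated downsampling by `rate` per axis."""
    height, width = shape
    for _ in range(num_downsamples):
        height, width = height // rate, width // rate
    return height, width


# Print the piece shape after 0, 1, 2 and 3 downsampling steps.
for n in range(4):
    print(n, piece_shape_after(n))
```

With three steps, each piece is reduced by a factor of 8 per axis, so it keeps more than one element along both axes and is not collapsed to (1, 1).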

However, if you increase the downsample rate, it would become a problem. In that case, one option is to give up the conversion from (1024, 64) to (256, 256) and process the (1024, 64) shape directly. Overall, the reason I do the conversion is that we need to use the Swin Transformer checkpoint.