RetroCirce / HTS-Audio-Transformer

The official code repo of "HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection"
https://arxiv.org/abs/2202.00874
MIT License

Question about reshape log_mel to img size #15

Closed lyc1993 closed 2 years ago

lyc1993 commented 2 years ago

Hi,

Could you explain a little more about how the reshape_wav2img function works?

Why can we simply interpolate T and F to spec_size? What do target_T and target_F represent here?

A detailed explanation of the reshape process would be appreciated. Thank you!

RetroCirce commented 2 years ago

Yes, I can explain. reshape_wav2img reshapes the Mel-spectrogram into a size that can be fed to HTS-AT. Since HTS-AT is inherited from Swin-Transformer, and one of our experiments uses the Swin-Transformer pretrained model, we need to turn the Mel-spectrogram input (1024, 64) into (256, 256).

To do this, we need to care about what the first 256 and the second 256 mean in terms of the original (1024, 64) audio Mel-spec. As you may know from our paper, the order of the sequence fed to the transformer is important (time->frequency->window). reshape_wav2img takes care of this ordering during the reshape.
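To make the reshaping concrete, here is a minimal NumPy sketch of one way to fold a (1024, 64) Mel-spectrogram into a (256, 256) image by cutting the time axis into consecutive chunks and placing them side by side along the frequency axis. The function name `fold_time_into_freq`, the parameter `freq_ratio=4`, and the axis order of the output are assumptions made for illustration; the repository's reshape_wav2img works on batched tensors and may lay out the axes differently.

```python
import numpy as np

def fold_time_into_freq(mel, freq_ratio=4):
    """Fold a (T, F) array into a (T // freq_ratio, F * freq_ratio) image.

    Hypothetical helper: the time axis is cut into `freq_ratio` consecutive
    chunks, and the chunks are placed next to each other along frequency.
    """
    T, F = mel.shape
    assert T % freq_ratio == 0, "T must be divisible by freq_ratio"
    chunk = T // freq_ratio
    # (T, F) -> (freq_ratio, chunk, F): split time into consecutive chunks
    x = mel.reshape(freq_ratio, chunk, F)
    # (freq_ratio, chunk, F) -> (chunk, freq_ratio, F)
    x = x.transpose(1, 0, 2)
    # (chunk, freq_ratio, F) -> (chunk, freq_ratio * F)
    return x.reshape(chunk, freq_ratio * F)

mel = np.arange(1024 * 64).reshape(1024, 64)
img = fold_time_into_freq(mel)
print(img.shape)
```

With this layout, row `t` of the image holds frame `t` of every chunk, so columns 0 to 63 come from the first 256 frames, columns 64 to 127 from the next 256 frames, and so on.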

T and F are the time and frequency sizes of the current audio Mel-spec input, and Target_T=256 and Target_F=256. Sometimes your input might not be (1024, 64), e.g. only (1000, 64). Then we need to interpolate 1000 to 1024, and likewise for the frequency axis.
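As an illustration of the interpolation step, the sketch below stretches a (1000, 64) input to (1024, 64) along the time axis. It uses plain linear interpolation via `np.interp` for simplicity, which is an assumption; the repository may use a different interpolation mode. The function name `stretch_time_axis` and its `target_T` default are hypothetical.

```python
import numpy as np

def stretch_time_axis(mel, target_T=1024):
    """Linearly resample a (T, F) array along time to (target_T, F)."""
    T, F = mel.shape
    if T == target_T:
        return mel
    # Positions of the new frames in old-frame coordinates, with the
    # first and last frames aligned.
    old_pos = np.arange(T)
    new_pos = np.linspace(0, T - 1, target_T)
    # np.interp is 1-D, so resample each Mel bin separately.
    cols = [np.interp(new_pos, old_pos, mel[:, f]) for f in range(F)]
    return np.stack(cols, axis=1)

mel = np.random.rand(1000, 64)
out = stretch_time_axis(mel)
print(out.shape)
```

The same idea applies to the frequency axis if the number of Mel bins is smaller than expected.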

Another way is to repeat the input. For example, if you only have (512, 64), you can repeat it twice to make it (1024, 64). That is what repeat_wav2img does.
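The repeating alternative can be sketched in NumPy as follows. This is only an illustration of the idea, assuming the target length is an exact multiple of the input length; the helper name `repeat_time_axis` is hypothetical and this is not the repository's code.

```python
import numpy as np

def repeat_time_axis(mel, target_T=1024):
    """Tile a (T, F) array along time until it reaches (target_T, F)."""
    T, F = mel.shape
    assert target_T % T == 0, "target_T must be a multiple of T"
    # Repeat the whole clip back to back along the time axis.
    return np.tile(mel, (target_T // T, 1))

mel = np.arange(512 * 64).reshape(512, 64)
out = repeat_time_axis(mel)
print(out.shape)
```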