Question about reshape log_mel to img size

Yes, I can explain. reshape_wav2img reshapes the Mel-spectrogram into a size that can be fed to HTS-AT. HTS-AT is inherited from Swin Transformer, and one of our experiments uses the Swin Transformer pretrained model, so we need to turn the (1024, 64) Mel-spectrogram input into (256, 256). Both shapes hold the same number of values (1024 x 64 = 256 x 256 = 65,536), so this is a rearrangement rather than a resize.

To do this, we need to be careful about what the first 256 and the second 256 mean in terms of the original (1024, 64) audio Mel-spectrogram. As described in our paper, the order of the sequence fed to the transformer matters (time->frequency->window). reshape_wav2img does the reshaping so that this order is preserved.
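To make the index mapping concrete, here is a minimal pure-Python sketch of one way to fold the time axis into the frequency axis. It is an illustration under assumptions, not the repository's code: the function name `fold_time_into_freq`, the `ratio` parameter, and the choice of stacking consecutive time windows along the frequency axis are hypothetical, and the exact axis order used by reshape_wav2img may differ.

```python
def fold_time_into_freq(spec, ratio):
    """Fold a (T, F) spectrogram into a (ratio * F, T // ratio) image.

    spec is a list of T time frames, each a list of F frequency bins.
    The time axis is cut into `ratio` consecutive windows, and the windows
    are stacked along the frequency axis (assumed layout, for illustration).
    """
    n_time = len(spec)
    n_freq = len(spec[0])
    if n_time % ratio != 0:
        raise ValueError("T must be divisible by ratio")
    win = n_time // ratio  # number of time frames per window

    image = []
    for w in range(ratio):        # which time window
        for f in range(n_freq):   # which frequency bin inside that window
            # One image row = one frequency bin followed across one window.
            image.append([spec[w * win + t][f] for t in range(win)])
    return image


# Toy input with T=8, F=2; every cell stores its own (t, f) index.
toy = [[(t, f) for f in range(2)] for t in range(8)]
toy_image = fold_time_into_freq(toy, ratio=4)

# Full-size case: (1024, 64) with ratio 4 gives 256 rows of 256 values.
full = [[0.0] * 64 for _ in range(1024)]
full_image = fold_time_into_freq(full, ratio=4)
```

In the toy case, row 2 of the result holds frequency bin 0 of the second time window, i.e. frames 2 and 3, which shows how each row stays contiguous in time.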

T and F are the sizes of the current audio Mel-spectrogram input, and the targets are Target_T=256 and Target_F=256. Sometimes your input is not exactly (1024, 64), e.g. only (1000, 64). In that case we interpolate 1000 up to 1024, and the frequency axis is handled the same way.
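As a rough illustration of stretching a short input, the sketch below linearly interpolates a (1000, 64) spectrogram along the time axis to (1024, 64). Linear interpolation is a simplification chosen here to keep the example dependency-free; the interpolation mode used in the repository may be different, and the name `stretch_time_axis` is hypothetical.

```python
def stretch_time_axis(spec, target_t):
    """Linearly interpolate a (T, F) spectrogram along time to (target_t, F).

    The first and last frames of the input map to the first and last
    frames of the output (end points are aligned).
    """
    src_t = len(spec)
    if src_t == target_t:
        return [row[:] for row in spec]
    if src_t < 2 or target_t < 2:
        raise ValueError("need at least two frames to interpolate")

    n_freq = len(spec[0])
    scale = (src_t - 1) / (target_t - 1)
    out = []
    for i in range(target_t):
        pos = i * scale                # fractional position in the input
        lo = min(int(pos), src_t - 2)  # left neighbour, kept in range
        frac = pos - lo                # weight of the right neighbour
        out.append([
            (1.0 - frac) * spec[lo][f] + frac * spec[lo + 1][f]
            for f in range(n_freq)
        ])
    return out


# A (1000, 64) input whose value equals the frame index, stretched to 1024.
short = [[float(t)] * 64 for t in range(1000)]
stretched = stretch_time_axis(short, 1024)
```

The same function could be applied to the transposed spectrogram to stretch the frequency axis.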

Another option is repeating. For example, if you only have (512, 64), you can repeat it twice along the time axis to make it (1024, 64). That is what repeat_wav2img does.
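The repeating idea can be sketched in a few lines. This is only an illustration of tiling along the time axis, assuming the target length is an exact multiple of the input length; the helper name `repeat_time_axis` is hypothetical, and repeat_wav2img in the repository may handle other cases or tile differently.

```python
def repeat_time_axis(spec, target_t):
    """Tile a (T, F) spectrogram along time until it has target_t frames."""
    src_t = len(spec)
    if src_t == 0 or target_t % src_t != 0:
        raise ValueError("target_t must be a positive multiple of T")
    repeats = target_t // src_t
    # Copy each frame so the output does not share rows with the input.
    return [row[:] for _ in range(repeats) for row in spec]


# A (512, 64) input repeated twice to reach (1024, 64).
half = [[t] * 64 for t in range(512)]
tiled = repeat_time_axis(half, 1024)
```

Compared with interpolation, repeating keeps the original frames unchanged, at the cost of a hard boundary where the clip restarts.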

RetroCirce / HTS-Audio-Transformer, issue #15