RetroCirce / HTS-Audio-Transformer

The official code repo of "HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection"
https://arxiv.org/abs/2202.00874
MIT License

Question regarding framewise output timesteps #4

Closed leanderme closed 2 years ago

leanderme commented 2 years ago

Hi, thank you for sharing this!

I'm trying to use HTS-AT for SED with strong labels, i.e. with known onset and offset times. With the default AudioSet config, the input spectrogram has 1001 timesteps (so a strong-label target at input resolution would be (batch_size, 1001, 527)), whereas the framewise output has shape (batch_size, 1024, 527), as produced in the forward_features method of the HTSAT_Swin_Transformer class by:

# upsample the token-level predictions along the time axis to yield the (batch, 1024, 527) framewise output
if self.config.htsat_attn_heatmap:
    fpx = interpolate(torch.sigmoid(x).permute(0,2,1).contiguous() * attn, 8 * self.patch_stride[1])
else:
    fpx = interpolate(torch.sigmoid(x).permute(0,2,1).contiguous(), 8 * self.patch_stride[1])

Now I wonder what the best strategy for computing the loss between the framewise output and the target labels would be. Normally, I would just generate a target label with the same number of timesteps as the input spectrogram and then optimize the BCE.

So the question is: Would you, in this case, resize the framewise output to the same size as the input timesteps and then proceed as described above? Or is there a better way?

To be more specific, would something like this make sense:

import numpy as np

def out_frames(sec, timesteps=1024, max_seconds=10):
    # map a time in seconds to a (fractional) frame index on the 1024-frame grid
    return sec * (timesteps / max_seconds)

label       = np.zeros((1024, 527))
tmp_data    = np.array([ # sample data holding event labels for a given audio file
    [0,   1.5, 0], # onset, offset, class_id
    [9.85, 10, 1]
])

frame_start = np.floor(out_frames(tmp_data[:, 0])).astype(int)
frame_end   = np.ceil(out_frames(tmp_data[:, 1])).astype(int)
se_class    = tmp_data[:, 2].astype(int)

for ind, val in enumerate(se_class):
    label[frame_start[ind]:frame_end[ind], val] = 1 # 1 for active

"""
Resulting in:
array([[1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.]])
"""

If the spectrogram and the framewise output had the same number of timesteps, I would normally calculate frame_start and frame_end by

frame_start = np.floor(tmp_data[:, 0] * sr / hop_len).astype(int)
frame_end  = np.ceil(tmp_data[:, 1] * sr / hop_len).astype(int)

The image below visualizes the timestep mismatch with respect to the input spectrogram.
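For completeness, the alternative I mentioned above (resizing the framewise output back to the input timesteps before computing the BCE) would look roughly like this. Just a sketch with dummy tensors, not tested against the actual model output:

import torch
import torch.nn.functional as F

framewise_output = torch.rand(4, 1024, 527)            # HTS-AT framewise output (after sigmoid)
target = torch.randint(0, 2, (4, 1001, 527)).float()   # strong labels at input resolution

# resize the time axis from 1024 back to 1001: (B, T, C) -> (B, C, T) -> interpolate -> (B, T, C)
resized = F.interpolate(
    framewise_output.permute(0, 2, 1), size=1001, mode="linear", align_corners=False
).permute(0, 2, 1)

loss = F.binary_cross_entropy(resized, target)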

Any help is greatly (!) appreciated and thanks again for sharing your code!

RetroCirce commented 2 years ago

Hi,

Yes, as you mentioned, the whole processing pipeline is roughly: (1001, 64) -> (1024, 64) -> (small dim, 527) -> (1024, 527).

So at the start, you need to decide how to go from (1001, 64) to (1024, 64) on your original input. In my implementation in htsat.py, in the reshape_wav method, there are two ways: (1) zero-pad from 1001 to 1024, or (2) interpolate. I believe it currently uses (2).
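As a rough illustration (simplified, not the exact code in htsat.py), the two options on a (batch, 1001, 64) mel spectrogram look like this:

import torch
import torch.nn.functional as F

x = torch.rand(4, 1001, 64)  # (batch, time, mel_bins)

# (1) zero-pad the time axis from 1001 to 1024
x_pad = F.pad(x, (0, 0, 0, 1024 - x.shape[1]))  # pad order: (mel_left, mel_right, time_left, time_right)

# (2) interpolate the time axis from 1001 to 1024
x_interp = F.interpolate(
    x.permute(0, 2, 1), size=1024, mode="linear", align_corners=False
).permute(0, 2, 1)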

If you use (1) padding, frames 1001 to 1024 contain only zeros (i.e. no useful data). In that case, I would recommend trimming the (1024, 527) output from HTS-AT to 1001 frames and computing the BCE loss on that.
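For example (just a sketch with dummy tensors):

import torch
import torch.nn.functional as F

framewise_output = torch.rand(4, 1024, 527)                   # HTS-AT output (after sigmoid)
strong_target = torch.randint(0, 2, (4, 1001, 527)).float()   # strong labels at the original resolution

trimmed = framewise_output[:, :1001, :]   # drop frames 1001-1023, which only saw zero padding
loss = F.binary_cross_entropy(trimmed, strong_target)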

If you use (2) interpolation, the original input is stretched from 1001 to 1024 frames, so every frame contains audio information. You can then directly do what your out_frames method does and map the strong labels from 1001 to 1024 as well, which is the correct way.

Between these two methods, I would recommend (2), and your out_frames definitely does the right thing.
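So with (2), the loss is simply computed on the 1024-frame grid; here is a minimal sketch (dummy output tensor) using a label built the way your out_frames does it:

import numpy as np
import torch
import torch.nn.functional as F

label = np.zeros((1024, 527), dtype=np.float32)
label[0:154, 0] = 1       # event from 0 s to 1.5 s mapped onto the 1024-frame grid
label[1008:1024, 1] = 1   # event from 9.85 s to 10 s

framewise_output = torch.rand(1, 1024, 527)     # HTS-AT output (after sigmoid)
target = torch.from_numpy(label).unsqueeze(0)   # (1, 1024, 527)

loss = F.binary_cross_entropy(framewise_output, target)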