DCASE-REPO / DESED_task

Domestic environment sound event detection task
MIT License

Softmax in attention block #55

Closed: Moadab-AI closed this issue 2 years ago

Moadab-AI commented 2 years ago

Just realized something in the base model: https://github.com/DCASE-REPO/DESED_task/blob/master/desed_task/nnet/CRNN.py

Could you please elaborate on why the attention block applies softmax across the class dimension? I believe the softmax here should be applied across the time axis.

self.softmax = nn.Softmax(dim=-1)  # normalizes over the last axis, i.e. classes

# later, in the forward pass:
if self.attention:
    sof = self.dense_softmax(x)  # [bs, frames, nclass]
    if pad_mask is not None:
        sof = sof.masked_fill(pad_mask.transpose(1, 2), -1e30)  # mask attention
    sof = self.softmax(sof)
    sof = torch.clamp(sof, min=1e-7, max=1)
    weak = (strong * sof).sum(1) / sof.sum(1)  # [bs, nclass]

popcornell commented 2 years ago

This has always been the case: https://github.com/turpaultn/dcase20_task4/blob/master/baseline/models/CRNN.py @turpaultn, do you have any intuition?

My intuition is that if a sound event occurs for the whole clip, you don't want to compute the softmax across frames. Basically, the attention block looks at each frame and asks which class is most likely to occur; you then gate that with the strong posteriors and sum over the whole clip.
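The two normalization choices discussed in this thread can be compared on a single clip. The following is a minimal sketch in plain Python (not the repository code): `attention_pool` and its `axis` argument are hypothetical names, the clamp and padding mask from the snippet above are left out, and the inputs are nested lists of shape `[frames][nclass]`.

```python
import math


def softmax(values):
    """Numerically stable softmax over a flat list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(strong, logits, axis):
    """Pool frame-level scores into clip-level scores for one clip.

    strong, logits: nested lists of shape [frames][nclass].
    axis: "class" normalizes each frame across classes (dim=-1 in the snippet),
          "time" normalizes each class across frames (dim=1).
    """
    n_frames, n_class = len(logits), len(logits[0])
    if axis == "class":
        weights = [softmax(row) for row in logits]
    elif axis == "time":
        per_class = [softmax([logits[t][c] for t in range(n_frames)])
                     for c in range(n_class)]
        weights = [[per_class[c][t] for c in range(n_class)]
                   for t in range(n_frames)]
    else:
        raise ValueError("axis must be 'class' or 'time'")

    weak = []
    for c in range(n_class):
        num = sum(strong[t][c] * weights[t][c] for t in range(n_frames))
        den = sum(weights[t][c] for t in range(n_frames))  # equals 1 for axis="time"
        weak.append(num / den)
    return weak


# Made-up values for illustration: 3 frames, 2 classes.
strong = [[0.9, 0.1], [0.1, 0.1], [0.1, 0.8]]
logits = [[2.0, 0.0], [0.0, 0.0], [0.0, 3.0]]
print("softmax over classes:", attention_pool(strong, logits, "class"))
print("softmax over frames: ", attention_pool(strong, logits, "time"))
```

In both variants the clip-level score is a weighted average of the frame-level scores for that class; the variants differ only in how the weights are produced. With softmax over frames the denominator is always 1, so the division in the snippet only matters for the class-axis version.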

Moadab-AI commented 2 years ago

Applying softmax (i.e., normalizing) across classes actually hinders the model's ability to detect polyphony, as it promotes a single event per temporal segment.
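A small numeric sketch of the coupling described here, with made-up logits for one frame in which two classes are active at once. The independent sigmoid is only a reference point for contrast and is not something proposed in this thread. Note also that in the baseline snippet the weights are divided by their sum over time, so this per-frame competition is only part of the picture.

```python
import math


def softmax(values):
    """Numerically stable softmax over a flat list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(value):
    """Logistic function, applied to each class independently."""
    return 1.0 / (1.0 + math.exp(-value))


# Hypothetical attention logits for one frame and three classes;
# the first two classes are both strongly active (a polyphonic frame).
frame_logits = [5.0, 5.0, -5.0]

class_weights = softmax(frame_logits)              # coupled: sums to 1 across classes
independent = [sigmoid(v) for v in frame_logits]   # uncoupled reference

print("softmax over classes:", [round(w, 3) for w in class_weights])
print("independent sigmoids:", [round(s, 3) for s in independent])
```

Because the softmax weights must sum to 1 across classes, two simultaneously active classes each get a weight of just under 0.5 in that frame, while the independent reference lets both stay close to 1.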

Normalization across time for weighted pooling was, I think, originally proposed in this work:
https://arxiv.org/pdf/1610.01797.pdf

It was introduced to help detect short events, which can get lost with vanilla average pooling. It wouldn't much affect the detection of long events, since they will have multiple positively classified segments across the recording. Not to mention that, for long events, the softmax output could well be close to uniform across the width of the event. Frankly, I think it's just a mistake in choosing the dimension in the Softmax module.
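The short-event argument can be checked with a few lines of arithmetic. This is a sketch with made-up numbers for a single class: a 10-frame clip where the event is active in one frame only, and attention logits that are assumed to peak on that same frame.

```python
import math


def softmax(values):
    """Numerically stable softmax over a flat list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


n_frames = 10

# Hypothetical frame-level posteriors for one class: active in frame 3 only.
strong = [0.05] * n_frames
strong[3] = 0.95

# Hypothetical attention logits that peak on the same frame.
logits = [0.0] * n_frames
logits[3] = 4.0

# Vanilla average pooling: every frame counts equally.
mean_pooled = sum(strong) / n_frames

# Attention pooling with softmax across frames: weights sum to 1 over time.
time_weights = softmax(logits)
attn_pooled = sum(w * s for w, s in zip(time_weights, strong))

print("average pooling:  ", round(mean_pooled, 3))
print("attention pooling:", round(attn_pooled, 3))
```

Average pooling dilutes the single active frame to a clip-level score of 0.14, whereas the time-normalized weights keep the score above 0.8, provided the attention logits do in fact peak on the event.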

popcornell commented 2 years ago

Isn't attention pooling from https://arxiv.org/pdf/1711.00927.pdf ?

Moadab-AI commented 2 years ago

> Isn't attention pooling from https://arxiv.org/pdf/1711.00927.pdf ?

No, this one actually came later. In Section 2.3 they cite the JDC paper above:

"... Equation (3) has the same form as our joint detection-classification (JDC) model [14] and our attention model [6] proposed for audio tagging and sound event detection"

popcornell commented 2 years ago

Can you double-check whether they apply softmax over classes or over frames in this work, and also in [6], which seems closer to what is done in CRNN? [14] is a bit different from the CRNN implementation.

popcornell commented 2 years ago

It seems that softmax is applied over classes. Have you tried running the baseline code with softmax over frames instead?

turpaultn commented 2 years ago

Hi there, very sorry about the late answer. This is not an easy question. I've actually tried the attention on both dimensions. On the class axis:

Intuitively, I would say that frame attention is very good if you want to do some tagging (sound event recognition). However, if you want to do sound event detection, doing class attention helps.

Very happy to discuss it; it is not a simple question. I'm also happy to hear from other people who have thought about this and tried the different ways (maybe we could include @qiuqiangkong; I wouldn't be surprised if we already talked about this at some conference).

popcornell commented 2 years ago

It seems there is no agreement on this in the literature. To me, [6] uses class-wise softmax, but I am not sure. There is a previous work that uses frame-wise softmax, but I cannot find it anymore.

turpaultn commented 2 years ago

Did you get the answer to your question, @Moadab-AI?