This has always been the case: https://github.com/turpaultn/dcase20_task4/blob/master/baseline/models/CRNN.py @turpaultn, do you have any intuition?
My intuition is that if you have a sound event that occurs for the whole clip, you don't want to compute the softmax across frames. Basically, the attention block looks at each frame and asks which class is most likely to occur there; you then gate that with the strong posteriors and sum over the whole clip.
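For concreteness, here is a minimal sketch of that kind of class-axis attention pooling, assuming frame-level features of shape (batch, frames, features) and posteriors of shape (batch, frames, classes); the layer names and sizes are illustrative, not the exact ones from the baseline:

```python
import torch
import torch.nn as nn

# Illustrative shapes only: 4 clips, 156 frames, 64 features, 10 classes.
batch, frames, nfeat, nclass = 4, 156, 64, 10
features = torch.randn(batch, frames, nfeat)

strong_head = nn.Linear(nfeat, nclass)  # frame-level (strong) posteriors
att_head = nn.Linear(nfeat, nclass)     # attention logits

strong = torch.sigmoid(strong_head(features))       # (batch, frames, nclass)
# Softmax over the last (class) dimension: at each frame, "which class is most likely here?"
att = torch.softmax(att_head(features), dim=-1)     # (batch, frames, nclass)
att = torch.clamp(att, min=1e-7, max=1.0)

# Gate the strong posteriors with the attention weights and pool over the clip.
weak = (strong * att).sum(dim=1) / att.sum(dim=1)   # (batch, nclass)
```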
Applying softmax / normalizing across classes actually hinders the model's ability to detect polyphony, since it promotes a single event per temporal segment.
Normalization across time for weighted pooling, which I think was originally proposed in this work:
https://arxiv.org/pdf/1610.01797.pdf
was introduced to help the detection of short events, whose detection may get lost with vanilla average pooling. Otherwise it would not affect the detection of long events much, since those will have multiple positively classified segments across the recording. Not to mention that, for long events, the output of the softmax could just as well be near-uniform across the width of the event, with values close to each other. Frankly, I think it is just a mistake in choosing the correct dimension in the Softmax module.
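For comparison, here is a minimal sketch of the time-axis (per-class) weighted pooling described above, under the same assumed shapes; with the softmax taken over frames, the few high-scoring frames of a short event carry most of the weight instead of being washed out by a plain average over the clip:

```python
import torch
import torch.nn as nn

batch, frames, nfeat, nclass = 4, 156, 64, 10  # illustrative shapes
features = torch.randn(batch, frames, nfeat)

strong_head = nn.Linear(nfeat, nclass)
att_head = nn.Linear(nfeat, nclass)

strong = torch.sigmoid(strong_head(features))     # (batch, frames, nclass)
# Softmax over the time dimension: for each class, a distribution over frames
# saying *where* in the clip that class is active.
att = torch.softmax(att_head(features), dim=1)    # sums to 1 over frames, per class
weak = (strong * att).sum(dim=1)                  # (batch, nclass), already normalized

# Plain average pooling, for contrast: a 1-second event in a 10-second clip
# contributes only ~10% of the average, which is what the weighting is meant to fix.
weak_avg = strong.mean(dim=1)
```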
Isn't attention pooling from https://arxiv.org/pdf/1711.00927.pdf ?
No, this one actually came later. In Section 2.3 they cite the JDC paper above:
"... Equation (3) has the same form as our joint detection-classification (JDC) model [14] and our attention model [6] proposed for audio tagging and sound event detection."
Can you double-check whether they apply the softmax over the classes or over the frames in this work, and also in [6], which seems closer to what is done in the CRNN? [14] is a bit different from the CRNN implementation.
Seems that the softmax is applied over classes. Have you tried running the baseline code with the softmax over frames instead?
Hi there, very sorry about the late answer. This is not an easy question. I've actually tried the attention on both dimensions. On the classes axis:
Intuitively, I would say that frame attention is very good if you want to do some tagging (sound event recognition). However, if you want to do sound event detection, doing class attention helps.
Very happy to discuss it; it is not a simple question, and I am also happy to discuss with other people who have thought about and tried the different approaches. (Maybe we could include @qiuqiangkong; I wouldn't be surprised if we already talked about this at some conferences.)
Seems that in the literature there is no agreement on this. [6] uses class-wise softmax as far as I can tell, but I am not sure. There is a previous work that uses frame-wise softmax, but I cannot find it anymore.
Did you get the answer to your question, @Moadab-AI?
Just realized something in the base model: https://github.com/DCASE-REPO/DESED_task/blob/master/desed_task/nnet/CRNN.py
Could you please elaborate on why, in the attention block, you apply the softmax across the classes dimension? I believe the softmax here should be applied across the time axis.
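For reference, the attention head in question boils down to something like the following paraphrase (not the exact repo code, and the sizes are illustrative); on a tensor of shape (batch, frames, classes), `dim=-1` normalizes across classes at each frame, whereas normalizing across the time axis would be `dim=1`:

```python
import torch.nn as nn

nb_in, nclass = 256, 10  # illustrative sizes, not the baseline's actual values

dense = nn.Linear(nb_in, nclass)          # strong (frame-level) predictions
dense_softmax = nn.Linear(nb_in, nclass)  # attention logits
softmax = nn.Softmax(dim=-1)              # dim=-1: normalized across classes at each frame
# softmax = nn.Softmax(dim=1)             # dim=1 would instead normalize across frames (time)
```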