fschmid56 / EfficientAT

This repository aims at providing efficient CNNs for Audio Tagging. We provide AudioSet pre-trained models ready for downstream training and extraction of audio embeddings.
MIT License

Is the Fname to index file correct? #7

Closed RicherMans closed 1 year ago

RicherMans commented 1 year ago

Hey Florian, I checked out your new fname_to_index file and trained some models, but performance is extremely bad.

Then I just printed some scores using your provided fname_to_index.pkl file and checked them against the ground truth.

I used this simple code to map your indices to the fnames:

import torch
import numpy as np
import pandas as pd

clmaps = pd.read_csv('./class_labels_indices.csv').set_index('index')['display_name'].to_dict()
data = np.load('./passt_enemble_logits_mAP_495.npy', allow_pickle=True)
fnames_to_idx = np.load('./fname_to_index.pkl', allow_pickle=True)

idx_to_fnames = {v: k for k, v in fnames_to_idx.items()}

for idx, fname in idx_to_fnames.items():
    values, idxs = torch.as_tensor(data[idx], dtype=torch.float32).sigmoid().topk(5)
    print(f" ==== {fname} ==== ")
    names = [clmaps[i] for i in idxs.numpy()]
    for score, clname in zip(values, names):
        print(f"{clname:<10} {score:<.3f}")

Some of the outputs are:

 ==== 09c885WMtMw ==== 
Animal     0.914
Dog        0.887
Domestic animals, pets 0.847
Bark       0.643
Bow-wow    0.293

The ground truth for that file, however, is "Music"; you can check the source at:

https://youtu.be/09c885WMtMw?t=80

Another sample is:

 ==== 09bFB0X-8QY ==== 
Speech     0.917
Female speech, woman speaking 0.598
Narration, monologue 0.406
Child speech, kid speaking 0.050
Inside, small room 0.028

Which can be viewed here: https://youtu.be/09bFB0X-8QY?t=16

I'm reasonably confident that your fname_to_index is somewhat wrong; could you maybe check whether that's the case?

EDIT:

Running the following assertion on the dicts from the code snippet above fails, which means there are duplicate indices:

assert len(fnames_to_idx) == len(idx_to_fnames)
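Inverting a dict silently drops keys whose values collide, which is why the assertion above catches the problem. A small self-contained sketch (the helper name is made up, not from the repo) that lists exactly which indices are assigned to more than one file:

```python
from collections import defaultdict

def find_duplicate_indices(fnames_to_idx):
    """Group file names by index; any group with more than one entry is a collision."""
    groups = defaultdict(list)
    for fname, idx in fnames_to_idx.items():
        groups[idx].append(fname)
    return {idx: fnames for idx, fnames in groups.items() if len(fnames) > 1}

# Toy example: two files accidentally share index 1.
mapping = {'a.wav': 0, 'b.wav': 1, 'c.wav': 1}
print(find_duplicate_indices(mapping))  # {1: ['b.wav', 'c.wav']}
```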

Kind Regards, Heinrich

fschmid56 commented 1 year ago

Hey Heinrich,

thanks for pointing this out so quickly. Indeed there was a problem in the fname_to_index.pkl file, which was caused by the concatenation of the balanced and unbalanced subsets of the dataset. I apologize for the inconvenience; the problem should be fixed now.
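The failure mode Florian describes can be illustrated with a minimal sketch (made-up function and variable names, not the repo's actual code): when two fname-to-index maps built independently are concatenated, both start counting from 0, so the second map's indices must be offset to stay unique.

```python
def merge_fname_maps(balanced, unbalanced):
    """Merge two fname->index maps, offsetting the second so indices stay unique."""
    offset = len(balanced)
    merged = dict(balanced)
    for fname, idx in unbalanced.items():
        merged[fname] = idx + offset
    return merged

balanced = {'bal_0.wav': 0, 'bal_1.wav': 1}
unbalanced = {'unbal_0.wav': 0, 'unbal_1.wav': 1}  # also starts at 0
merged = merge_fname_maps(balanced, unbalanced)

# Every file now maps to a distinct index.
assert len(set(merged.values())) == len(merged)
```

Without the offset, `unbal_0.wav` and `bal_0.wav` would both claim index 0, and logits would be looked up for the wrong files, which matches the mislabeled examples Heinrich found.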

Best, Florian

RicherMans commented 1 year ago

Hey Florian, thanks for the quick answer, the labels all look good now and I can train my models.

Best, Heinrich

fschmid56 commented 1 year ago

I would be interested in how well they work for you compared to what you were previously using for KD if you have some results.

RicherMans commented 1 year ago

Hey Florian, so I guess it works as advertised :).

I generally only use MobileNetV2 (with decision-level mean pooling) with 64 mels and a sampling rate of 16k, which is quite different from your teacher setting. Inspired by your paper, I also used some ViT models of my own, trained on exactly the same features as above. For comparison's sake, I also ran experiments with global average pooling (GAP) for MBv2. Results are:

| Model | Teacher | mAP |
| --- | --- | --- |
| MBv2-DM | None | 42.15 |
| MBv2-DM | My-ViT | 43.51 |
| MBv2-GAP | EfficientKD | 42.15 |
| MBv2-DM | EfficientKD | 43.53 |

So the results look alright given the small size of the MobileNetV2 model, so thanks for that.
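For readers unfamiliar with the two poolings compared above, here is a rough NumPy sketch of the difference (toy shapes, not the actual MobileNetV2 head): decision-level mean pooling applies the sigmoid per time segment and then averages the per-segment decisions, while a GAP-style head pools over time first and applies the nonlinearity once.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
segment_logits = rng.normal(size=(3, 5))  # 3 time segments, 5 classes (toy shapes)

# GAP-style: average over time first, then a single sigmoid.
gap_probs = sigmoid(segment_logits.mean(axis=0))

# Decision-level mean pooling: per-segment sigmoid first, then average decisions.
dm_probs = sigmoid(segment_logits).mean(axis=0)

# Both yield one probability per class, but the nonlinearity makes them differ.
assert gap_probs.shape == dm_probs.shape == (5,)
```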