fschmid56 / EfficientAT

This repository aims at providing efficient CNNs for Audio Tagging. We provide AudioSet pre-trained models ready for downstream training and extraction of audio embeddings.
MIT License

Tag audio at a higher resolution #3

Closed adbrebs closed 1 year ago

adbrebs commented 1 year ago

Thank you for your great work and sharing it!

Do you have any recommendations for using your models to label audio at a higher resolution, say 1 second or lower? Or even at the mel-frame level?

I've tried applying your models to short windows, but below 5 seconds the results deteriorate a lot (for 1 sec it seems to fail completely). I guess it's because the training AudioSet samples are ~10 seconds long.

I've also tried to modify the model to obtain frame-level predictions, but it seems that they all use the "mlp" head, and getting rid of the adaptive pooling would require a full retrain?

Thank you in advance!

fschmid56 commented 1 year ago

Hi,

you are right, the audio tagging performance deteriorates a lot if you try to label very short audio snippets. I would say this is to some extent natural, as fine-grained labeling of short audio is difficult. However, MobileNet also downsamples the input strongly (x32 for our models), since this saves a lot of computation. For instance, if you try to label a one-second audio snippet, the output of the conv. part before adaptive pooling will be of shape bs x channels x 4 x 4. With such small feature maps, padding seems to be responsible for the decreasing performance. If you just repeat the 1-second snippet 10 times, you get reasonable predictions again.
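For illustration, a minimal sketch of that repeat-the-snippet trick; "model" is a placeholder for one of the pre-trained models (assumed to be wrapped together with its mel front-end), and the sample rate and output format are assumptions to adapt:

import torch

def tag_short_snippet(model, waveform, sample_rate=32000, target_seconds=10.0):
    # Repeat a short 1-D mono waveform until it covers ~10 s, then tag it once.
    # "model" is assumed to take a (1, samples) tensor and return class logits;
    # adapt this if your wrapper returns a tuple of (logits, embeddings).
    n_target = int(target_seconds * sample_rate)
    repeats = -(-n_target // waveform.shape[-1])  # ceiling division
    tiled = waveform.repeat(repeats)[:n_target]
    with torch.no_grad():
        return model(tiled.unsqueeze(0)).squeeze(0)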

I will run experiments with less down-sampling and with fully-convolutional classification heads. This should make it easier to get predictions for shorter audio snippets. If it works out I'll add the new pre-trained models to the repository.

adbrebs commented 1 year ago

Ok thank you for your answer, this makes sense.

I will run experiments with less down-sampling and with fully-convolutional classification heads. This should make it easier to get predictions for shorter audio snippets. If it works out I'll add the new pre-trained models to the repository.

Great, thank you!

adbrebs commented 1 year ago

Hi @fschmid56, have you had time to give it a try by any chance? If not, I can give it a try, but I won't be as fast as you. Thank you!

Edit:

I will run experiments with less down-sampling and with fully-convolutional classification heads

To be clear, in my case it's not so much about tagging short ~1 sec files but rather about running sound event detection on a long file at high precision (say 1 s instead of 10 s). So I think that just retraining with fully-convolutional classification heads (without changing the down-sampling) would already be very helpful!
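For context, a rough sketch of the windowed inference I have in mind; "tag_clip" is a placeholder for whatever clip-level tagger is used, and the window/hop sizes are arbitrary choices:

import numpy as np

def sliding_window_sed(waveform, sr, tag_clip, win_s=10.0, hop_s=1.0):
    # Naive sound event detection: slide a clip-level tagger over a long recording.
    # "tag_clip" is assumed to return a (527,) vector of class probabilities.
    win, hop = int(win_s * sr), int(hop_s * sr)
    scores = [tag_clip(waveform[start:start + win])
              for start in range(0, max(1, len(waveform) - win + 1), hop)]
    return np.stack(scores)  # (num_windows, 527), one row per hop_s seconds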

fschmid56 commented 1 year ago

Yes, I already gave it a shot. Using the fully convolutional head and less down-sampling didn't work so well out of the box. I probably need to tweak the learning rate and other hyperparameters a bit. So far, I have needed the limited available compute for other things.

I started new experiments today switching only to the fully convolutional mode first and then reducing the down-sampling slowly in the following experiments.

To be clear, in my case it's not so much about tagging short ~1 sec files but rather about running sound event detection on a long file at high precision (say 1 s instead of 10 s). So I think that just retraining with fully-convolutional classification heads (without changing the down-sampling) would already be very helpful!

Let me understand that in more detail. If the network is given a 10 sec audio file, the feature maps before adaptive average pooling will be of size t=32 and f=4. Using a fully convolutional head, you will therefore get an output of size (c=527, f=4, t=32). If you feed a longer audio sequence, 't' scales up accordingly. Is it just about the convolutional head, or do you need 'f' and 't' in higher resolution? The latter would mean I have to reduce the strides (down-sampling) in the network. Currently, the input spectrogram is down-sampled by a factor of 32 (5 layers with a stride of 2). Do you need that down-sampling factor reduced as well?
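As a quick sanity check of the shapes above (the ~100 mel frames per second and 128 mel bins below are assumptions about the front-end, not necessarily the exact repo settings):

# Back-of-the-envelope check of the shapes above; frame rate and mel-bin
# count are assumptions, not necessarily the exact settings of this repo.
seconds = 10
frames = seconds * 100           # assuming ~100 mel frames per second
mels = 128                       # assumed number of mel bins
downsample = 2 ** 5              # 5 stages with stride 2 -> factor 32
print(frames // downsample, mels // downsample)  # ~31 time steps, 4 frequency bins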

adbrebs commented 1 year ago

Hi Florian, thank you for giving it a shot!

Is it just about the convolutional head, or do you need 'f' and 't' in higher resolution?

It is just about the convolutional head in my use case. I understand that padding might create issues at the beginning/end of a file (especially a short file) but it shouldn't be a big deal in my case since I deal with long recordings.

I've read your paper in detail; it's great work! I should have more time next week, and I hope I can dig deeper into your code.

PS: do you still have the weights of the fully-conv head models that you trained stored somewhere? I would be able to test them immediately. Otherwise don't worry, I will train some models next week.

fschmid56 commented 1 year ago

Hi Alexandre, thanks for the nice feedback!

I had some time and free resources today, so I started the experiments and I guess they will work out well this time. You can follow them if you like:

Experiments on W&B

I'll upload the weights as soon as they are finished.

fschmid56 commented 1 year ago

I've added two models to the GitHub releases: "mn10_as_fc_mAP_465.pt" and "mn10_as_fc_s2221_mAP_466.pt".

You should be able to run inference on them like this:

python inference.py --cuda --model_name=mn10_as_fc --audio_path="resources/metro_station-paris.wav" --head_type=fully_convolutional

python inference.py --cuda --model_name=mn10_as_fc_s2221 --audio_path="resources/metro_station-paris.wav" --head_type=fully_convolutional --strides 2 2 2 1

I will add more models next week. Also, I will attach the correct config for loading pre-trained models to the model name. This is currently a bit of a pitfall, e.g. if you forget to specify the strides argument when you try to load a model trained with modified strides.
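Until then, one purely illustrative way (not part of this repo) to avoid that pitfall would be to derive the strides from the checkpoint name itself, e.g. from a suffix like "s2221":

import re

def strides_from_name(model_name, default=(2, 2, 2, 2)):
    # Parse a strides suffix such as "s2221" out of a name like "mn10_as_fc_s2221_mAP_466".
    # Illustrative helper only, not actual repo code.
    match = re.search(r"_s(\d{4})", model_name)
    return default if match is None else tuple(int(c) for c in match.group(1))

print(strides_from_name("mn10_as_fc_s2221_mAP_466"))  # (2, 2, 2, 1)
print(strides_from_name("mn10_as_fc_mAP_465"))        # (2, 2, 2, 2)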

adbrebs commented 1 year ago

FYI:

I've taken the fully-conv models and removed the adaptive avg pooling along the time dimension. Unfortunately, they don't give good results at segmenting the file precisely.

For example:

https://user-images.githubusercontent.com/6939554/217709574-c427fb6b-110b-4947-9a22-961292f632c7.mp4

I guess it's probably due to the large receptive field of MobileNetV3. Do you happen to know its value by any chance?

fschmid56 commented 1 year ago

Yes, I guess this is because of the huge receptive field. For the standard 'mn10' model the receptive field spans ~26k pixels. Even if the effective receptive field is much smaller, it still spans multiple seconds of audio. I strongly assume that this is why the model detects speech and siren almost everywhere.

Have you tried the model with reduced strides? Do the detected events have a shorter span over time?

What would be a desired receptive field size for you? I could imagine training a model with reduced depth/kernel sizes.

adbrebs commented 1 year ago

Have you tried the model with reduced strides? Do the detected events have a shorter span over time?

Yes both give similar results.

Even if the effective receptive field is much smaller, it still spans multiple seconds of audio.

Ok it makes sense then. It would be nice to have a function to compute this effective receptive field given the architecture.
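Something along these lines is what I have in mind for the theoretical (not effective) receptive field; the layer list below is just an illustrative stack, not the actual mn10 configuration:

def receptive_field(layers):
    # layers: list of (kernel_size, stride) pairs along the time axis.
    # Standard recurrence: rf += (k - 1) * jump; jump *= stride.
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# Illustrative stack only, NOT the real mn10 layer list.
print(receptive_field([(3, 2)] * 5 + [(5, 1)] * 10))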

What would be a desired receptive field size for you? I could imagine training a model with reduced depth/kernel sizes.

Thank you for proposing, something around 0.5s or 1s would be great! Let me know if I can help.

In the meantime, out of curiosity, I'm going to try the less ideal approach you suggested earlier: taking 1 sec chunks and repeating them 10 times (not sure what the right amount is; ideally the receptive field) before feeding them to some models with the "mlp" head.

fschmid56 commented 1 year ago

The easiest solution I could think of is to set some kernel sizes to 1 in the function __mobilenet_v3conf in MobileNetV3.py and retrain on AudioSet. The next simplest thing is to remove entire blocks from the config. If you have some time and resources, you could try that. I'm currently busy with other stuff, so it will take me a bit, but I'm also planning to experiment with that.
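Purely as an illustration of that idea (the config format below is a generic inverted-residual setting list, not necessarily the exact structure in MobileNetV3.py):

# Illustrative only: shrink large kernels in a MobileNetV3-style inverted-residual
# config to reduce the temporal receptive field before retraining on AudioSet.
# Each tuple: (kernel, expanded_channels, out_channels, use_se, activation, stride).
inverted_residual_setting = [
    (3, 16, 16, True, "RE", 2),
    (3, 72, 24, False, "RE", 2),
    (5, 96, 40, True, "HS", 2),
    # ... remaining blocks elided ...
]

reduced_setting = [(1 if k == 5 else k, exp, out, se, act, s)
                   for (k, exp, out, se, act, s) in inverted_residual_setting]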

adbrebs commented 1 year ago

Makes sense, thank you for the suggestions! I'm also busy with other projects at the moment but will give it a try when I find time - I'll keep you posted if I get any good results.

RicherMans commented 1 year ago

Hey there @adbrebs @fschmid56, I stumbled upon this thread by chance and thought I might add some insight into the core problems of audio tagging at "finer" resolutions. First of all, I'd like to thank Florian @fschmid56 for this awesome work!

I'd like to comment on the following issues of imprecise time-stamps.

I've taken the fully-conv models and removed the Adaptative Avg Pooling along the time dimension. Unfortunately they don't give good results at segmenting precisely the file.

Yes, I guess this is because of the huge receptive field. For the standard 'mn10' model the receptive field spans ~26k pixels. Even if the effective receptive field is much smaller, it still spans multiple seconds of audio. I strongly assume that this is why the model detects speech and siren almost everywhere. What would be a desired receptive field size for you? I could imagine training a model with reduced depth/kernel sizes.

That is to be expected for these types of models, since they are not trained to provide precise time-stamps, due to their design (not necessarily due to the receptive field). The main problem with the provided CNNs is the adaptive 2d avg-pool operation employed. Here Florian merges all subsampled time × frequency features, i.e. of size T/32 × F/32, into a single embedding that is then sent to the classifier. This is a standard procedure for most image-classification models, since one does not want the model to "learn" the position of a specific object in the image; it shouldn't matter anyway.

However, for audio classification, this 2d pooling operation is less reasonable, since the model is trained to correlate, say, low-frequency information from frame 0 with high-frequency information from, say, the frame at 10 s. Thus it can happen that the model actually "confuses" high-frequency information with low-frequency information, and probabilities are "smeared out" over the entire time duration, which can be observed in this post (check out Foghorn and Bass guitar):

FYI:

I've taken the fully-conv models and removed the adaptive avg pooling along the time dimension. Unfortunately, they don't give good results at segmenting the file precisely.

For example: tags.mp4

I guess it's probably due to the large receptive field of MobileNetV3. Do you happen to know its value by any chance?

The overall 2d pooling works as long as your training and testing durations are somewhat similar, and it might be better for achieving a "higher mAP".

To this end, I'd like to advocate our previous work here, pseudo strong labels (PSL), since we encountered the exact same problem. Thus, at least for us, we generally avoid doing global average pooling and always prefer "decision-level" pooling, i.e., first average over the frequency dimension, then obtain the per-frame probabilities, and then average these over time, such that your model never correlates high-frequency and low-frequency information from completely different time frames. I used my simple MobileNet trained in the described fashion and obtained the following result on your sample:

[plot: per-frame top-3 class probabilities on the sample]

The code for this picture is:

# Fetch the PSL model definition and the AudioSet label map first (shell):
wget https://raw.githubusercontent.com/RicherMans/PSL/main/src/models.py
wget http://storage.googleapis.com/us_audioset/youtube_corpus/v1/csv/class_labels_indices.csv

import seaborn as sns
import matplotlib.pyplot as plt
import models
import numpy as np
import pandas as pd
import librosa
import torch

# Map AudioSet class indices to human-readable names.
maps = pd.read_csv('class_labels_indices.csv', sep=',').set_index('index')['display_name'].to_dict()

# Load the sample from this thread at 16 kHz mono.
data, sr = librosa.load('./217709574-c427fb6b-110b-4947-9a22-961292f632c7.mp4', sr=16000)

# Decision-level-pooled MobileNetV2 from the PSL repo with pre-trained weights.
mdl = models.MobileNetV2_DM()
mdl_state = torch.hub.load_state_dict_from_url('https://zenodo.org/record/6003838/files/mobilenetv2_mAP40_53.pt?download=1')
mdl.load_state_dict(mdl_state)

mdl.eval()
# The model returns clip-level and per-frame (time) predictions.
with torch.no_grad():
    y, y_time = mdl(torch.as_tensor(data).unsqueeze(0))
y_time = y_time.squeeze(0)

# Top-3 classes per output frame; frames are spaced 0.32 s apart in the plot below.
idxs = y_time.topk(3).indices.numpy()
scores = y_time.topk(3).values.numpy()
time_arr = np.arange(0, data.shape[-1] / sr, 0.32)

res = []
for i in range(len(idxs)):
    names = [maps[f] for f in idxs[i]]
    for j in range(len(names)):
        res.append({'score': scores[i][j], 'name': names[j], 'time': time_arr[i]})

r = pd.DataFrame(res)
r['name'] = r['name'].astype('category')

plt.figure(figsize=(14, 8))
sns.lineplot(data=r, x='time', y='score', hue='name')
plt.show()

Hope that I can help!

fschmid56 commented 1 year ago

Hey @RicherMans,

thanks for the additional input on this matter!

However, for audio classification, this 2d pooling operation is less reasonable since the model is trained to correlate say low-frequency information from frame 0 and high-frequency information from say the frame at 10s. Thus it can happen that the model actually "confuses" high-frequency information with low-frequency one and probabilities are "smeared out" over the entire time duration which can be observed in this post ( check out Foghorn and Bass guitar):

I do understand that it is problematic to mix low- and high-frequency information in general. Even applying the same conv. kernels to high- and low-freq. regions is not well justified in my opinion, as objects in images are position-invariant, while this might not hold for patterns along the frequency dimension. What is not so obvious to me right now is why it is especially problematic if you average over time, and how this problem causes smeared-out probabilities. If I train models with global channel pooling and have a limited receptive field, I should still get valid time information if I don't do the pooling over time at inference time, no?

I will definitely look deeper into your paper and the code as soon as I have time!

I used my simple mobilenet trained in the described fashion and obtained the following result on your sample

Have you tried using the KD approach from this repo together with decision-level pooling?

RicherMans commented 1 year ago

I do understand that it is problematic to mix low- and high-level frequency information in general. Even applying the same conv. kernels to high- and low-freq. regions is not well justified in my opinion as objects in images are position-invariant, while this might not hold for patterns along the frequency dimension. What is not so obvious to me right now, is why it is especially problematic if you average that over time and how this problem causes smeared-out probabilities.

To be honest, I also thought like that before, but during my research for PSL, I initially trained models with global average pooling and obtained very, very wrong results for sub-10 s resolutions (like high probabilities for, say, cat meowing, even though there is only water in the audio clip).

If I train models with global channel pooling and have a limited receptive field, I should still get valid time information if I don't do the pooling over time at inference time, no?

As far as I understand it, since you pool your features over time-frequency, the resulting embedding lives in a joint time-frequency space, not an independent time/frequency space. This means you can't simply expect, at inference time, that this space can be disentangled into time and frequency (like a spectrogram). If you do decision-level pooling, the pooled (frequency) embeddings all live in the same frequency-only space. These are then pooled over time, so there is no "mix-up" of time and frequency information in your embeddings, which also allows them to be used to predict sub-scale (like 1 s) audio tags.
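To make the distinction concrete, here is a minimal sketch of the two pooling orders (the feature shapes and the linear classifier are placeholders, not either repo's actual code):

import torch

def global_pool_then_classify(feats, classifier):
    # feats: (batch, channels, freq, time); classifier: nn.Linear(channels, 527).
    # 2d average pooling merges time and frequency before classification,
    # so per-frame information is gone once probabilities are computed.
    return classifier(feats.mean(dim=(2, 3))).sigmoid()                  # (batch, 527)

def decision_level_pool(feats, classifier):
    # Average over frequency only, classify every time frame, then average
    # the per-frame probabilities over time ("decision-level" mean pooling).
    per_frame = classifier(feats.mean(dim=2).transpose(1, 2)).sigmoid()  # (batch, time, 527)
    return per_frame.mean(dim=1), per_frame                              # clip-level, frame-level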

From my point of view, your embeddings are likely to be somewhat superior to time/frequency-independent ones, since they contain more information. For training some other downstream tasks, like in HEAR, it would seem to me that your embeddings should be better, but if you one day need to do audio tagging for a real-world application with a shorter response time than 10 s (which is super long, btw), then try decision-level-pooled approaches.

Have you tried using the KD approach from this repo together with decision-level pooling?

I surely tried, but not with your provided code or pretrained ensemble weights. My own baseline with ImageNet pretraining for a decision-mean-pooled MobileNetV2 is at 42.15 mAP, with 64 mels and a sampling rate of 16 kHz. With your proposed approach and some ViT teacher models I can get up to ~43.51 so far, but I might get better results just by training longer, so thanks for the work @fschmid56! I actually would like to use your logits, but I can't, since there are no "filenames" provided in your saved object. If possible, could you provide the corresponding filenames or YouTube IDs of each element in your saved object?

Thanks again!

fschmid56 commented 1 year ago

Okay, thanks for the input, I'll definitely have a closer look at this in the near future. In general, I would like to experiment more with audio-specific architectural components as I still find it a bit frustrating that vision architectures work so well out of the box without significant adaptation to audio. Decision-level pooling is now definitely on my list.

I actually would like to use your logits, but I can't since there are no "filenames" provided in your saved object. If possible could you provide the corresponding filenames or youtube-ids of each element in your saved object?

This is already the third request regarding this. I'll put it on top of my list for after 20th Feb. (current Eusipco deadline).

RicherMans commented 1 year ago

Okay, thanks for the input, I'll definitely have a closer look at this in the near future. In general, I would like to experiment more with audio-specific architectural components as I still find it a bit frustrating that vision architectures work so well out of the box without significant adaptation to audio. Decision-level pooling is now definitely on my list.

Agreed, it's a bit of a problem that many architectures from vision do not directly work.

This is already the third request regarding this. I'll put it on top of my list for after 20th Feb. (current Eusipco deadline).

Thanks, and good luck with that conference! Maybe we will see each other at ICASSP in June :)

fschmid56 commented 1 year ago

I've uploaded the file fname_to_index, which contains a dict mapping the file IDs to the indices in the predictions file. I tried to make it compatible with the IDs provided in the official csv files.
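A rough usage sketch (the pickle format, the predictions filename, and the file ID below are assumptions/placeholders, not verified specifics):

import pickle
import torch

# Assumption: fname_to_index is a pickled {file_id: row_index} dict and the
# teacher predictions load as an array-like with one row of logits per clip.
with open("fname_to_index", "rb") as f:
    fname_to_index = pickle.load(f)

teacher_logits = torch.load("teacher_predictions.pt")   # hypothetical filename
row = fname_to_index["some_audioset_file_id"]           # hypothetical file ID
clip_logits = teacher_logits[row]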

We maybe will see each other in June in ICASSP :)

For sure! :-)

adbrebs commented 1 year ago

Hi @RicherMans, thank you for taking the time to write and for sharing some insights!

In my use case, I would need a resolution of around 0.5-1 s, so a receptive field of ~1 sec max. I think my best bet is to slightly change @fschmid56's MobileNet architecture to reduce the receptive field (and remove the global avg pooling) and retrain it with @fschmid56's Transformer KD.

Unfortunately, I'm having a hard time downloading the data (the PaSST scripts fail). By any chance, does one of you have it stored somewhere?

fschmid56 commented 1 year ago

Hi @adbrebs,

as far as I know, we are not allowed to distribute AudioSet because of possible copyright issues (this is why AudioSet is available as a set of URLs to download it yourself). I can only tell you that we got AudioSet by using the instructions in the PANNs repo. I hope this somehow helps.

Best, Florian

adbrebs commented 1 year ago

Ok thank you @fschmid56. I've managed to get it.

RicherMans commented 1 year ago

Hey @adbrebs, just for the record here, and as an "advertisement" for some of our work: I recently released the source for our Streaming Audio Transformers (SAT).

The goal of that work is to further improve "high resolution" performance while also being capable of tracking long-range events. For example, if you deploy a tagger on a static web camera and want to notify a user that they forgot to turn off the water faucet, you need to track the sound event over a prolonged period of time, instead of only, say, 2 s or 10 s.

As a side feature, SAT models can track events somewhat effectively down to a very small delay of 160 ms. I again used the sample provided above and ran SAT_T_1s (5M params, streamable, mAP ~40.x) with chunks of 160 ms, 320 ms and 480 ms. I got the following top-1 results:

480 ms: [plot_sat_t_1s_chunk480ms, with per-class scores]

320 ms: [plot_sat_t_1s_chunk320ms, with per-class scores]

160 ms: [plot_sat_t_1s_chunk160ms, with per-class scores]