dberghi / AV-SELD

Python implementation of the paper "Fusion of Audio and Visual Embeddings for Sound Event Localization and Detection"
MIT License

How to set the learning rate, and whether the visual coding contains a conformer structure #2

Closed: lws1234 closed this issue 1 month ago

lws1234 commented 6 months ago

Hello, thank you very much for the code; it has been an important help in my study of sound event localization and detection. I have read your paper and the code implementation, and I have a few questions. You mention in the paper that the learning rate is a floating value. I set it to 0.0001 but have not been able to reproduce the results in your paper, so I would like to know what value you used for each system. Also, I did not find a Conformer structure in the visual encoder in your code, even though the paper describes a ResNet-Conformer. We look forward to hearing from you, and I would appreciate a reply.

dberghi commented 6 months ago

Hi there, many thanks for your interest in our work!

Here is how I set the learning rate for each system and the best epoch:

- Vis: I3D, fusion: CMAF -> lr=0.00001, epoch 18
- Vis: I3D, fusion: AV-Conf -> lr=0.00003, epoch 5
- Vis: ResNet-Conf, fusion: CMAF -> lr=0.00008, epoch 9
- Vis: ResNet-Conf, fusion: AV-Conf -> lr=0.00005, epoch 5
- Vis: Both, fusion: AV-Conf -> lr=0.00005, epoch 8

Each system used the CNN-Conformer as the audio encoder. I trained all systems for 50 epochs and then selected the best epoch. However, as you can see, the best results are often achieved within the first 10 epochs, so you can train for fewer epochs and save time.
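For reference, here is a minimal sketch (not the repository's exact training script) of how one of the learning rates above would be passed to the optimizer; the model is a hypothetical stand-in:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the AV-SELD model; replace with the real model instance.
model = nn.Linear(512, 52)

# lr taken from the list above, e.g. ResNet-Conf + AV-Conf -> 0.00005.
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)
```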

You are totally right about the ResNet-Conformer! When I polished the code before releasing it, I omitted the Conformer module that, for the results in the paper, was applied after ResNet50. I have now updated and corrected the code. Apologies for that, and thank you for spotting it! The ResNet50 visual features are extracted and stored during the pre-processing phase; the Conformer module is then applied during training in models/AV_SELD_model.py (if you have set 'resnet' as visual_encoder_type in the configuration file).
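For illustration, a minimal sketch (not the repository's exact module) of applying a Conformer over pre-extracted ResNet50 visual features during training; the feature shape and hyperparameters are assumptions, and torchaudio's Conformer is used purely as an example implementation:

```python
import torch
from torchaudio.models import Conformer

# Hypothetical feature shape: (batch, time, feat_dim) ResNet50 features loaded from the h5 file.
batch, time, feat_dim = 4, 50, 512
visual_feats = torch.randn(batch, time, feat_dim)
lengths = torch.full((batch,), time)  # all sequences have the same length here

# Conformer applied on top of the stored visual features (hyperparameters are illustrative).
conformer = Conformer(
    input_dim=feat_dim,
    num_heads=8,
    ffn_dim=1024,
    num_layers=2,
    depthwise_conv_kernel_size=31,
)
encoded, _ = conformer(visual_feats, lengths)  # (batch, time, feat_dim)
print(encoded.shape)
```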

Because I polished and optimized the code before releasing it, you might not get identical results. However, you should be able to achieve similar performance.

Hope this helps :)

xueshuggbond commented 6 months ago

Hello, thank you very much for your code; it is of great help to me in studying sound event localization and detection. I have read your paper and the code implementation, and I have a question. My h5 file reads very slowly. Is there something I didn't handle well? After preprocessing, the amount of data reaches an astonishing 500 gigabytes. Is there something wrong with my processing? My CPU memory is 32 GB, so processing is still very slow.

dberghi commented 5 months ago

Hello and sorry for the late reply.

Yes, I confirm that the features will require about 500GB of free space. Yesterday, I tried to recompute the audio and visual features (with resnet as a visual encoder) and it requires 493GB (train_dataset.h5 is 480GB). The reason for such a large output is that the features are stored after the audio-visual channel swap augmentation is applied. So, the dataset size is increased by a factor of 8. The advantage is that the transformations are performed only once and the training will then run way faster! If you had to compute the visual transformation as you train, it would require a lot of time.
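To make sure a file that large is never pulled into RAM at once, here is a minimal sketch (the dataset names inside the h5 file are assumptions, not the repository's actual loader) of reading items lazily with h5py, so only the indexed chunk is loaded per sample:

```python
import h5py
import torch
from torch.utils.data import Dataset


class H5FeatureDataset(Dataset):
    """Reads one item at a time from a large HDF5 file instead of loading it into memory."""

    def __init__(self, h5_path, feature_key="features", label_key="labels"):
        self.h5_path = h5_path
        self.feature_key = feature_key  # hypothetical dataset names inside the h5 file
        self.label_key = label_key
        self.h5 = None
        # Open briefly just to read the number of samples.
        with h5py.File(h5_path, "r") as f:
            self.length = f[feature_key].shape[0]

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        # Open lazily so each DataLoader worker holds its own file handle.
        if self.h5 is None:
            self.h5 = h5py.File(self.h5_path, "r")
        x = torch.from_numpy(self.h5[self.feature_key][idx])
        y = torch.from_numpy(self.h5[self.label_key][idx])
        return x, y
```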

Let me make sure I understand your problem: are you trying to train on a CPU? Don't you have a GPU?