Labbeti / conette-audio-captioning

CoNeTTE: An efficient Audio Captioning system leveraging multiple datasets with Task Embedding
https://arxiv.org/pdf/2309.00454.pdf

Model is not learning on Clotho datasets #5

Closed MoayedHajiAli closed 5 months ago

MoayedHajiAli commented 5 months ago

Hello,

Thank you for the great work and implementation. I followed the README instructions exactly to train on the Clotho dataset with conette-train expt=[clotho_cnext_bl] pl=baseline. The only difference is that I specified cnext_bl_path=$HOME/.cache/torch/hub/checkpoints/convnext_tiny_465mAP_BL_AC_70kit.pth, which is the checkpoint downloaded when running prepare.

However, the metrics are much worse: the FENSE score is 0.230, compared with 0.516 reported in the paper, and the SPIDEr score is 0.085, compared with 0.301. Do you know what might be the issue?

Thank you!

Labbeti commented 5 months ago

Hi! Thanks for reporting this. Something seems wrong with the pretrained Convnext model; maybe the downloaded file is wrong. Did you run the AAC model for 400 epochs with the default hyperparameters?
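One way to rule out a corrupted or wrong download is to hash the checkpoint file and compare it with a reference digest. The following is a minimal sketch, assuming a reference SHA-256 digest is available from wherever the checkpoint is published; this thread does not give one, and the function names are hypothetical rather than part of conette.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large checkpoints do not fill memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checkpoint_matches(path: Path, expected_sha256: str) -> bool:
    """Compare the file's digest with a reference digest (case-insensitive)."""
    return sha256_of_file(path) == expected_sha256.strip().lower()
```

A mismatch would point to a bad download; a match would suggest the problem lies elsewhere, such as in how the file is loaded.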

MoayedHajiAli commented 5 months ago

Hello, thank you for your prompt response. Yes, I ran the AAC model with the default hyperparameters for 400 epochs. I will test with your fix and let you know.

MoayedHajiAli commented 5 months ago

Hello, I have tried training again on the dev branch after your fix. Unfortunately, the problem is still there. Both the validation and training losses are decreasing, yet they remain high, and therefore the metrics are poor. I look forward to hearing your thoughts on this. Thank you for your help.

Labbeti commented 5 months ago

Sorry, the commit only fixed an error occurring while loading Convnext, and I forgot you would have this message here. The problem hasn't been fully resolved yet, and I am still looking into it.

MoayedHajiAli commented 5 months ago

No worries. Thank you very much for your help; I am looking forward to the fix.

MoayedHajiAli commented 5 months ago

Hello, do you have any updates on this? Also, do you know if the issue affects only the baseline model, or the CoNeTTE model too (e.g., training CoNeTTE on AC+CL+MA+WC, specialized for CL)?

Labbeti commented 5 months ago

The problem is linked to the CNext model not loading correctly, and the wrong checkpoint being loaded. This means that the pre-processed audio features in the HDF files are invalid, which has an impact on the CNext-trans and CoNeTTE models. I have fixed the loading, and I am now checking whether the audio features are properly calculated.
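A checkpoint that does not load correctly often shows up as a mismatch between the parameter names the model expects and those stored in the file. The sketch below illustrates that check in plain Python; it is not the fix applied in the repository, and the parameter names are hypothetical. With PyTorch, the analogous information is returned by `Module.load_state_dict(..., strict=False)` as `missing_keys` and `unexpected_keys`.

```python
from typing import Iterable, List, NamedTuple


class KeyDiff(NamedTuple):
    missing: List[str]     # expected by the model, absent from the checkpoint
    unexpected: List[str]  # present in the checkpoint, unknown to the model


def diff_keys(model_keys: Iterable[str], checkpoint_keys: Iterable[str]) -> KeyDiff:
    """Report parameter names that appear on only one side."""
    model_set, ckpt_set = set(model_keys), set(checkpoint_keys)
    return KeyDiff(sorted(model_set - ckpt_set), sorted(ckpt_set - model_set))


# Hypothetical parameter names, for illustration only.
model_keys = ["stem.weight", "stem.bias", "head.weight"]
checkpoint_keys = ["module.stem.weight", "module.stem.bias", "head.weight"]

diff = diff_keys(model_keys, checkpoint_keys)
if diff.missing or diff.unexpected:
    print("checkpoint does not match model:", diff)
```

If such a mismatch is silently ignored, the affected layers keep their random initialization, and any features extracted with that model and cached to disk would need to be recomputed after the loading is fixed.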

MoayedHajiAli commented 5 months ago

Hello, thank you very much. Yes, this seems to fix the issue. I can confirm that training and testing on CL gives a FENSE score of 0.508, compared with 0.516 in the paper, and a SPIDEr score of 0.296, compared with 0.301 in the paper, which is very close.

Thank you again. I am closing the issue for now.
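To put the scores quoted in this thread side by side, the shortfall relative to the paper can be computed directly. This is a small arithmetic sketch using only the numbers reported above; the helper name is hypothetical.

```python
def relative_gap(obtained: float, reported: float) -> float:
    """Fraction by which `obtained` falls short of `reported`."""
    return (reported - obtained) / reported


# (obtained, reported in the paper), as quoted in this thread.
scores = {
    "FENSE before fix": (0.230, 0.516),
    "SPIDEr before fix": (0.085, 0.301),
    "FENSE after fix": (0.508, 0.516),
    "SPIDEr after fix": (0.296, 0.301),
}

for name, (obtained, reported) in scores.items():
    print(f"{name}: {relative_gap(obtained, reported):.1%} below the paper")
```

With these numbers, the first run was roughly 55% (FENSE) and 72% (SPIDEr) below the paper, while the run after the fix is within about 2% on both metrics.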