MTG / essentia

C++ library for audio and music analysis, description and synthesis, including Python bindings
http://essentia.upf.edu
GNU Affero General Public License v3.0

Lost in the documentation - Get the current best model - classification head combination #1441

Open csipapicsa opened 1 month ago

csipapicsa commented 1 month ago

Hi,

I'm trying to identify the most accurate models for classifying (let's say) 'happy' moods in music. I've noticed some inconsistencies in the model listings from different years and I'm a bit confused about how to proceed.

From what I gathered, models listed in a 2020 post on the Essentia Labs website point to specific TensorFlow models for mood classification: 2020 TensorFlow Models Released - Essentia Labs

Additionally, I found a specific model for 'happy' mood classification detailed here: Mood Happy Classifier - Musicnn MSD

However, in a more recent listing from 2022 on the main Essentia models page, there seems to be an update or different models used: Essentia Models 2022 - Mood Happy

I also noticed that the same embedding model is used for different tasks, which is adding to my confusion:

embedding_model = TensorflowPredictVGGish(graphFilename="audioset-vggish-3.pb", output="model/vggish/embeddings")

Could someone clarify which models and classification heads are currently considered the most accurate for detecting moods like 'happy' in music? Any guidance on how to effectively select and use these models would be greatly appreciated. I want to use several models for my master's thesis, so any help would be welcome!

Thank you!

palonso commented 1 month ago

Hi @csipapicsa, according to internal metrics, the happy classifier based on discogs-effnet embeddings achieved higher performance than the others. Here is a code snippet to get predictions with this model:

from essentia.standard import MonoLoader, TensorflowPredictEffnetDiscogs, TensorflowPredict2D

# Load the audio at 16 kHz, the sample rate expected by the embedding model.
audio = MonoLoader(filename="audio.wav", sampleRate=16000, resampleQuality=4)()

# Compute Discogs-EffNet embeddings from the audio.
embedding_model = TensorflowPredictEffnetDiscogs(graphFilename="discogs-effnet-bs64-1.pb", output="PartitionedCall:1")
embeddings = embedding_model(audio)

# Run the mood_happy classification head on the embeddings.
model = TensorflowPredict2D(graphFilename="mood_happy-discogs-effnet-1.pb", output="model/Softmax")
predictions = model(embeddings)

Remember that to run this code, you must download the model files (*.pb) and set the graphFilename parameter accordingly.
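The head returns one row of activations per input patch, so you typically collapse them into a single clip-level score. A minimal sketch of that step, assuming a `[happy, non_happy]` column order (confirm the actual order against the model's metadata JSON):

```python
# Hedged sketch: average per-patch activations from TensorflowPredict2D
# into one clip-level "happy" probability. The [happy, non_happy] column
# order is an assumption -- check the model's metadata JSON.
from statistics import mean

def clip_level_happy(predictions):
    """predictions: list of [p_happy, p_non_happy] rows, one per patch."""
    return mean(row[0] for row in predictions)

# Made-up patch activations for illustration:
fake_predictions = [[0.8, 0.2], [0.6, 0.4], [0.7, 0.3]]
print(round(clip_level_happy(fake_predictions), 2))  # prints 0.7
```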

Additionally, note that some of our auto-tagging classifiers (MTG-Jamendo, MSD, MTT) also predict the happy tag. You can experiment with these predictions and choose the most suitable for your use case.
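With the auto-tagging heads, the 'happy' activation has to be looked up by its position in the model's tag list (the "classes" field of its metadata JSON). A small sketch of that lookup, with made-up tag names and activations:

```python
# Hedged sketch: pull one tag's score out of a multi-label auto-tagging
# output. The tag names and activations below are illustrative only; the
# real ordering comes from the model's metadata JSON.
from statistics import mean

def tag_probability(activations, classes, tag):
    """activations: per-patch rows of per-tag scores; classes: tag names."""
    idx = classes.index(tag)
    return mean(row[idx] for row in activations)

classes = ["happy", "sad", "relaxed"]            # assumed ordering
activations = [[0.9, 0.05, 0.3], [0.7, 0.1, 0.5]]
print(round(tag_probability(activations, classes, "happy"), 2))  # prints 0.8
```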

Best, Pablo.

csipapicsa commented 1 month ago

Hi @palonso,

Thanks for the answer! So are the models on this page the most up-to-date ones?

https://essentia.upf.edu/models.html

If I want to know which one is the best, I assume I need to check the metadata for each classifier to see their accuracy, right? Is there a summary page available for them?
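For instance, I imagined comparing heads with something like this sketch (the metadata field names and accuracy numbers here are made up, since I don't know the actual schema):

```python
# Hedged sketch: rank classifier heads by a metric read from their metadata
# JSON. The "evaluation" -> "accuracy" field names are an assumption; the
# real schema varies across essentia model families.
import json

def best_model(metadata_blobs):
    """metadata_blobs: {model_name: metadata-JSON string}."""
    scores = {name: json.loads(blob)["evaluation"]["accuracy"]
              for name, blob in metadata_blobs.items()}
    return max(scores, key=scores.get)

# Made-up metadata for illustration (not real reported accuracies):
blobs = {
    "mood_happy-discogs-effnet-1": '{"evaluation": {"accuracy": 0.87}}',
    "mood_happy-musicnn-msd-1": '{"evaluation": {"accuracy": 0.81}}',
}
print(best_model(blobs))
```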

Another question: Can the same embedding model be used for several tasks? For example, if I load "discogs-effnet-bs64-1.pb", do I only need to swap the classification head, which is usually quite light (around 500 KB)?
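This is the reuse pattern I have in mind, sketched with stand-in functions since the real essentia calls (TensorflowPredictEffnetDiscogs for the embeddings, one TensorflowPredict2D per head) need the downloaded .pb files:

```python
# Sketch of the pattern in question: run the heavy embedding model once,
# then feed the same embeddings to several lightweight heads. All functions
# here are stand-ins; with essentia, compute_embeddings would be
# TensorflowPredictEffnetDiscogs(...) and each head a TensorflowPredict2D(...).

def compute_embeddings(audio):
    # stand-in for the (expensive) embedding model
    return [sum(audio) / len(audio)]

def happy_head(embeddings):
    # stand-in for a mood_happy classification head
    return [e * 0.9 for e in embeddings]

def sad_head(embeddings):
    # stand-in for a mood_sad classification head
    return [e * 0.1 for e in embeddings]

audio = [0.0, 0.5, 1.0]
embeddings = compute_embeddings(audio)        # computed once
results = {name: head(embeddings)             # reused by every head
           for name, head in [("happy", happy_head), ("sad", sad_head)]}
print(results)
```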