MTG / essentia

C++ library for audio and music analysis, description and synthesis, including Python bindings
http://essentia.upf.edu
GNU Affero General Public License v3.0

A how-to is needed for working with Loudness (and other algos) #1329

Closed dgoldenberg-audiomack closed 7 months ago

dgoldenberg-audiomack commented 1 year ago

Could someone please suggest 'the right way' to use Loudness?

I've tried 4 different combos of EasyLoader with Loudness and none of them work, same with MonoLoader.

# --- from essentia.standard import EasyLoader
# --- from essentia.streaming import Loudness
# OR
# --- from essentia.standard import EasyLoader, Loudness

[   INFO   ] MusicExtractorSVM: no classifier models were configured by default
Traceback (most recent call last):
  File "loudness.py", line 19, in <module>
    loader.audio >> loudness.signal
AttributeError: 'Algo' object has no attribute 'audio'

------------------------------------------------------------------------------------------

# --- from essentia.streaming import EasyLoader
# --- from essentia.standard import Loudness

Traceback (most recent call last):
  File "loudness.py", line 20, in <module>
    loader.audio >> loudness.signal
AttributeError: 'Algo' object has no attribute 'signal'

------------------------------------------------------------------------------------------

# --- from essentia.streaming import EasyLoader, Loudness
Traceback (most recent call last):
  File "loudness.py", line 19, in <module>
    loader.audio >> loudness.signal
  File "/home/airflow/.local/lib/python3.7/site-packages/essentia/streaming.py", line 60, in __rshift__
    right.input_algo, right.name)
TypeError: While connecting EasyLoader::audio to Loudness::signal:
Error when checking types. Expected: std::vector<Real>, received: Real

The code is super simple, just along the lines of:

import sys

import essentia
from essentia.streaming import EasyLoader, Loudness

if len(sys.argv) == 2:
    infile = sys.argv[1]
else:
    print("usage: %s <input audio file>" % sys.argv[0])
    sys.exit()

# initialize algorithms we will use
loader = EasyLoader(filename=infile)
loudness = Loudness()

# use pool to store data
pool = essentia.Pool()

loader.audio >> loudness.signal
loudness.loudness >> (pool, "loudness")

# network is ready, run it
essentia.run(loader)

print("loudness : " + pool["loudness"])

I'm running essentia==2.1b6.dev858. The input WAV file is attached. Thanks.

sample.wav.zip

palonso commented 1 year ago

Hi @dgoldenberg-audiomack,

The problem is that Loudness expects a stream of vector_real instead of real, which is the output of EasyLoader. You can use RealAccumulator to compute the loudness of the entire signal at the end of the stream.

import sys

import essentia
from essentia.streaming import EasyLoader, Loudness, RealAccumulator

if len(sys.argv) == 2:
    infile = sys.argv[1]
else:
    print("usage: %s <input audio file>" % sys.argv[0])
    sys.exit()

# initialize algorithms we will use
loader = EasyLoader(filename=infile)
loudness = Loudness()
accumulator = RealAccumulator()

# use pool to store data
pool = essentia.Pool()

loader.audio >> accumulator.data
accumulator.array >> loudness.signal
loudness.loudness >> (pool, "loudness")

# network is ready, run it
essentia.run(loader)
print("loudness : ", pool["loudness"])

Alternatively, you can check our Python example using LoudnessEBUR128, which provides a loudness estimation that is more correlated with human perception and is widely used in the audio/music industry.

dgoldenberg-audiomack commented 1 year ago

Hi @palonso,

Thanks for your quick, comprehensive response. A noob to Essentia here :)

The problem is that Loudness expects a stream of vector_real instead of real, which is the output of EasyLoader.

Understood; had similar issues with MonoLoader. As a novice, I would just make a suggestion, which is that the framework and its doc set are rather vast, and finding the relevant usable sample code is not always easy and apparent.

For example, if you're just looking at https://essentia.upf.edu/reference/std_Loudness.html, it doesn't have a link to a quick useful example of the kind that you just provided. The same seems true of other doc pages under https://essentia.upf.edu/reference/ as well. So I'd venture a proposal to add coding snippets on all reference doc pages.

check our Python example using LoudnessEBUR128 which provides a loudness estimation that is more correlated with human perception and is widely used in the audio/music industry.

Thank you for that reference. Looking at the outputs of that algo:

If I wanted to come up with a single metric of how loud a music sample is, would you recommend that I pick one of these metrics? i.e. how does one tell, by looking at these outputs whether something is loud, quiet, or in between? e.g. a metric from 0 to 10, 10 being 'definitely loud'?

Thanks!

palonso commented 1 year ago

 @dgoldenberg-audiomack thank you very much for the feedback!

Regarding your question, integratedLoudness is a single value suitable for your purpose. Assuming that your music is normalized to full scale (i.e., -1/1 range), values lower than -16/-20 LUFS could be considered as quiet, and values higher than -7/-6 are definitely very loud.

Note that if gain normalization is applied to your music before the loudness calculation (such as done by EasyLoader) the loudness estimations are not reliable.

dgoldenberg-audiomack commented 1 year ago

Thanks @palonso,

Assuming that your music is normalized to full scale (i.e., -1/1 range)

Could you provide an example or point me at a snippet which performs this type of normalization?

Note that if gain normalization is applied to your music before the loudness calculation (such as done by EasyLoader) the loudness estimations are not reliable.

Currently we're not yet applying any normalizations. If normalization is not applied, would EasyLoader's estimations become more reliable?

Do I understand correctly that your general recommendation is that we use LoudnessEBUR128? This algo sounds like a much stronger approach, IIUC.

palonso commented 1 year ago

Could you provide an example or point me at a snippet which performs this type of normalization?

Sure, using numpy:

import numpy as np

normalized_audio = audio / np.max(np.abs(audio))

For additional context, this is sometimes referred to as peak normalization.

Currently we're not yet applying any normalizations. If normalization is not applied, would EasyLoader's estimations become more reliable?

Generally yes, but this depends a bit on your source of audio. If you are working with professionally mastered music, not applying any normalization should be fine. If you also consider processing music that is not professionally mastered, or that you suspect that its gain could have been trimmed, I would recommend applying peak normalization before estimating loudness.

Do I understand correctly that your general recommendation is that we use LoudnessEBUR128?

Right, especially if your goal is to make a perceptual estimation of loudness.

dgoldenberg-audiomack commented 1 year ago

Thanks, @palonso.

When processing data in bulk, is it possible to reuse objects such as the loaders and the algos? Are they thread-safe?

palonso commented 1 year ago

Yes, you can reuse the algorithms. Following the example:

files = ["file_1", "file_2"]

# initialize algorithms we will use
loader = EasyLoader()
loudness = Loudness()
accumulator = RealAccumulator()

# use pool to store data
pool = essentia.Pool()

loader.audio >> accumulator.data
accumulator.array >> loudness.signal
loudness.loudness >> (pool, "loudness")

# network is ready, run it
for infile in files:
    pool.clear()
    essentia.reset(loader)
    loader.configure(filename=infile)
    essentia.run(loader)
    print("loudness : ", pool["loudness"])

However, Essentia is not thread-safe, so you should use separate processes to parallelize your bulk analysis.

dgoldenberg-audiomack commented 1 year ago

Perfect, thank you, @palonso!

dgoldenberg-audiomack commented 1 year ago

Hi @palonso,

I'm looking for similar sample snippets for the following:

Dissonance: how should its inputs (frequencies and magnitudes) be produced?

BPM (https://essentia.upf.edu/reference/streaming_RhythmExtractor2013.html): how should its inputs be wired? Is it OK to just do a loader.audio >> bpm.signal?

Key: this takes pcp (vector_real), the input pitch class profile. Would we need to use HPCP? That one also needs frequencies and magnitudes; I would like some clarity on how to get those (similar to the case with Dissonance).

Would appreciate your help

palonso commented 1 year ago

Dissonance

Dissonance expects frequencies and magnitudes as output from SpectralPeaks. This requires computing the spectrum in a frame-wise manner to extract the peaks.

The algorithm chain would be: EasyLoader/MonoLoader >> FrameCutter >> Windowing >> Spectrum >> SpectralPeaks >> Dissonance

You can find tested parametrizations of the algorithms in the unit tests, for example this one.
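
For reference, a minimal streaming sketch of that chain (the file name and the frame/hop/window parameters are illustrative assumptions, not tuned recommendations):

import essentia
from essentia.streaming import (EasyLoader, FrameCutter, Windowing, Spectrum,
                                SpectralPeaks, Dissonance)

loader = EasyLoader(filename="input.wav")
framecutter = FrameCutter(frameSize=4096, hopSize=512)
windowing = Windowing(type="blackmanharris62")
spectrum = Spectrum()
# Dissonance expects the peaks ordered by frequency
peaks = SpectralPeaks(orderBy="frequency")
dissonance = Dissonance()

pool = essentia.Pool()

loader.audio >> framecutter.signal
framecutter.frame >> windowing.frame
windowing.frame >> spectrum.frame
spectrum.spectrum >> peaks.spectrum
peaks.frequencies >> dissonance.frequencies
peaks.magnitudes >> dissonance.magnitudes
dissonance.dissonance >> (pool, "dissonance")

essentia.run(loader)
print("mean dissonance:", pool["dissonance"].mean())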

BPM

Is it OK to just do a loader.audio >> bpm.signal ?

Yes. Just remember to keep the sample rate at 44100 (default in MonoLoader/AudioLoader).

Key

Have a look at KeyExtractor. This is a wrapper for Key that takes audio as input and does all the required steps. loader.audio >> key_extractor.audio

dgoldenberg-audiomack commented 1 year ago

Thanks much for these pointers, @palonso. A side note on these algos; I'm noticing that some algos have multiple outputs such as for example RhythmExtractor2013. If I'm only interested in the bpm value, I'm still 'forced' to connect/extract the rest of the outputs otherwise I get something like this:

RuntimeError: RhythmExtractor2013::ticks is not connected to any sink...

I wonder if it may be of benefit to allow the caller to not connect some of the outputs to sinks?

dgoldenberg-audiomack commented 1 year ago

Hi @palonso

You can find tested parametrizations of the algorithms in the unit tests, for example this one.

If I want to extract dissonance for a wide variety of audio files that I might not know much about upfront, would the parameters used in that test work reasonably well across the board? I mean all the params here:

        fc = FrameCutter(frameSize=4096, hopSize=512)
        windower = Windowing(type='blackmanharris62')
        specAlg = Spectrum(size=4096)
        sPeaksAlg = SpectralPeaks(sampleRate = sampleRate,
                                  maxFrequency = sampleRate/2,
                                  minFrequency = 0,
                                  orderBy = 'frequency')

Also, the algo doc says that I can grab the dissonance as the output of Dissonance. The test computes "the average dissonance over all frames of audio". I'm wondering if for my purposes I can just stick with Dissonance.dissonance? The average seems to be computed just to make sure the output is in the ballpark, correct?

Elsewhere I see samples where some of these algos are used with defaults, e.g.

framecutter = FrameCutter()
windowing = Windowing(type="blackmanharris62")
spectrum = Spectrum()
spectralpeaks = SpectralPeaks(
    orderBy="magnitude", magnitudeThreshold=1e-05, minFrequency=40, maxFrequency=5000, maxPeaks=10000
)

Here, the FrameCutter is defaulted; actually the SpectralPeaks are set differently.

I'm looking for a way to make this very generic because as I mentioned, I don't know much about the files upfront. However, if maybe there is a way to optimize the parameters first, based on the file, then I'd love to add that, if you have any recommendations. Although simpler seems better for now.

palonso commented 1 year ago

I wonder if it may be of benefit to allow the caller to not connect some of the outputs to sinks?

You can discard an algorithm's output like this: algorithm.output >> None
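
For example, a hedged streaming sketch for the BPM case above, keeping only the bpm output of RhythmExtractor2013 (the file name is illustrative; the output names are taken from the reference page):

import essentia
from essentia.streaming import EasyLoader, RhythmExtractor2013

loader = EasyLoader(filename="input.wav")
rhythm = RhythmExtractor2013()
pool = essentia.Pool()

loader.audio >> rhythm.signal
rhythm.bpm >> (pool, "bpm")
# discard the outputs we don't need so the network can still run
rhythm.ticks >> None
rhythm.confidence >> None
rhythm.estimates >> None
rhythm.bpmIntervals >> None

essentia.run(loader)
print("bpm:", pool["bpm"])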

I'm wondering if for my purposes I can just stick with Dissonance.dissonance? The average seems to be computed just to make sure the output is in the ballpark, correct?

This is entirely up to your use case. In many cases it makes sense to work with track-averaged values, for example, to compare values between songs with different lengths.

I haven't experimented with this algorithm personally, so I can't recommend you a set of parameters. You could optimize the algorithm's parameters by annotating a small dataset with expected dissonance values yourself and trying different combinations of parameters to see which one correlates better with your annotations.

dgoldenberg-audiomack commented 1 year ago

Hi @palonso, thanks for your reply. I'm experimenting with LoudnessEBUR128.

Here's what I've got so far:

loader = AudioLoader(filename=infile)
loudness_e = LoudnessEBUR128(startAtZero=True)

pool = essentia.Pool()

loader.audio >> loudness_e.signal
loader.sampleRate >> None
loader.numberChannels >> None
loader.md5 >> None
loader.bit_rate >> None
loader.codec >> None

loudness_e.integratedLoudness >> (pool, "integrated_loudness")
loudness_e.momentaryLoudness >> None
loudness_e.shortTermLoudness >> None
loudness_e.loudnessRange >> None

essentia.run(loader)

Questions:

  1. I noticed that AudioLoader outputs a sampleRate and LoudnessEBUR128 has sampleRate as one of its parameters. With the way the code is written so far, would the sample rate automatically propagate into LoudnessEBUR128? If not, how could I pass it from the loader to the extractor algo?
  2. Would you recommend the default value 0.1 for the hop size?
  3. For peak normalization, we've discussed the following:
    normalized_audio = audio / np.max(np.abs(audio))

    How would I wire this into the 'network'? This transform as is seems to just cause me runtime errors.

Also, as far as the "perceptual loudness" assessment:

values lower than -16/-20 LUFS could be considered as quiet, and values higher than -7/-6 are definitely very loud.

So if we were to use a "T-shirt" approach to discern very quiet, quiet, regular loudness, loud, and very loud, what would a good mapping be?

loudness type   value range start   value range end
very quiet      ?                   ?
quiet           ?                   -16
normal          ?                   ?
loud            ?                   ?
very loud       -6                  ?

dgoldenberg-audiomack commented 1 year ago

Hi @palonso sorry to bombard you with questions :)

I'm looking into a few tensorflow-based algos. I've done a pip install essentia-tensorflow but keep getting errors such as this one:

Traceback (most recent call last):
  File "danceability.py", line 2, in <module>
    from essentia.standard import MonoLoader, TensorflowPredictMusiCNN
ImportError: cannot import name 'TensorflowPredictMusiCNN' from 'essentia.standard' (/home/.local/lib/python3.7/site-packages/essentia/standard.py)

Code:

import sys
from essentia.standard import MonoLoader, TensorflowPredictMusiCNN

if len(sys.argv) == 2:
    infile = sys.argv[1]
else:
    print("usage: %s <input audio file>" % sys.argv[0])
    sys.exit()

audio = MonoLoader(filename=infile, sampleRate=16000)()
model = TensorflowPredictMusiCNN(graphFilename="danceability-musicnn-msd-2.pb")
predictions = model(audio)

Versions:

essentia                                 2.1b6.dev1034
essentia-tensorflow                      2.1b6.dev1034

Any ideas as to what might be missing? Thanks

palonso commented 1 year ago

Questions:

I noticed that AudioLoader outputs a sampleRate and LoudnessEBUR128 has sampleRate as one of its parameters. With the way the code is written so far, would the sample rate automatically propagate into LoudnessEBUR128? If not, how could I pass it from the loader to the extractor algo?

Please take a look at my answer below.

Would you recommend the default value 0.1 for the hop size?

Yes, we normally compute loudness with the default hop size.

How would I wire this into the 'network'? This transform as is seems to just cause me runtime errors.

To peak-normalize the audio you need to have access to the entire signal before processing. Thus, streaming mode is not the most suitable paradigm. Alternatively, you could:

  1. Define all algorithms in standard mode.
  2. Load audio and read the sample rate.
  3. Normalize the audio using numpy as mentioned.
  4. Reconfigure LoudnessEBUR128 according to the audio sample rate (algorithm.configure(sampleRate=sr)).
  5. Compute loudness (see the sketch below).
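
A minimal standard-mode sketch of these steps (the file name and the use of AudioLoader to obtain the stereo signal and sample rate are assumptions for illustration):

import numpy as np
import essentia.standard as es

# 1-2. load the (stereo) audio and read its sample rate
audio, sr, n_channels, md5, bit_rate, codec = es.AudioLoader(filename="input.wav")()

# 3. peak-normalize the full signal
audio = audio / np.max(np.abs(audio))

# 4. configure LoudnessEBUR128 with the file's sample rate
loudness = es.LoudnessEBUR128(sampleRate=sr, startAtZero=True)

# 5. compute loudness
momentary, short_term, integrated, loudness_range = loudness(audio)
print("integrated loudness (LUFS):", integrated)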

So if we were to use a "T-shirt" approach to discern: very quiet, quiet, regular loudness, loud, very loud, what would a good mapping be?

I can't provide an answer since I'm not an expert on loudness metering. However, since the EBU R128 standard is widely used in the industry, you should be able to find several resources related to your question online.

I'm looking into a few tensorflow-based algos. I've done a pip install essentia-tensorflow but keep getting errors such as this one:

According to your error, Python is loading essentia and not essentia-tensorflow. Since the packages are not complementary, you should make sure that essentia-tensorflow is the one being loaded. You can achieve this by removing both packages and reinstalling essentia-tensorflow:

pip uninstall essentia
pip uninstall essentia-tensorflow
pip install essentia-tensorflow

dgoldenberg-audiomack commented 1 year ago

That makes sense, @palonso, thank you. I'll look into EBU R128.

Working with essentia-tensorflow, I've run into a few issues.

  1. Model file names. There are discrepancies between the model file names in the doc vs. the files which the package apparently supports, going by this location: https://essentia.upf.edu/models/. For example, for Danceability, one of the samples references danceability-musicnn-msd-2.pb but the actual file name appears to be danceability-msd-musicnn-1.pb (?)
  2. Model file handling. At runtime, the model files are not found. I get the following type of error: RuntimeError: Error while configuring TensorflowPredictMusiCNN: TensorflowPredict: could not open the Tensorflow graph file. Is this intentional in the package to cause the user to cherrypick the model files they need from the model file repository rather than bloat by downloading all of them? However, if we're to pluck them out, we'd need to maintain them in our codebase somewhere, which would bloat the codebase size and if there are updates, we may or may not get them in time. What's the intended usage pattern here, for the model files?
  3. Adding to item 2 here - Going by your comment in https://github.com/MTG/essentia/issues/1313 "Can you download the model, place it in the same folder as your script..." - placing the model next to the py file fixes the model file not found problem. Ideally, I'd like to avoid keeping the model files in the codebase. Any recommendation? If this is the only way, then do we need both the .pb file and .onnx and .json, if any?
  4. model/Sigmoid. Sample code:
    audio = MonoLoader(filename=infile, sampleRate=16000)()
    model = TensorflowPredictMusiCNN(graphFilename="danceability-msd-musicnn-1.pb")
    predictions = model(audio)

    This and a few other cases yield the below error:

    
    Traceback (most recent call last):
      File "danceability.py", line 12, in <module>
        model = TensorflowPredictMusiCNN(graphFilename="danceability-msd-musicnn-1.pb")
      File "/home/airflow/.local/lib/python3.7/site-packages/essentia/standard.py", line 44, in __init__
        self.configure(**kwargs)
      File "/home/airflow/.local/lib/python3.7/site-packages/essentia/standard.py", line 64, in configure
        self.__configure__(**kwargs)
    RuntimeError: Error while configuring TensorflowPredictMusiCNN: TensorflowPredict: 'model/Sigmoid' is not a valid node name of this graph.
    TensorflowPredict: Available node names are:
    model/Placeholder, dense/kernel, dense/kernel/read, dense/bias, dense/bias/read, model/dense/MatMul, model/dense/BiasAdd, model/dense/Relu, dense_1/kernel, dense_1/kernel/read, dense_1/bias, dense_1/bias/read, model/dense_1/MatMul, model/dense_1/BiasAdd, model/Softmax.

Reconfigure this algorithm with valid node names as inputs and outputs before starting the processing.


Any recommendation as to how to fix this?
5. Lastly, a minor issue. Any recommendation on how to suppress the verbose output/warnings from TF? e.g. `Could not load dynamic library 'libcudart.so.11.0'` etc.  I was thinking something like what's described [here](https://stackoverflow.com/questions/48608776/how-to-suppress-tensorflow-warning-displayed-in-result) on SOF - ?
palonso commented 1 year ago
  1. By default, use the latest version available at https://essentia.upf.edu/models/. In this case, confusion may arise from the difference between the danceability classifiers (v1 and v2) and the classification heads (v1). Both options should work, but currently we recommend using the classification heads (these are the ones with examples on our site). Note that this is especially convenient in order to reuse the embeddings for multiple classifiers.
  2. Yes, for now, it is the responsibility of the user to download a set and indicate the path to the models.
  3. You don't need to have the models in your codebase. Just set graphFilename to the /path/to/your/model.pb. You don't need the .onnx files, but the .json contains information such as the names of the output classes and the names of the input/output nodes (layers) of the model.
  4. These models have a Softmax instead of a sigmoid output layer; set output=model/Softmax in TensorflowPredictMusiCNN for these cases (see the sketch after this list).
  5. set TF_CPP_MIN_LOG_LEVEL=3 e.g.: TF_CPP_MIN_LOG_LEVEL=3 python my_script.py
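
For item 4 above, a minimal sketch of that reconfiguration (model file name taken from the earlier example):

from essentia.standard import MonoLoader, TensorflowPredictMusiCNN

audio = MonoLoader(filename="audio.wav", sampleRate=16000)()
# point the output parameter at the Softmax node listed in the error message
model = TensorflowPredictMusiCNN(graphFilename="danceability-msd-musicnn-1.pb", output="model/Softmax")
predictions = model(audio)
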
dgoldenberg-audiomack commented 1 year ago

Thank you for your fast reply @palonso; very helpful!

palonso commented 1 year ago

So the .json files are informative; but are they required by the library? sounds like not?

Correct, they are not needed by the library.

For the predictions and embeddings that we get from the various models, what's the general strategy for their use? What I mean is, if we want to get a classifier type of value for Danceability, the doc says Music danceability (2 classes): danceable, not_danceable. How does one map the emitted predictions and/or embeddings to the class values such as danceable vs. not danceable? either as boolean or, preferably float type qualifiers.

The embeddings are an intermediate representation to get the predictions through the classification heads. The predictions are a 2D matrix [timestamps, classes] since models operate on windows of 2 seconds. A common way to process the predictions is to average the temporal dimension (first axis), which gives you a vector of overall probabilities for each class.
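
As a hedged sketch of that averaging, assuming predictions is the [timestamps, classes] matrix from the danceability model and that the class order is (danceable, not_danceable) as listed in the documentation:

import numpy as np

class_probabilities = np.mean(predictions, axis=0)  # average over the time axis
danceable, not_danceable = class_probabilities
print("danceable: %.3f, not danceable: %.3f" % (danceable, not_danceable))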

dgoldenberg-audiomack commented 1 year ago

Hi @palonso, sorry, could you elaborate?

audio = MonoLoader(filename=infile, sampleRate=16000)()
model = TensorflowPredictMusiCNN(graphFilename="danceability-musicnn-msd-2.pb")
predictions = model(audio)

The 2D matrix looks like this, for example:

[0.3348741 0.6392199]
[0.30579954 0.66699344]
[0.32352865 0.65015364]
[0.3717062 0.6356408]
[0.3720143 0.6452822]
[0.39193273 0.6317995 ]
....

[timestamps, classes]

Do I understand this correctly, in that, using the example values above, the probability for "is danceable" would be the average of [ 0.3348741, 0.30579954, 0.32352865, ..., 0.39193273 ] (the first column) and the probability for "is not danceable" would be the average of [ 0.6392199, 0.66699344, 0.65015364, ..., 0.6317995 ] (the second column)?

The doc just says, "danceable, not_danceable" for the classes; I just want to make sure I'm processing the output predictions correctly.


set TF_CPP_MIN_LOG_LEVEL=3

To set this in code, I had to add os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3" before the imports of essentia/tf, otherwise this didn't have an effect.
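
For reference, a minimal sketch of that ordering (the variable must be set before any essentia/tensorflow import):

import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"  # must come before the essentia imports

from essentia.standard import MonoLoader, TensorflowPredictMusiCNN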


Also, could you outline (or point me at) any steps required to get essentia-tensorflow to run on a gpu(s)?

Actually, for that, we're wondering how useful running on gpu may be in terms of performance gains? Any ballpark gauge? Thanks.

palonso commented 1 year ago

Do I understand this correctly, in that, using the example values above, the probability for "is danceable" would be the average of [ 0.3348741, 0.30579954, 0.32352865, ..., 0.39193273 ] (the first column) and the probability for "is not danceable" would be the average of [ 0.6392199, 0.66699344, 0.65015364, ..., 0.6317995 ] (the second column)?

Correct.

Also, could you outline (or point me at) any steps required to get essentia-tensorflow to run on a gpu(s)?

You need to install the CUDA and CuDNN libraries. An option is to use a package manager such as conda as explained on TensorFlow's installation guide. In our case, we need CUDA==11.2 and CUDNN=8.1: conda install -c conda-forge -y cudatoolkit=11.2 cudnn=8.1.

You can expect speed improvements up to 2X, since the extraction of the input mel spectrograms for the models still happens in the CPU and can not be accelerated.

dgoldenberg-audiomack commented 1 year ago

Thanks, @palonso.

Curious about the Music style classification algo, discogs-effnet. The doc there refers to "400 styles from the Discogs taxonomy"; I'm counting about 388 in that list. Actually, the labels.py file has 400. Am I concluding correctly that the idea is to use that labels.py as the ultimate 'source of truth' for the list of predicted genres?

More interestingly, the output of this:

audio = MonoLoader(filename=infile, sampleRate=16000)()
model = TensorflowPredictEffnetDiscogs(graphFilename="discogs-effnet-bs64-1.pb")
predictions = model(audio)

For my sample file, I got a list of 167 numpy.ndarray's, each of which tends to be close to 400 in length but generally not exactly 400; it's 396, 388, etc. It's not clear to me how these predictions can be mapped to the labels.

Could you describe the process of mapping of these predictions to the actual genre labels?

Also, how could we manage the updates to the genre list? If more genres are added going forward, I'd like to structure our code to be sensitive/flexible to such changes.


For danceability, this works:

loader = MonoLoader(filename=infile, sampleRate=16000)()
model = TensorflowPredictMusiCNN(graphFilename="./models/danceability-musicnn-msd-2.pb")
predictions = model(loader)

Is it possible to express this using the >> operator?

pool = essentia.Pool()
loader.audio >> model.signal
model.predictions >> (pool, "danc_predictions")
essentia.run(loader)

This yields:

Traceback (most recent call last):
  File "danceability.py", line 51, in <module>
    loader.audio >> model.signal
AttributeError: 'numpy.ndarray' object has no attribute 'audio'
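
The error above suggests the loader was created in standard mode and already called, so loader is a numpy array rather than a network node. A hedged streaming-mode sketch of the same network, assuming infile is defined as in the earlier scripts and that TensorflowPredictMusiCNN is also taken from essentia.streaming:

import essentia
from essentia.streaming import MonoLoader, TensorflowPredictMusiCNN

loader = MonoLoader(filename=infile, sampleRate=16000)  # not called: keep it as a streaming node
model = TensorflowPredictMusiCNN(graphFilename="./models/danceability-musicnn-msd-2.pb")
pool = essentia.Pool()

loader.audio >> model.signal
model.predictions >> (pool, "danc_predictions")

essentia.run(loader)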

Another question on the loader reuse. I'm using EasyLoader for extracting things like loudness and BPM; I'm using MonoLoader with the sampleRate of 16K for danceability and mood. Is it possible to set things up so as to use a single loader instance, e.g. the MonoLoader with the 16K sampleRate? Or might that negatively affect other extractions such as loudness or BPM?

dgoldenberg-audiomack commented 1 year ago

Hi @palonso,

I got a list of 167 numpy.ndarray's each of which tends to be close to 400 in length but generally not; it's 396, 388, etc.

I've seen this issue once and have not seen it since, so far.

Question on the approachability model. The doc states that

The models output either two (approachability_2c) or three (approachability_3c) levels of approachability, or continuous values (approachability_regression).

I assume that the 2c model outputs the approachable/non_approachable classifiers. What about the 3c? What does the 3rd classifier signify? Which of the 3 models would you recommend as the most accurate?

For example, for the same file, I'm seeing:

Similar questions on the engagement model, too.

For arousal/valence, would you recommend the DEAM or the Muse model, for general processing of files? Thanks.

dgoldenberg-audiomack commented 1 year ago

Hi @palonso ,

Question on the MTG-Jamendo genre algo. The doc states that it yields 87 classes. However, I'm getting 167, not 87. Any idea as to what the other 80 classes are?

    audio = MonoLoader(filename=infile, sampleRate=16000, resampleQuality=4)()
    embedding_model = TensorflowPredictEffnetDiscogs(
        graphFilename="./models/discogs-effnet-bs64-1.pb", output="PartitionedCall:1"
    )
    embeddings = embedding_model(audio)
    model = TensorflowPredict2D(graphFilename="./models/mtg_jamendo_genre-discogs-effnet-1.pb")
    predictions = model(embeddings)

    print(">> num preds: {}".format(len(predictions)))
dgoldenberg-audiomack commented 1 year ago

Thank you, @palonso. Could you explain the 3 predictions set? How does it work since I presume one is for the "approachable" class, one for "non_approachable", and what's the third class for?

palonso commented 1 year ago

I assume that the 2c model outputs the approachable/non_approachable classifiers. What about the 3c? What does the 3rd classifier signify? Which of the 3 models would you recommend as the most accurate?

2c's output is low-approachability, high-approachability. 3c's output is low-approachability, medium-approachability, high-approachability. The regression model outputs continuous values from 0 to 1, from low to high, and performed the best in our internal evaluation.

The same applies to the engagement model.

For arousal/valence, would you recomment the DEAM or the Muse model, for general processing of files?

What do you mean by general processing of files? Both datasets contain music data only, so the resulting models shouldn't be expected to perform well with other types of signals (e.g., solo instruments, speech), although we never assessed the performance of the models in these scenarios.

According to our study, models based on emoMusic obtained the best performance. Between Deam and Muse, Deam is better for arousal, and Muse is better for valence.

Question on the MTG-Jamendo genre algo. The doc states that it yields 87 classes. However, I'm getting 167, not 87. Any idea as to what the other 80 classes are?

Already explained above

dgoldenberg-audiomack commented 1 year ago

Thank you @palonso , as always, very helpful.

MTG-Jamendo genre algo. The doc states that it yields 87 classes. However, I'm getting 167, not 87. Already explained above...

    model = TensorflowPredict2D(graphFilename="./models/mtg_jamendo_genre-discogs-effnet-1.pb")
    predictions = model(embeddings)

Sorry for the repeat, but your explanation seems to be related to a different case. I'm still not grokking how to map 167 predictions to 87 classifiers - ?

palonso commented 1 year ago

The first dimension is not the number of classes but the number of timestamps. This number will vary with the length of the input audio since our models generate predictions every 3 seconds.

For most applications, it's enough to generate a single overall value by taking the mean:

overall_results = np.mean(predictions, axis=0)
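
A hedged sketch of pairing the averaged activations with the 87 class names, assuming the metadata .json published alongside the model lists them under a "classes" key (the file path is an assumption based on the model name):

import json
import numpy as np

overall_results = np.mean(predictions, axis=0)

with open("./models/mtg_jamendo_genre-discogs-effnet-1.json") as f:
    labels = json.load(f)["classes"]

# show the five most activated genre labels
top5 = sorted(zip(labels, overall_results), key=lambda x: x[1], reverse=True)[:5]
for label, activation in top5:
    print("%s: %.3f" % (label, activation))
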
dgoldenberg-audiomack commented 1 year ago

Hi @palonso. Sorry, not following. If you generate the mean of predictions, that's a single value. How does it map to the 87 classes? i.e. how does it indicate the predicted styles?

(Actually, nevermind, this yields a numpy.ndarray of 87 values :) :) )

dgoldenberg-audiomack commented 1 year ago

Hi @palonso we've noticed that the discogs music style extractor seems to tilt toward these two values a bit too actively: electronic_experimental, electronic_vaporwave. Anything that can be done on the training side to alleviate this?

For now, skipping these seems to alleviate the issue somewhat, although then these two start proliferating heavily: electronic_ambient, electronic_abstract.

More generally, is there a way to collaborate with mtg to train on specific datasets toward a custom genre set?

Similarly, most songs seem to come out with moods of dark and soundscape, using the mtg_jamendo_moodtheme-discogs-effnet-1.pb model.

Any ideas on these tilts?

dgoldenberg-audiomack commented 1 year ago

Hi @palonso, another question. I keep seeing this error; could you comment on the potential cause/remedy? I see the error being handled in sourcebase.h but I don't yet understand the cause.

On the invocation:

        essentia.run(self.easy_loader)

this happens:

in _perform_extraction
    essentia.run(self.easy_loader)
  File "/home/airflow/.local/lib/python3.7/site-packages/essentia/__init__.py", line 148, in run
    return _essentia.run(gen)
RuntimeError: While trying to push item into source OnsetDetectionGlobal::onsetDetections:
OnsetDetectionGlobal::onsetDetections: Could not push 1 value, output buffer is full
dgoldenberg-audiomack commented 1 year ago

Hi @palonso, another question: is there much difference between the happiness feature and the valence feature? Clearly, both of these overlap around the notion of positiveness but, any specific differences?

dgoldenberg-audiomack commented 1 year ago

Hi @palonso, can you recommend a way to work around this issue:

Code:

from essentia.standard import MonoLoader, TensorflowPredictVGGish, TensorflowPredict2D

audio = MonoLoader(filename="audio.wav", sampleRate=16000, resampleQuality=4)()
embedding_model = TensorflowPredictVGGish(graphFilename="audioset-vggish-3.pb", output="model/vggish/embeddings")
embeddings = embedding_model(audio)

model = TensorflowPredict2D(graphFilename="danceability-audioset-vggish-1.pb", output="model/Softmax")
predictions = model(embeddings)

Yields the following error:

 RuntimeError: In TensorflowPredictMusiCNN.compute: TensorflowPredict: Error running the Tensorflow session.
 Input to reshape is a tensor with 843744 values, but the requested shape requires a multiple of 6144
     [[{{node model/vggish/Reshape}}]]
dgoldenberg-audiomack commented 1 year ago

Hi @palonso , could you comment on the below findings? We've observed the following, using a test set of songs representative of the content on our site:

  1. The happy feature. The values seem too low across the board. Example: Aroma by Christian Alicea (a happy, upbeat latin song):

     Model                        Value of 'happy'
     mood_happy-audioset-vggish   0.16
     mood_happy-audioset-yamnet   0.17
     mood_happy-discogs-effnet    0.17
     mood_happy-msd-musicnn       0.28

  2. relaxed and non_relaxed seem reversed. E.g., SUFFOCATE (VIP) by Kayzo (a heavy metal song) gets 0.64 0.69 0.59 0.84 (vggish, yamnet, effnet, musicnn), while the calm, slow blues song Folsom Prison Blues by Johnny Cash got 0.11 0.10 0.09 0.01. Seems like these should be the other way around.
  3. sad seems to yield very high numbers, especially for genre=rap. E.g., 0.88 0.84 0.90 0.96 for SUFFOCATE (VIP) by Kayzo, 0.73 0.72 0.89 0.86 for Latch by Disclosure, 0.74 0.58 0.79 0.90 for God's Plan by Drake (vggish, yamnet, effnet, musicnn).

Any commentary? Thanks.

dgoldenberg-audiomack commented 1 year ago

Hi @palonso could you follow up on my last 5 comments pls? Thanks.

palonso commented 1 year ago

@dgoldenberg-audiomack I'll answer your questions:

Hi @palonso we've noticed that the discogs music style extractor seems to tilt toward these two values a bit too actively: electronic_experimental, electronic_vaporwave. Anything that can be done on the training side to alleviate this?

These genres (especially experimental) may be overrepresented in the training set, which produces over-detections. Since the outputs of our model are probabilities, you can define your own custom threshold for specific classes (e.g., only consider experimental predictions when the probability > 0.7).

For now, I'm skipping these seems to alleviate the issue somewhat, although then these two start proliferating heavily: electronic_ambient, electronic_abstract.

Sure, discarding classes that are not useful for you is also an option.

More generally, is there a way to collaborate with mtg to train on specific datasets toward a custom genre set?

Yes, we can provide consultancy services for training custom models if you are interested.

Similarly, most songs seem to come out with moods of dark and soundscape, using the mtg_jamendo_moodtheme-discogs-effnet-1.pb model.

The mtg_jamendo_moodtheme subset is known to be especially noisy, so the predictions should be taken with a pinch of salt.

Hi @palonso, another question. I keep seeing this error; could you comment on the potential cause/remedy? I see the error being handled in sourcebase.h but not yet understanding the cause.

On the invocation:

    essentia.run(self.easy_loader)

Could you provide the full code and the audio producing the exception?

is there much difference between the happiness feature and the valence feature? Clearly, both of these overlap around the notion of positiveness but, any specific differences?

These models were trained with very different datasets, so even if the tasks are semantically similar, the predictions may be very different. I recommend you explore both and choose the one that performs better for your use case.

Can you recommend a way to work around this issue:

Code:

from essentia.standard import MonoLoader, TensorflowPredictVGGish, TensorflowPredict2D

audio = MonoLoader(filename="audio.wav", sampleRate=16000, resampleQuality=4)()
embedding_model = TensorflowPredictVGGish(graphFilename="audioset-vggish-3.pb", output="model/vggish/embeddings")
embeddings = embedding_model(audio)

model = TensorflowPredict2D(graphFilename="danceability-audioset-vggish-1.pb", output="model/Softmax")
predictions = model(embeddings)

Yields the following error:

RuntimeError: In TensorflowPredictMusiCNN.compute: TensorflowPredict: Error running the Tensorflow session. Input to reshape is a tensor with 843744 values, but the requested shape requires a multiple of 6144 [[{{node model/vggish/Reshape}}]]

This is a bit weird. You use TensorflowPredictVGGish and TensorflowPredict2D, but the error refers to TensorflowPredictMusiCNN. It seems like the error doesn't correspond to the code you referenced.

The happy feature. The values seem too low across the board.

The dataset we used for this model is not very big, and it makes sense that there is nothing close to the example you give in the happy class.

relaxed and non_relaxed seem reversed

Yes, you can check the label order in the metadata file: non_relaxed, relaxed.

sad seems to yield very high numbers

I agree that the model tends to over-detect the sad class in the examples you give.

dgoldenberg-audiomack commented 1 year ago

Thank you @palonso.

we can provide consultancy services for training custom models if you are interested.

What is the channel for us to contact MTG for that? Is there a specific email address / format?

palonso commented 1 year ago

You can send us an email (pablo.alonso and dmitry.bogdanov @upf.edu) to arrange a meeting.

dgoldenberg-audiomack commented 1 year ago

Hi @palonso ,

As far as the exception I'm seeing, it seems to be related to the size of the input file and feels like an out-of-memory.

def print_results(model_name, aggressive, non_aggressive):
    print(f">> Classifiers with '{model_name}':")
    print("    aggressive: {}".format(aggressive))
    print("    non_aggressive: {}".format(non_aggressive))
    print()

def get_classifiers(predictions):
    import pandas as pd
    df = pd.DataFrame(predictions, columns=["col1", "col2"])
    mean_lst = df.mean().to_list()
    first = round(mean_lst[0], 4)
    second = round(mean_lst[1], 4)
    return first, second

def _get_aggressive(audio, embedding_model, pred_model_filepath):
    embeddings = embedding_model(audio)
    model = TensorflowPredict2D(graphFilename=pred_model_filepath, output="model/Softmax")
    predictions = model(embeddings)
    return get_classifiers(predictions)

def get_aggressive_msd_musicnn(audio, local=False):
    embedding_model = TensorflowPredictMusiCNN(
        graphFilename=get_path("msd-musicnn-1.pb", local), output="model/dense/BiasAdd"
    )
    return _get_aggressive(audio, embedding_model, get_path("mood_aggressive-msd-musicnn-1.pb", local))

audio = MonoLoader(filename=infile, sampleRate=16000, resampleQuality=4)()
try:
    aggressive, non_aggressive = get_aggressive_msd_musicnn(audio, local=True)
    print_results("mood_aggressive-msd-musicnn", aggressive, non_aggressive)
except Exception as e:
    print("@@@ ERROR:")
    print(str(e))

In EMR, I get the following type of error:

Msg: While trying to push item into source OnsetDetectionGlobal::onsetDetections: OnsetDetectionGlobal::onsetDetections: Could not push 1 value, output buffer is full

Locally, my python process just gets killed which really feels like an OOM.

What's the usual practice for dealing with large WAV files? And/or might this need to be fixed in the API?

I've shared a sample file with you via Dropbox (only up to 25 MB is allowed here).

dgoldenberg-audiomack commented 1 year ago

Hi @palonso ,

Wondering if we can work around this by setting a limit on the amount of content processed per file?

EasyLoader has:

endTime (real ∈ [0, ∞), default = 1e+06) :
the end time of the slice to be extracted [s]

and TensorflowPredictEffnetDiscogs, TensorflowPredictMusiCNN, and TensorflowPredictVGGish have:

batchSize (integer ∈ [-1, ∞), default = 64) :
the batch size for prediction. This allows parallelization when GPUs are available.
Set it to -1 or 0 to accumulate all the patches and run a single TensorFlow session 
at the end of the stream

MonoLoader doesn't seem to have any limiting abilities, however. Perhaps we can do away with using MonoLoader in favor of EasyLoader.

dgoldenberg-audiomack commented 1 year ago

Hi @palonso,

Another issue. We're experimenting with fingerprint extraction per the doc.

The code is as below. We keep getting the following error:

Traceback (most recent call last):
  File "./af_utils/fingerprint.py", line 34, in <module>
    fp_int = ai.chromaprint.decode_fingerprint(fp)[0]
  File "/home/airflow/.local/lib/python3.7/site-packages/chromaprint.py", line 172, in decode_fingerprint
    ctypes.byref(algorithm), 1 if base64 else 0
ctypes.ArgumentError: argument 1: <class 'TypeError'>: wrong type

We're using pyacoustid 1.2.2.

Any ideas as to what might be going wrong?

Code:

import sys

import acoustid as ai
import essentia
from essentia.streaming import MonoLoader, Chromaprinter

if __name__ == "__main__":
    if len(sys.argv) == 2:
        infile = sys.argv[1]
    else:
        print("usage: %s <input audio file>" % sys.argv[0])
        sys.exit()

    loader = MonoLoader(filename=infile)
    fps = Chromaprinter(analysisTime=30, concatenate=True)
    pool = essentia.Pool()

    loader.audio >> fps.signal
    fps.fingerprint >> (pool, "chromaprint")

    essentia.run(loader)

    cp = pool["chromaprint"]
    if len(cp) > 0:
        fp = cp[0]
        print()
        print("@@@ Fingerprint:")
        print()
        print(fp)
        print("*" * 80)
        print()
        print("@@@ Numerical representation:")
        fp_int = ai.chromaprint.decode_fingerprint(fp)[0]
        for x in fp_int:
            print(x)
    else:
        print("Error: failed to extract fingerprint.")
dgoldenberg-audiomack commented 1 year ago

Hi @palonso,

Regarding the fingerprint, having another issue while trying to make a call into Acoustid, per the tutorial:

loader = MonoLoader(filename=infile)
fps = Chromaprinter(analysisTime=30, concatenate=True)
duration = len(loader.audio) / 44100

yields this error:

TypeError: object of type '_StreamConnector' has no len()

while

from essentia.standard import MonoLoader
audio = MonoLoader(filename=infile, sampleRate=44100)()
fingerprint = Chromaprinter()(audio)
duration = ...

yields

   fingerprint = Chromaprinter()(audio)
TypeError: 'StreamingAlgo' object is not callable

What's the recommended coding pattern here?
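
One pattern that should avoid mixing the modes, as a hedged sketch: import both MonoLoader and Chromaprinter from essentia.standard so the algorithms are callable, then compute the duration from the loaded samples (assuming infile is defined as in the earlier scripts):

from essentia.standard import MonoLoader, Chromaprinter

audio = MonoLoader(filename=infile, sampleRate=44100)()
fingerprint = Chromaprinter()(audio)
duration = len(audio) / 44100.0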

dgoldenberg-audiomack commented 1 year ago

Hi @palonso ,

I'm also seeing that the fingerprint is occasionally just silently not generated, and thus nothing comes in via pool["chromaprint"]. Any idea as to why this may be happening?

    cp = pool["chromaprint"]
    if len(cp) > 0:
        fp = cp[0]
        print()
        print("@@@ Fingerprint:")
        print()
        print(fp)
        print("*" * 80)
        print()
        print("@@@ Numerical representation:")
        fp_int = ai.chromaprint.decode_fingerprint(fp)[0]
        for x in fp_int:
            print(x)
    else:
        print("Error: failed to extract fingerprint.")
dbogdanov commented 7 months ago

Closing this for now. If you have further questions unrelated to loudness algorithms, please open new issues.