MTG / essentia

C++ library for audio and music analysis, description and synthesis, including Python bindings
GNU Affero General Public License v3.0

Problem connecting the Pipeline for DiscogsEffnet ('StreamingAlgo' object has no attribute 'poolIn') #1392

Closed perli99 closed 4 months ago

perli99 commented 8 months ago

Hello, I tried to follow the same procedure as this one for a live implementation of DiscogsEffnet, just swapping the ML model for the combination of the Discogs embedding model and a classification head, but I ran into this problem while connecting the pipeline:

vimp = VectorInput(buffer)
fc = FrameCutter(frameSize=frameSize, hopSize=hopSize)
tim = TensorflowInputMusiCNN()
vtt = VectorRealToTensor(shape=[1, 1, patchSize, numberBands], lastPatchMode='discard')
ttp = TensorToPool(namespace=inputLayer)

# Embedding model
tfpED = TensorflowPredictEffnetDiscogs(graphFilename=embeddingModelName, output='PartitionedCall:1')

# Genre prediction model
model = TensorflowPredict2D(graphFilename=predictionModelName, input=inputLayer, output=outputLayer)

vimp.data >> fc.signal
fc.frame >> tim.frame
tim.bands >> vtt.frame
tim.bands >> (pool, 'melbands')
vtt.tensor >> ttp.tensor
ttp.pool >> tfpED.poolIn
tfpED.poolOut >> (pool, 'embeddings')

embeddings_tensor = PoolToTensor(namespace='embeddings')
(pool, 'embeddings') >> embeddings_tensor.tensor
embeddings_tensor.tensor >> model.tensorIn
model.tensorOut >> (pool, outputLayer)

Which produced the error message:

Cell In[43], line 7
      5 tim.bands >> (pool, 'melbands')
      6 vtt.tensor >> ttp.tensor
----> 7 ttp.pool >> tfpED.poolIn
      8 tfpED.poolOut >> (pool, 'embeddings')
     10 embeddings_tensor = PoolToTensor(namespace='embeddings')

AttributeError: 'StreamingAlgo' object has no attribute 'poolIn'

I searched the source on GitHub, and poolIn appears in /src/algorithms/machinelearning/tensorflowpredicteffnetdiscogs.cpp, but only inside an already pre-built algorithm:

  AlgorithmFactory& factory = AlgorithmFactory::instance();

  _frameCutter            = factory.create("FrameCutter");
  _tensorflowInputMusiCNN = factory.create("TensorflowInputMusiCNN");
  _vectorRealToTensor     = factory.create("VectorRealToTensor");
  _tensorToPool           = factory.create("TensorToPool");
  _tensorflowPredict      = factory.create("TensorflowPredict");
  _poolToTensor           = factory.create("PoolToTensor");
  _tensorToVectorReal     = factory.create("TensorToVectorReal");


  _signal                                  >> _frameCutter->input("signal");
  _frameCutter->output("frame")            >> _tensorflowInputMusiCNN->input("frame");
  _tensorflowInputMusiCNN->output("bands") >> _vectorRealToTensor->input("frame");
  _vectorRealToTensor->output("tensor")    >> _tensorToPool->input("tensor");
  _tensorToPool->output("pool")            >> _tensorflowPredict->input("poolIn");
  _tensorflowPredict->output("poolOut")    >> _poolToTensor->input("pool");
  _poolToTensor->output("tensor")          >> _tensorToVectorReal->input("tensor");

  attach(_tensorToVectorReal->output("frame"), _predictions);

  _network = new scheduler::Network(_frameCutter);

So do I not have to build the pipeline myself, and just connect the embedding model and the classification head? If so, how do I do that? Sorry if this is rather obvious. Kind regards, and I hope you had a good start to the new year :)

palonso commented 8 months ago

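Hi @perli99, as you mention, the TensorflowPredict"Model" algorithms are wrappers that contain every step of the pipeline. TensorflowPredictEffnetDiscogs therefore exposes a signal input and a predictions output rather than poolIn and poolOut, which is why your connection fails. This is how to make it work in streaming mode: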

import numpy as np
from essentia.streaming import *
from essentia import Pool, run

# model parameters
inputLayerED = "serving_default_melspectrogram"
outputLayerED = "PartitionedCall:1"

inputLayer = "model/Placeholder"
outputLayer = "model/Softmax"

embeddingModelName = "discogs-effnet-bs64-1.pb"
predictionModelName = "danceability-discogs-effnet-1.pb"

# with the current configuration, we need > 64 seconds to make a prediction
sampleRate = 16000
buffer = np.zeros(sampleRate * 65, dtype="float32")

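vimp = VectorInput(buffer)

# Embedding model (arguments assumed from the variables defined above)
tfpED = TensorflowPredictEffnetDiscogs(
    graphFilename=embeddingModelName,
    input=inputLayerED,
    output=outputLayerED,
)

# Classification model (arguments assumed from the variables defined above)
model = TensorflowPredict2D(
    graphFilename=predictionModelName,
    input=inputLayer,
    output=outputLayer,
)

pool = Pool()

# Connect the network: audio -> embeddings -> predictions -> pool
vimp.data >> tfpED.signal
tfpED.predictions >> model.features
model.predictions >> (pool, outputLayer)

# Run the network
run(vimp)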

The main obstacle to running EffnetDiscogs in real time is that, right now, we only have versions of the model that require a fixed batch size of 64 (discogs-effnet-bs64-1.pb). This means you need enough audio to generate 64 patches of ~2 seconds each before you get a prediction.
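
As a rough sketch of where the "more than 64 seconds" figure comes from: consecutive patches overlap, so 64 patches need far less than 64 × 2 seconds of audio. The hop size, patch size and patch hop size below are assumptions (they are not stated in this thread), so treat the result as an estimate rather than an exact requirement.

```python
# Assumed front-end settings; only the sample rate appears in the snippet above.
SAMPLE_RATE = 16000   # Hz
HOP_SIZE = 256        # samples between consecutive mel-spectrogram frames (assumed)
PATCH_SIZE = 128      # frames per patch, i.e. ~2 seconds (assumed)
PATCH_HOP_SIZE = 62   # frames between the starts of consecutive patches (assumed)


def seconds_needed(batch_size: int) -> float:
    """Approximate audio duration needed to fill one batch of patches."""
    # The first patch needs PATCH_SIZE frames; each further patch adds PATCH_HOP_SIZE frames.
    frames = PATCH_SIZE + (batch_size - 1) * PATCH_HOP_SIZE
    return frames * HOP_SIZE / SAMPLE_RATE


for batch_size in (64, 1):
    print(f"batch size {batch_size}: ~{seconds_needed(batch_size):.2f} s of audio")
```

Under these assumptions a batch of 64 patches needs about 64.5 seconds of audio, while a single patch needs about 2 seconds.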

Please, let me know if the current model is enough for your application or if you would like to have a bs1 version, suitable for close-to-real-time operation.

perli99 commented 8 months ago

Hey @palonso thank you for your fast reply.

So, if I understand correctly, this model needs 64 patches of ~2 seconds each, so ~128 seconds before it can make a prediction?

I would like a solution whose latency is ideally not much more than 1 second. For my Bachelor thesis I am building a robot that "listens" to live music, extracts features, and then paints a picture based on those features. I want one of those features to be the genre (also mood, energy...), and I figured the Discogs model would be a good fit: if I combine some of the genres into broader categories (Rock, Jazz...), I should get pretty good accuracy, and the live implementation shown here seemed to be close to real time.
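
Combining fine-grained genres into broader categories could look roughly like this. It is a minimal sketch that assumes the genre model's labels follow a "Genre---Style" naming scheme; the labels and activation values are made up for illustration.

```python
from collections import defaultdict

# Hypothetical labels and activations, for illustration only.
labels = ["Rock---Alternative Rock", "Rock---Punk", "Jazz---Bebop", "Jazz---Swing"]
activations = [0.40, 0.25, 0.20, 0.15]


def aggregate_by_parent(labels, activations, separator="---"):
    """Sum style-level activations into their parent genre."""
    totals = defaultdict(float)
    for label, activation in zip(labels, activations):
        parent = label.split(separator)[0]  # text before the separator is the broad genre
        totals[parent] += activation
    return dict(totals)


broad = aggregate_by_parent(labels, activations)
top_genre = max(broad, key=broad.get)
print(broad, top_genre)
```

Summing is only one option. If the model's outputs are independent per-label activations rather than a distribution, taking the maximum per parent genre may be a better fit.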

So yes, if the bs1 version is faster, then that one is probably the right one for me. Where do I get it? Or would you advise using one of the other models altogether? I liked EffnetDiscogs because I would be able to use different classification heads, and the large amount of underlying training data seems nice.

Thank you for your help already, I really appreciate it.

Vincent

palonso commented 4 months ago

Sorry for forgetting about this!

We have uploaded a version of discogs-effnet that operates with batchSize=1, suitable for low latency applications.

This is how to adapt the previous example for this case:

import os

os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"

import numpy as np
from essentia.streaming import *
from essentia import Pool, run

# model parameters
inputLayerED = "serving_default_melspectrogram"
outputLayerED = "PartitionedCall:1"

inputLayer = "model/Placeholder"
outputLayer = "model/Softmax"

embeddingModelName = "discogs-effnet-bs1-1.pb"
predictionModelName = "danceability-discogs-effnet-1.pb"

sampleRate = 16000
buffer = np.zeros(sampleRate * 3, dtype="float32")

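vimp = VectorInput(buffer)

# Embedding model (arguments assumed from the variables defined above)
tfpED = TensorflowPredictEffnetDiscogs(
    graphFilename=embeddingModelName,
    input=inputLayerED,
    output=outputLayerED,
)

# Classification model (arguments assumed from the variables defined above)
model = TensorflowPredict2D(
    graphFilename=predictionModelName,
    input=inputLayer,
    output=outputLayer,
)

pool = Pool()

# Connect the network: audio -> embeddings -> predictions -> pool
vimp.data >> tfpED.signal
tfpED.predictions >> model.features
model.predictions >> (pool, outputLayer)

# Run the network
run(vimp)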
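
For a live input, one possible approach (an assumption, not something confirmed in this thread) is to keep a rolling window holding the most recent few seconds of audio and to use a snapshot of it as the buffer each time a new chunk arrives. The sketch below shows only the buffering part in plain Python; the one-second chunk length and the `push_chunk` helper are hypothetical.

```python
from collections import deque

SAMPLE_RATE = 16000
WINDOW_SECONDS = 3   # same length as the buffer in the example above
CHUNK_SECONDS = 1    # hypothetical: how often new audio arrives

WINDOW_SAMPLES = SAMPLE_RATE * WINDOW_SECONDS

# Ring buffer that always holds the most recent WINDOW_SECONDS of audio,
# initialised with silence.
window = deque([0.0] * WINDOW_SAMPLES, maxlen=WINDOW_SAMPLES)


def push_chunk(chunk):
    """Append new samples; the oldest samples drop off the left end."""
    window.extend(chunk)
    return list(window)  # snapshot of the latest WINDOW_SECONDS of audio


# Simulate three one-second chunks with distinct constant values.
for value in (1.0, 2.0, 3.0):
    snapshot = push_chunk([value] * (SAMPLE_RATE * CHUNK_SECONDS))

print(len(snapshot), snapshot[0], snapshot[-1])
```

Each snapshot would still need converting to a float32 NumPy array before it can replace `buffer` in the example above.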

