MTG / essentia

C++ library for audio and music analysis, description and synthesis, including Python bindings
http://essentia.upf.edu
GNU Affero General Public License v3.0

Problem connecting the Pipeline for DiscogsEffnet ('StreamingAlgo' object has no attribute 'poolIn') #1392

Closed perli99 closed 4 months ago

perli99 commented 8 months ago

Hello, I tried to follow the same procedure as https://github.com/palonso/mtg-general-meeting-03-2020-essentia-tensorflow/blob/master/demo-realtime-essentia-tensorflow.ipynb for a live implementation of DiscogsEffnet, just swapping the ML model for the combination of the Discogs embedding model and header, but I stumbled on this problem while connecting the pipeline:


vimp = VectorInput(buffer)
fc = FrameCutter(frameSize=frameSize, hopSize=hopSize)
tim = TensorflowInputMusiCNN()
vtt = VectorRealToTensor(shape=[1, 1, patchSize, numberBands], lastPatchMode='discard')
ttp = TensorToPool(namespace=inputLayer)

# Embedding model
tfpED = TensorflowPredictEffnetDiscogs(graphFilename=embeddingModelName, output='PartitionedCall:1')

# Genre prediction model
model = TensorflowPredict2D(graphFilename=predictionModelName, input=inputLayer, output=outputLayer)

vimp.data >> fc.signal
fc.frame >> tim.frame
tim.bands >> vtt.frame
tim.bands >> (pool, 'melbands')
vtt.tensor >> ttp.tensor
ttp.pool >> tfpED.poolIn
tfpED.poolOut >> (pool, 'embeddings')

embeddings_tensor = PoolToTensor(namespace='embeddings')
(pool, 'embeddings') >> embeddings_tensor.tensor
embeddings_tensor.tensor >> model.tensorIn
model.tensorOut >> (pool, outputLayer)

Which produced the error message:

Cell In[43], line 7
      5 tim.bands >> (pool, 'melbands')
      6 vtt.tensor >> ttp.tensor
----> 7 ttp.pool >> tfpED.poolIn
      8 tfpED.poolOut >> (pool, 'embeddings')
     10 embeddings_tensor = PoolToTensor(namespace='embeddings')

AttributeError: 'StreamingAlgo' object has no attribute 'poolIn'

I searched the source on the GitHub page, and poolIn seems to appear in /src/algorithms/machinelearning/tensorflowpredicteffnetdiscogs.cpp, but inside an already pre-built algorithm:

  AlgorithmFactory& factory = AlgorithmFactory::instance();

  _frameCutter            = factory.create("FrameCutter");
  _tensorflowInputMusiCNN = factory.create("TensorflowInputMusiCNN");
  _vectorRealToTensor     = factory.create("VectorRealToTensor");
  _tensorToPool           = factory.create("TensorToPool");
  _tensorflowPredict      = factory.create("TensorflowPredict");
  _poolToTensor           = factory.create("PoolToTensor");
  _tensorToVectorReal     = factory.create("TensorToVectorReal");

  _tensorflowInputMusiCNN->output("bands").setBufferType(BufferUsage::forMultipleFrames);

  _signal                                  >> _frameCutter->input("signal");
  _frameCutter->output("frame")            >> _tensorflowInputMusiCNN->input("frame");
  _tensorflowInputMusiCNN->output("bands") >> _vectorRealToTensor->input("frame");
  _vectorRealToTensor->output("tensor")    >> _tensorToPool->input("tensor");
  _tensorToPool->output("pool")            >> _tensorflowPredict->input("poolIn");
  _tensorflowPredict->output("poolOut")    >> _poolToTensor->input("pool");
  _poolToTensor->output("tensor")          >> _tensorToVectorReal->input("tensor");

  attach(_tensorToVectorReal->output("frame"), _predictions);

  _network = new scheduler::Network(_frameCutter);
}

So do I not have to build the pipeline myself, and can I just connect the embedding model and the header? If so, how do I do that? Sorry if this is rather obvious as well. Kind regards, and I hope you had a good start into the new year :)

palonso commented 8 months ago

Hi @perli99, as you mention, the TensorflowPredict"Model" algorithms are wrappers containing all steps of the pipeline inside. This is how to make it work in streaming mode:

import numpy as np
from essentia.streaming import *
from essentia import Pool, run

# model parameters
inputLayerED = "serving_default_melspectrogram"
outputLayerED = "PartitionedCall:1"

inputLayer = "model/Placeholder"
outputLayer = "model/Softmax"

embeddingModelName = "discogs-effnet-bs64-1.pb"
predictionModelName = "danceability-discogs-effnet-1.pb"

# with the current configuration, we need > 64 seconds to make a prediction
sampleRate = 16000
buffer = np.zeros(sampleRate * 65, dtype="float32")

vimp = VectorInput(buffer)
# Embedding model
tfpED = TensorflowPredictEffnetDiscogs(
    graphFilename=embeddingModelName,
    input=inputLayerED,
    output=outputLayerED,
)
model = TensorflowPredict2D(
    graphFilename=predictionModelName,
    input=inputLayer,
    output=outputLayer,
    dimensions=1280,
)

pool = Pool()

vimp.data >> tfpED.signal
tfpED.predictions >> model.features
model.predictions >> (pool, outputLayer)

run(vimp)

print(pool[outputLayer].shape)

The main obstacle to making EffnetDiscogs work in real time is that, right now, we only have versions of the model that require a fixed batch size of 64 (discogs-effnet-bs64-1.pb). This means that you need enough audio to generate 64 patches of ~2 seconds each in order to get a prediction.
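As a rough check of these figures, the amount of audio needed to fill one batch can be estimated from the patch geometry. The sketch below assumes 128-frame patches, a patch hop of 62 frames, and a 256-sample frame hop at 16 kHz. None of these values are stated in this thread, so they should be verified against the algorithm documentation. The helper name `min_audio_seconds` is hypothetical, and edge effects of the frame cutter are ignored.

```python
def min_audio_seconds(batch_size, patch_size=128, patch_hop=62,
                      hop_size=256, sample_rate=16000):
    """Approximate seconds of audio needed to produce `batch_size` patches.

    All default values are assumptions, not taken from this thread.
    """
    # The first patch needs `patch_size` frames; every further patch
    # starts `patch_hop` frames after the previous one.
    frames = patch_size + (batch_size - 1) * patch_hop
    # Each frame advances the signal by `hop_size` samples.
    return frames * hop_size / sample_rate


print(min_audio_seconds(64))  # batch of 64 overlapping patches
print(min_audio_seconds(1))   # a single patch
```

Under these assumptions, a batch of 64 needs about 64.5 seconds of audio, which agrees with the "> 64 seconds" comment in the example, while a single patch needs about 2 seconds.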

Please, let me know if the current model is enough for your application or if you would like to have a bs1 version, suitable for close-to-real-time operation.

perli99 commented 8 months ago

Hey @palonso thank you for your fast reply.

So if I understand correctly, this model needs 64 batches of ~2 seconds each, so ~128 seconds until it can make a prediction?

I would like to have a solution where the latency is ideally not much more than 1 second. For my Bachelor thesis I am building a robot that "listens" to live music, extracts features, and then paints a picture based on those features. I want one of those features to be the genre (also mood, energy...), and I figured the Discogs model would be a good fit, because if I combine some of the genres into broader categories (Rock, Jazz...) I would get pretty good accuracy, and the live implementation shown here https://www.youtube.com/watch?v=Cp0zkojT9RQ seemed to be close to real time.
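Combining fine-grained genre labels into broader categories could be sketched as below. This is only an illustration: it assumes the labels have the form "Parent---Style" (for example "Rock---Punk"), which is not confirmed anywhere in this thread, and the function name `aggregate_by_parent` and the example values are made up.

```python
from collections import defaultdict


def aggregate_by_parent(labels, activations, reducer=max, separator="---"):
    """Group per-label activations by the part of the label before `separator`.

    `reducer` decides how activations of one group are combined
    (for example `max` or `sum`).
    """
    groups = defaultdict(list)
    for label, activation in zip(labels, activations):
        parent = label.split(separator)[0]
        groups[parent].append(float(activation))
    return {parent: reducer(values) for parent, values in groups.items()}


# Made-up labels and activations, for illustration only.
labels = ["Rock---Alternative Rock", "Rock---Punk", "Jazz---Bebop"]
activations = [0.2, 0.3, 0.1]
broad = aggregate_by_parent(labels, activations)
print(max(broad, key=broad.get))
```

Whether `max` or `sum` is the better reducer depends on how the prediction head scores its labels: summing only makes sense if the activations form one distribution over all labels, while `max` is the safer choice for independent per-label scores.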

So yes, if the bs1 version is faster, then that one is probably the right one for me. Where can I get it? Or would you advise that I use one of the other models altogether? I liked EffnetDiscogs because I would be able to use different headers, and the large amount of underlying training data also seems nice.

Thank you for your help already, I really appreciate it.

Vincent

palonso commented 4 months ago

Sorry for forgetting about this!

We have uploaded a version of discogs-effnet that operates with batchSize=1, suitable for low latency applications.

This is how to adapt the previous example for this case:

import os

os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"

import numpy as np
from essentia.streaming import *
from essentia import Pool, run

# model parameters
inputLayerED = "serving_default_melspectrogram"
outputLayerED = "PartitionedCall:1"

inputLayer = "model/Placeholder"
outputLayer = "model/Softmax"

embeddingModelName = "discogs-effnet-bs1-1.pb"
predictionModelName = "danceability-discogs-effnet-1.pb"

sampleRate = 16000
buffer = np.zeros(sampleRate * 3, dtype="float32")

vimp = VectorInput(buffer)
# Embedding model
tfpED = TensorflowPredictEffnetDiscogs(
    graphFilename=embeddingModelName,
    input=inputLayerED,
    output=outputLayerED,
    batchSize=1,
)
model = TensorflowPredict2D(
    graphFilename=predictionModelName,
    input=inputLayer,
    output=outputLayer,
    dimensions=1280,
)

pool = Pool()

vimp.data >> tfpED.signal
tfpED.predictions >> model.features
model.predictions >> (pool, outputLayer)

run(vimp)

print(pool[outputLayer].shape)
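For live input, one possible way to keep such a fixed-size buffer up to date is a rolling window that always holds the most recent samples. The sketch below is not part of Essentia: the class name `RollingBuffer` is hypothetical, and how often to re-run the network on the updated buffer is left open.

```python
import numpy as np


class RollingBuffer:
    """Holds the most recent `size` samples of a mono float32 stream."""

    def __init__(self, size):
        self.data = np.zeros(size, dtype="float32")

    def push(self, chunk):
        """Append `chunk`, dropping the oldest samples, and return the buffer."""
        chunk = np.asarray(chunk, dtype="float32")
        n = len(chunk)
        if n == 0:
            return self.data
        if n >= len(self.data):
            # The chunk alone fills the buffer: keep only its tail.
            self.data[:] = chunk[-len(self.data):]
        else:
            # Shift old samples to the left and write the new ones at the end.
            self.data = np.roll(self.data, -n)
            self.data[-n:] = chunk
        return self.data


sampleRate = 16000
rolling = RollingBuffer(sampleRate * 3)      # same length as the buffer above
rolling.push(np.ones(1024, dtype="float32")) # e.g. one chunk from a sound card
print(rolling.data.shape)
```

The array in `rolling.data` has the same dtype and length as `buffer` in the example, so it could take its place as the input to VectorInput.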