asus4 / tf-lite-unity-sample

TensorFlow Lite Samples on Unity

SetInputTensorData fails with "Exception: TensorFlowLite operation failed" although seemingly correct input array #292

Closed achimmihca closed 1 year ago

achimmihca commented 1 year ago

**Environment:**

**Describe the bug**
I get an error with seemingly correct inputs:

```
Exception: TensorFlowLite operation failed.
TensorFlowLite.Interpreter.ThrowIfError (TensorFlowLite.Interpreter+Status status) (at ./Packages/com.github.asus4.tflite/Runtime/Interpreter.cs:221)
TensorFlowLite.Interpreter.SetInputTensorData (System.Int32 inputTensorIndex, System.Array inputTensorData) (at ./Packages/com.github.asus4.tflite/Runtime/Interpreter.cs:122)
PitchDetector.DummyInvoke () (at Assets/Samples/SpicePitchDetection/PitchDetector.cs:36)
SpicePitchDetectionDemo.Update () (at Assets/Samples/SpicePitchDetection/SpicePitchDetectionDemo.cs:62)
```

I am trying to run the SPICE model for pitch detection. So far only with dummy samples (i.e. a zero-filled float array):

```csharp
public class PitchDetector : IDisposable
{
    private readonly Interpreter interpreter;

    public PitchDetector(string modelPath, InterpreterOptions options)
    {
        try
        {
            interpreter = new Interpreter(FileUtil.LoadFile(modelPath), options);
        }
        catch (System.Exception)
        {
            interpreter?.Dispose();
            // Rethrow without resetting the stack trace
            throw;
        }

        interpreter.LogIOInfo();
        interpreter.AllocateTensors();
    }

    public void Dispose()
    {
        interpreter?.Dispose();
    }

    public void DummyInvoke()
    {
        float[] dummySamples = new float[16000];
        // CRASH in SetInputTensorData
        interpreter.SetInputTensorData(0, dummySamples);
        interpreter.Invoke();
    }
}
```

The model is loaded correctly and I get the following information from interpreter.LogIOInfo():

```
Version: 2.11.0

Input [0]: name: input_audio_samples, type: Float32, dimensions: [1], quantizationParams: {scale: 0 zeroPoint: 0}

Output [0]: name: pitch, type: Float32, dimensions: [1], quantizationParams: {scale: 0 zeroPoint: 0}
Output [1]: name: uncertainty, type: Float32, dimensions: [1], quantizationParams: {scale: 0 zeroPoint: 0}
```

However, it crashes in SetInputTensorData although the input data seems to be correct (i.e. a float array of samples).

Note that this is my first attempt at using a TensorFlow model so I might be missing something obvious.

**To Reproduce**
Steps to reproduce the behavior:

  1. Download SPICE model from here
  2. Add above PitchDetector to run the model
  3. See error

**Expected behavior**
Setting the input in the correct format should not lead to a crash.

Questions

asus4 commented 1 year ago

The SPICE model looks interesting and might be a good starting point for the audio demo in this repo. I will look into this a bit.

achimmihca commented 1 year ago

Any update on this? I tried to convert the model to ONNX to use it with Unity Barracuda but the conversion fails with

Current implementation of RFFT or FFT only allows ComplexAbs as consumer not {'Real', 'Imag'}

I guess the model might be using some newer TensorFlow features or something.

paradigmn commented 1 year ago

Hi @achimmihca , maybe I can shed some light on this. A few years ago I was going to use the Spice model for my UltrastarPitch software. I ran into problems similar to the ones you describe, which caused me to abandon the idea.

First of all, Spice was developed in TensorFlow 1. At that time, TF and Keras were two separate entities. Therefore, Spice is not implemented as a Keras model (as is the default in TF2), but rather as a functional graph. Such a graph is essentially a black box that transforms an input variable into an output variable. A model, on the other hand, is a concatenation of tensor operations, each of which expects its own batch dimension. This is where the TFLite model conversion fails: Spice does not have a dedicated batch dimension, but expects a vector of arbitrary size. The TFLite converter takes this vector dimension and hard-codes it to one.
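That hard-coded dimension would also explain the exception at the top of the thread: the native `TfLiteTensorCopyFromBuffer` call behind `SetInputTensorData` rejects any buffer whose byte size differs from the tensor's. A rough Python sketch of that size check (my own illustrative code, not the actual TFLite source):

```python
import numpy as np

def check_copy_from_buffer(tensor_dims, dtype, input_array):
    """Mimics the validation TfLiteTensorCopyFromBuffer performs:
    the source buffer must match the tensor's byte size exactly."""
    expected_bytes = int(np.prod(tensor_dims)) * np.dtype(dtype).itemsize
    provided_bytes = input_array.nbytes
    if provided_bytes != expected_bytes:
        raise ValueError(
            f"size mismatch: tensor expects {expected_bytes} bytes, "
            f"buffer has {provided_bytes} bytes")

# The logged input tensor has dimensions [1], i.e. one float (4 bytes),
# while DummyInvoke passes 16000 floats (64000 bytes) -> the copy fails.
dummy_samples = np.zeros(16000, dtype=np.float32)
try:
    check_copy_from_buffer([1], np.float32, dummy_samples)
except ValueError as e:
    print(e)
```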

The ONNX bug is most likely of a similar nature. The ONNX standard only recently implemented complex tensors and thus FFT-based operations. My guess is that the TF1 FFT operation does not conform to the TF2 format that the ONNX converter expects.

Now the question is how to solve this problem. If there is a reference implementation somewhere, it could be ported to a newer TF format, but I have not found anything so far. If you just want to run the hub model, starting a TF Serving session is probably the most promising approach. Just keep in mind that Spice was trained on vocal stems (according to the paper), so I would not expect good performance on music data.

Some time ago I tried to port UltrastarPitch to C# to make it available in USPlay. There were a few things that made me reconsider. First of all, you have to train your AI model on a specific sample rate; 16 kHz is usually a good value. So you have to resample all the audio before you feed it to the model. Such a feature was not available back then (maybe it is now?). Also, vector calculations are computed really inefficiently in C#, so all pre- and post-processing was incredibly slow while the PC worked at maximum capacity.
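For reference, the resampling step itself is only a few lines; here is a naive linear-interpolation version in NumPy (my own illustrative code; a windowed-sinc or polyphase resampler would alias less):

```python
import numpy as np

def resample_linear(samples, src_rate, dst_rate=16000):
    """Resample a 1-D signal to dst_rate via linear interpolation."""
    duration = len(samples) / src_rate
    n_out = int(round(duration * dst_rate))
    src_times = np.arange(len(samples)) / src_rate
    dst_times = np.arange(n_out) / dst_rate
    return np.interp(dst_times, src_times, samples).astype(np.float32)

# One second of 44.1 kHz audio becomes 16000 samples.
audio_44k = np.zeros(44100, dtype=np.float32)
resampled = resample_linear(audio_44k, 44100)
print(len(resampled))  # 16000
```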

I hope this information was helpful and not demoralizing. If you have specific questions, feel free to contact me on discord. Unfortunately, I no longer work in AI, so I am not up to date on this.

achimmihca commented 1 year ago

Thanks, this explains the difficulties that I face with SPICE.

AI model on a specific sample rate

Yes, performance issues due to resampling were also one of my concerns. However, at least for the song editor of USPlay it may be useful, because the editor does not require real-time performance, and it might give more accurate results on extracted vocals compared to the dynamic wavelet algorithm.

I wonder whether there are more up-to-date models than SPICE. I think the approach described in the paper is neat, so others may have repeated it.

paradigmn commented 1 year ago

As far as I know, not much work has been done on the topic of pitch detection; neither on GitHub nor in research have I seen efforts other than Spice and my own attempt. If you want, you can make use of the ONNX PitchNet model I trained a couple of years ago. It relies on a lightweight CNN architecture and was trained on the MLP Karaoke database. However, the pre- and post-processing must be ported to C#. I tried that (more or less successfully) with an even older version of the model. Some modules, such as the optimized FFT routine, could still be of use.
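As an aside, the spectral pre-processing mentioned here typically boils down to framing the signal and taking windowed FFT magnitudes. A small NumPy sketch of that common front end (illustrative only; this is not PitchNet's actual pre-processing):

```python
import numpy as np

def frame_spectra(samples, frame_size=1024, hop=256):
    """Hann-windowed magnitude spectra per frame: a common front end
    for CNN-based pitch detectors."""
    window = np.hanning(frame_size)
    n_frames = 1 + (len(samples) - frame_size) // hop
    spectra = np.empty((n_frames, frame_size // 2 + 1), dtype=np.float32)
    for i in range(n_frames):
        frame = samples[i * hop:i * hop + frame_size] * window
        spectra[i] = np.abs(np.fft.rfft(frame))
    return spectra

# A 440 Hz tone at 16 kHz should peak near bin 440 * 1024 / 16000 ~ 28.
tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
spec = frame_spectra(tone)
peak_bin = int(spec[0].argmax())
print(peak_bin * 16000 / 1024)
```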

achimmihca commented 1 year ago

I wonder whether there are more up-to-date models

Spotify did create a pitch detection tool that outputs MIDI files. It is called Basic Pitch.

However, I am missing an option to allow only a single note at a time. The model seems to assume that there can always be polyphony, which is not optimal when analyzing the pitch of a single vocal track.

Anyway, nice to see that there is still work done on this topic.

The paper on Basic Pitch by Spotify is interesting and provides good pointers towards other automatic music transcription (AMT) systems.

From the conclusion:

NMP (i.e. the model behind basic pitch) achieves state-of-the-art results on GuitarSet. It however did not outperform the instrument-specific models for piano and vocals.

I am interested in the instrument-specific model for vocals:

Vocano [9] is a monophonic vocal transcription method which first performs vocal source separation, then applies a pre-trained pitch extractor followed by a note segmentation neural network, trained on solo vocal data.

[9] J.-Y. Hsu and L. Su, "VOCANO: A note transcription framework for singing voice in polyphonic music," in Proc. ISMIR, 2021.

The VOCANO paper used Patch-CNN for pitch detection.


Overall, the SPICE model does not seem to be state-of-the-art anymore. Thus, I will close this issue.