apple / turicreate

Turi Create simplifies the development of custom machine learning models.
BSD 3-Clause "New" or "Revised" License

Sound Classifier Accuracy poor on speaker recognition, unlike CreateML #2942

Closed MatthewWaller closed 4 years ago

MatthewWaller commented 4 years ago

I used LibriSpeech to classify the speakers. I broke up my data and trained it with CreateML, and with 147 classes, the outcome is excellent, or at least it claims 98% on testing.

However, I would like to make the model updatable, and this is something I can't do with CreateML since it uses a GLM at the end.

So I tried using the TC sound classifier since it has a neural network classifier at the end with updatable dense layers. I went through the sample project in the documentation and everything worked fine with TC's sound classifier. Then I tried to use my own data, the same I used with CreateML, and it never learns at all.
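For reference, the basic Turi Create flow from the docs looks roughly like this; a minimal sketch where the paths and the label-derivation lambda are placeholders, not the exact script used here:

import turicreate as tc

# tc.load_audio returns an SFrame with 'path' and 'audio' columns
data = tc.load_audio('./Training/')

# hypothetical: derive the speaker label from the enclosing folder name
data['category'] = data['path'].apply(lambda p: p.split('/')[-2])

train, test = data.random_split(0.9)
model = tc.sound_classifier.create(train, target='category', feature='audio')
print(model.evaluate(test))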

Is there something I might be missing? Is CreateML's approach very different from TC's? I've looked at both models with coremltools and see that they're both pipeline classifiers: they start with a preprocessing model, then a feature extractor, and then, for TC, the neural network, and for CreateML, the GLM.
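For anyone wanting to reproduce that inspection, a minimal coremltools sketch (the model filename is hypothetical):

import coremltools as ct

spec = ct.utils.load_spec('SoundClassifier.mlmodel')  # hypothetical filename
# both exported models are pipeline classifiers; print each stage's type
for stage in spec.pipelineClassifier.pipeline.models:
    print(stage.WhichOneof('Type'))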

Any recommendations on getting TC's sound classifier to work with my data? They're 16 kHz WAV files with silence removed, divided into 1-second clips.

Thanks!

srikris commented 4 years ago

To clarify, are you getting issues with test set accuracy after training the model using the TC sound classifier?

Can you share the logs with training and test accuracy?

MatthewWaller commented 4 years ago

That is correct, I'm having issues with accuracy in training, validation, and testing.

Will do. I'm running a fresh pass to make sure I'm not going crazy, but I'll get the logs for TC as well as a screenshot of what I'm seeing for CreateML.

I can also share the data. It's the publicly available LibriSpeech corpus; I've just given it light processing and splitting, and generated a CSV.

MatthewWaller commented 4 years ago

WARNING:root:TensorFlow version 2.0.0 detected. Last version known to be fully compatible is 1.14.0 .
WARNING:tensorflow:From /Users/matthewwaller/opt/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/compat/v2_compat.py:65: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
Creating a validation set from 5 percent of training data. This may take a while.
    You can set ``validation_set=None`` to disable validation tracking.

Preprocessing audio data -
Preprocessed 6839 of 35167 examples
Preprocessed 20463 of 35167 examples
Preprocessed 34034 of 35167 examples
Preprocessed 35167 of 35167 examples

Extracting deep features -
Extracted 6423 of 35167
Extracted 12845 of 35167
Extracted 19346 of 35167
Extracted 25844 of 35167
Extracted 32415 of 35167
Extracted 35167 of 35167

Preparing validation set

Training a custom neural network -
+-------------------------+-------------------------+-------------------------+-------------------------+
| Iteration               | Training Accuracy       | Validation Accuracy (%) | Elapsed Time            |
+-------------------------+-------------------------+-------------------------+-------------------------+
| 1                       | 0.009                   | 0.010                   | 272.499                 |
+-------------------------+-------------------------+-------------------------+-------------------------+
| 2                       | 0.009                   | 0.010                   | 278.563                 |
+-------------------------+-------------------------+-------------------------+-------------------------+
| 3                       | 0.009                   | 0.010                   | 284.493                 |
+-------------------------+-------------------------+-------------------------+-------------------------+
| 4                       | 0.009                   | 0.010                   | 289.823                 |
+-------------------------+-------------------------+-------------------------+-------------------------+
| 5                       | 0.009                   | 0.010                   | 295.210                 |
+-------------------------+-------------------------+-------------------------+-------------------------+
| 6                       | 0.009                   | 0.010                   | 300.643                 |
+-------------------------+-------------------------+-------------------------+-------------------------+
| 7                       | 0.009                   | 0.010                   | 305.964                 |
+-------------------------+-------------------------+-------------------------+-------------------------+
| 8                       | 0.009                   | 0.010                   | 311.301                 |
+-------------------------+-------------------------+-------------------------+-------------------------+
| 9                       | 0.009                   | 0.010                   | 316.884                 |
+-------------------------+-------------------------+-------------------------+-------------------------+
| 10                      | 0.009                   | 0.010                   | 322.197                 |
+-------------------------+-------------------------+-------------------------+-------------------------+
| 11                      | 0.009                   | 0.010                   | 327.526                 |
+-------------------------+-------------------------+-------------------------+-------------------------+
| 12                      | 0.009                   | 0.010                   | 332.883                 |
+-------------------------+-------------------------+-------------------------+-------------------------+
| 13                      | 0.009                   | 0.010                   | 338.217                 |
+-------------------------+-------------------------+-------------------------+-------------------------+
| 14                      | 0.009                   | 0.010                   | 343.616                 |
+-------------------------+-------------------------+-------------------------+-------------------------+
| 15                      | 0.009                   | 0.010                   | 348.946                 |
+-------------------------+-------------------------+-------------------------+-------------------------+
| 16                      | 0.009                   | 0.010                   | 354.368                 |
+-------------------------+-------------------------+-------------------------+-------------------------+
| 17                      | 0.009                   | 0.010                   | 359.809                 |
+-------------------------+-------------------------+-------------------------+-------------------------+
| 18                      | 0.009                   | 0.010                   | 365.522                 |
+-------------------------+-------------------------+-------------------------+-------------------------+
| 19                      | 0.009                   | 0.010                   | 371.020                 |
+-------------------------+-------------------------+-------------------------+-------------------------+
| 20                      | 0.009                   | 0.010                   | 376.368                 |
+-------------------------+-------------------------+-------------------------+-------------------------+
| 21                      | 0.009                   | 0.010                   | 381.982                 |
+-------------------------+-------------------------+-------------------------+-------------------------+
| 22                      | 0.009                   | 0.010                   | 387.630                 |
+-------------------------+-------------------------+-------------------------+-------------------------+
| 23                      | 0.009                   | 0.010                   | 393.183                 |
+-------------------------+-------------------------+-------------------------+-------------------------+
| 24                      | 0.009                   | 0.010                   | 398.788                 |
+-------------------------+-------------------------+-------------------------+-------------------------+
| 25                      | 0.009                   | 0.010                   | 404.264                 |
+-------------------------+-------------------------+-------------------------+-------------------------+
{'accuracy': 0.00946372239747634, 'auc': nan, 'precision': 0.00946372239747634, 'recall': 0.006944444444444444, 'f1_score': 0.00013020833333333336, 'log_loss': 4.975145686501957, 'confusion_matrix': Columns:
    target_label    int
    predicted_label int
    count   int

Rows: 144

Data:
+--------------+-----------------+-------+
| target_label | predicted_label | count |
+--------------+-----------------+-------+
|     1688     |       2609      |   21  |
|     1650     |       2609      |   34  |
|     237      |       2609      |   25  |
|     5142     |       2609      |   27  |
|     6267     |       2609      |   32  |
|     1188     |       2609      |   26  |
|     2033     |       2609      |   30  |
|     4294     |       2609      |   29  |
|     3576     |       2609      |   29  |
|     7729     |       2609      |   28  |
+--------------+-----------------+-------+
[144 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns., 'roc_curve': Columns:
    threshold   float
    fpr float
    tpr float
    p   int
    n   int
    class   int

Rows: 14600146

Data:
+-----------+-----+-----+----+------+-------+
| threshold | fpr | tpr | p  |  n   | class |
+-----------+-----+-----+----+------+-------+
|    0.0    | 1.0 | 1.0 | 29 | 4092 |   61  |
|   1e-05   | 1.0 | 1.0 | 29 | 4092 |   61  |
|   2e-05   | 1.0 | 1.0 | 29 | 4092 |   61  |
|   3e-05   | 1.0 | 1.0 | 29 | 4092 |   61  |
|   4e-05   | 1.0 | 1.0 | 29 | 4092 |   61  |
|   5e-05   | 1.0 | 1.0 | 29 | 4092 |   61  |
|   6e-05   | 1.0 | 1.0 | 29 | 4092 |   61  |
|   7e-05   | 1.0 | 1.0 | 29 | 4092 |   61  |
|   8e-05   | 1.0 | 1.0 | 29 | 4092 |   61  |
|   9e-05   | 1.0 | 1.0 | 29 | 4092 |   61  |
+-----------+-----+-----+----+------+-------+
[14600146 rows x 6 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.}
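For context, that dictionary is the kind of result a Turi Create model's evaluate call returns; a minimal sketch, assuming hypothetical names model and test_data from the run above:

metrics = model.evaluate(test_data)
# ~0.0095 accuracy is essentially chance for 147 classes (1/147 ≈ 0.0068)
print(metrics['accuracy'], metrics['log_loss'])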

Below is what I got with the same data from CreateML

(Screenshot: Create ML results on the same data — Screen Shot 2020-01-18 at 2 18 14 PM)

srikris commented 4 years ago

The two implementations are nearly identical apart from the last-layer training, so this should not happen. This is indeed very strange. Is there any way you can share a sample of the data with repro steps so we can test this out and get to the bottom of this?

srikris commented 4 years ago

I noticed that you used LibriSpeech. Let me see if I can repro this.

srikris commented 4 years ago

@MatthewWaller Can you share the exact pre-processing you did on the dataset, just to make sure we can repro exactly what you did?

MatthewWaller commented 4 years ago

@MatthewWaller Can you share the exact pre-processing you did on the dataset, just to make sure we can repro exactly what you did?

Sure thing:

First, there is this Voice Activity Detection class to drop silences:


import numpy

class VoiceActivityDetection:

    def __init__(self):
        self.__step = 10
        self.__buffer_size = 10
        self.__buffer = numpy.array([],dtype=numpy.int16)
        self.__out_buffer = numpy.array([],dtype=numpy.int16)
        self.__n = 0
        self.__VADthd = 0.
        self.__VADn = 0.
        self.__silence_counter = 0

    # Voice Activity Detection
    # Adaptive threshold
    def vad(self, _frame):
        frame = numpy.array(_frame) ** 2.
        result = True
        threshold = 0.1
        thd = numpy.min(frame) + numpy.ptp(frame) * threshold
        self.__VADthd = (self.__VADn * self.__VADthd + thd) / float(self.__VADn + 1.)
        self.__VADn += 1.

        if numpy.mean(frame) <= self.__VADthd:
            self.__silence_counter += 1
        else:
            self.__silence_counter = 0

        if self.__silence_counter > 20:
            result = False
        return result

    # Push new audio samples into the buffer.
    def add_samples(self, data):
        self.__buffer = numpy.append(self.__buffer, data)
        result = len(self.__buffer) >= self.__buffer_size
        # print('__buffer size %i'%self.__buffer.size)
        return result

    # Pull a portion of the buffer to process
    # (pulled samples are deleted after being
    # processed
    def get_frame(self):
        window = self.__buffer[:self.__buffer_size]
        self.__buffer = self.__buffer[self.__step:]
        # print('__buffer size %i'%self.__buffer.size)
        return window

    # Adds new audio samples to the internal
    # buffer and process them
    def process(self, data):
        if self.add_samples(data):
            while len(self.__buffer) >= self.__buffer_size:
                # Framing
                window = self.get_frame()
                # print('window size %i'%window.size)
                if self.vad(window):  # speech frame
                    self.__out_buffer = numpy.append(self.__out_buffer, window)
                # print('__out_buffer size %i'%self.__out_buffer.size)

    def get_voice_samples(self):
        return self.__out_buffer

Then a function we'll use to split the sample values into 1-second intervals:

import itertools

def split_seq(iterable, size):
    it = iter(iterable)
    item = list(itertools.islice(it, size))
    while item:
        yield item
        item = list(itertools.islice(it, size))
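For example, the generator yields fixed-size chunks with a shorter remainder at the end (the main script below keeps only the full-length chunks):

list(split_seq(range(10), 4))   # -> [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
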
import itertools
import os
import scipy.io.wavfile as wf
import numpy
import sox
from pathlib import Path

Now here is where we walk the files, create a new output folder with one subfolder per speaker, and split the audio into one-second files.

devClean_inputPath_wav = "/Users/matthewwaller/Developer/AI/DataStore/Audio/LibriSpeech/dev-clean-wav"
devOther_inputPath_wav = "/Users/matthewwaller/Developer/AI/DataStore/Audio/LibriSpeech/dev-other-wav"
testClean_inputPath_wav = "/Users/matthewwaller/Developer/AI/DataStore/Audio/LibriSpeech/test-clean-wav"
testOther_inputPath_wav = "/Users/matthewwaller/Developer/AI/DataStore/Audio/LibriSpeech/test-other-wav"
trainClean_100_inputPath_wav = "/Users/matthewwaller/Developer/AI/DataStore/Audio/LibriSpeech/train-clean-100-wav"
trainClean_360_inputPath_wav = "/Users/matthewwaller/Developer/AI/DataStore/Audio/LibriSpeech/train-clean-360-wav"
trainOther_500_inputPath_wav = "/Users/matthewwaller/Developer/AI/DataStore/Audio/LibriSpeech/train-other-500-wav"

wav_inputPaths = [devClean_inputPath_wav]

outputPath = "/Users/matthewwaller/Developer/AI/DataStore/Audio/LibriSpeech"

for path in wav_inputPaths:
    lastPathComponent = os.path.basename(os.path.normpath(path))
    newLastPathComponent = "/"+ lastPathComponent + "-1Second"
    newOutputPath = outputPath + newLastPathComponent
    for subdir, dirs, files in os.walk(path):
        for file in files:
            piecesOfFile = file.split("-")
            speakerFolder = piecesOfFile[0]
            if file.endswith('.wav'):
                fileLocation = os.path.join(subdir, file)
                newFileLocation = os.path.join(newOutputPath + "/" + speakerFolder, file)
                path = Path(newFileLocation)
                path.parent.mkdir(parents=True, exist_ok=True) 
                new_path = Path(path.parent.as_posix() + '/' + path.stem)
                wav = wf.read(fileLocation)
                ch = wav[1]
                sr = wav[0]
                vad = VoiceActivityDetection()
                vad.process(ch)
                voice_samples = vad.get_voice_samples()
                splitSamples = list(split_seq(voice_samples, int(sr * 1)))
                for index, splitSilenceBuffers in enumerate(splitSamples):
                    filePath = "%s_%s.wav" % (str(new_path), index)
                    array = numpy.asarray(splitSilenceBuffers)
                    if (len(array) == int(sr * 1)): 
                        wf.write(filePath, sr, array)

And this is where we move 10 percent of the files in each speaker's folder into a testing set.

devClean_inputPath_wav = "/Users/matthewwaller/Developer/AI/DataStore/Audio/LibriSpeech/dev-clean-wav-1Second"
devOther_inputPath_wav = "/Users/matthewwaller/Developer/AI/DataStore/Audio/LibriSpeech/dev-other-wav-1Second"
testClean_inputPath_wav = "/Users/matthewwaller/Developer/AI/DataStore/Audio/LibriSpeech/test-clean-wav-1Second"
testOther_inputPath_wav = "/Users/matthewwaller/Developer/AI/DataStore/Audio/LibriSpeech/test-other-wav-1Second"
trainClean_100_inputPath_wav = "/Users/matthewwaller/Developer/AI/DataStore/Audio/LibriSpeech/train-clean-100-wav-1Second"
trainClean_360_inputPath_wav = "/Users/matthewwaller/Developer/AI/DataStore/Audio/LibriSpeech/train-clean-360-wav-1Second"
trainOther_500_inputPath_wav = "/Users/matthewwaller/Developer/AI/DataStore/Audio/LibriSpeech/train-other-500-wav-1Second"

wav_inputPaths = [devClean_inputPath_wav]

outputPath = "/Users/matthewwaller/Developer/AI/DataStore/Audio/LibriSpeech"

for path in wav_inputPaths:
    lastPathComponent = os.path.basename(os.path.normpath(path))
    newLastPathComponent = "/"+ lastPathComponent + "-testing"
    newOutputPath = outputPath + newLastPathComponent
    for subdir, dirs, files in os.walk(path):
        filesToUse = files[0::10]
        for file in filesToUse:
            piecesOfFile = file.split("-")
            speakerFolder = piecesOfFile[0]
            if file.endswith('.wav'):
                fileLocation = os.path.join(subdir, file)
                newFileLocation = os.path.join(newOutputPath + "/" + speakerFolder, file)
                path = Path(newFileLocation)
                path.parent.mkdir(parents=True, exist_ok=True) 
                os.rename(fileLocation, newFileLocation)

Once you copy the ...1Second and ...1Second-testing folders into Training and Testing folders, you can use this to generate the .csv:

import os, csv

f = open('./DataStore/Audio/DictationSpeakerLabels/Metadata/Dictation.csv', 'w', newline='')  # 'w' (not 'r+') so the file is created and truncated
w = csv.writer(f)
w.writerow(['filename', 'type', 'category'])
for subdir, dirs, files in os.walk('./DataStore/Audio/5SecondFilesForSpeakersLables/Training/'):
    for file in files:
        piecesOfFile = file.split("-")
        speakerFolder = piecesOfFile[0]
        category = speakerFolder + "-dictation"
        w.writerow([file, 'train', category])

for subdir, dirs, files in os.walk('./DataStore/Audio/5SecondFilesForSpeakersLables/Testing/'):
    for file in files:
        piecesOfFile = file.split("-")
        speakerFolder = piecesOfFile[0]
        category = speakerFolder + "-dictation"
        w.writerow([file, 'test', category])
f.close()

MatthewWaller commented 4 years ago

I should note that I changed the clip duration from 1 second to 5 seconds and everything worked great. I was even able to make the model updatable, import it into my project, and let the Core ML 3 sound analysis framework work its magic and give me accurate times for when each re-trained speaker was speaking.

srikris commented 4 years ago

@MatthewWaller Great! I take it the issue was that 1-second audio clips were not long enough, and after making them 5 seconds it became an easier problem. Was this change in your pre-processing script?

MatthewWaller commented 4 years ago

The pre-processing I posted was for 1 second. I had to change it in a couple of places for it to be 5 seconds. However, CreateML had no problem with the 1-second files. Strange.

TobyRoseman commented 4 years ago

This is likely related to #1712.

@MatthewWaller - are you sure the sequences were exactly one second long? It sounds like they may actually be less than 975 milliseconds.
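A quick way to check, using the same scipy API as the preprocessing script above (the filename here is hypothetical):

import scipy.io.wavfile as wf

sr, samples = wf.read('19-198-0000_0.wav')  # hypothetical clip
print(len(samples) / sr)  # should be >= ~0.975 s for the feature extractor to emit a frame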

MatthewWaller commented 4 years ago

I’m pretty sure it’s at 1 second. At least that’s what the preprocessing code I wrote above does.

TobyRoseman commented 4 years ago

I've run your code on a small subset of the data. It does seem to be generating audio of one second in length. So I don't think it's related to #1712.

It might be related to #2633. It seems that with your code, data from the same speaker will all be in one continuous block. This would likely cause most mini-batches to contain examples from only one person/label.

You could try extracting the deep features yourself and shuffling the data yourself:

data['deep_features'] = tc.sound_classifier.get_deep_features(data['audio'])
data = data.shuffle()

Then pass in feature='deep_features' to tc.sound_classifier.create.
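Putting that together, a minimal sketch; the paths and label derivation are hypothetical, and SFrame.shuffle is used as suggested above:

import turicreate as tc

data = tc.load_audio('./Training/')                                 # 'path' and 'audio' columns
data['category'] = data['path'].apply(lambda p: p.split('/')[-2])   # hypothetical labeling

# extract deep features once, shuffle so mini-batches mix speakers, then train
data['deep_features'] = tc.sound_classifier.get_deep_features(data['audio'])
data = data.shuffle()
model = tc.sound_classifier.create(data, target='category', feature='deep_features')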

ahujamanish commented 1 year ago

hi @MatthewWaller, how did you create an updatable Sound Classifier model using Turi Create? I trained a model, and the generated model doesn't seem to be updatable. I followed the guidelines provided here. Do we have to make any additional config changes to ensure that the generated model is updatable? Thanks 🙏🏽
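One route people take with coremltools 3 is to mark the final dense layers updatable and attach a loss and optimizer. A rough sketch, assuming a plain (non-pipeline) neural network classifier; the filename, layer name, and output name are all hypothetical, so check the coremltools updatable-model examples for the exact steps:

import coremltools
from coremltools.models.neural_network import SgdParams

spec = coremltools.utils.load_spec('SoundClassifier.mlmodel')        # hypothetical filename
builder = coremltools.models.neural_network.NeuralNetworkBuilder(spec=spec)

builder.make_updatable(['dense_1'])                                  # hypothetical layer name
builder.set_categorical_cross_entropy_loss(name='loss', input='categoryProbability')  # hypothetical output
builder.set_sgd_optimizer(SgdParams(lr=0.01, batch=8))
builder.set_epochs(10)

coremltools.utils.save_spec(builder.spec, 'SoundClassifierUpdatable.mlmodel')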