Closed MatthewWaller closed 4 years ago
To clarify, are you getting issues with test set accuracy after training the model using the TC sound classifier?
Can you share the logs with training and test accuracy?
That is correct, I'm having issues with accuracy in training, validation, and testing.
Will do. I'm running a fresh pass to make sure I'm not going crazy, but I'll get the logs for TC as well as a screenshot of what I'm seeing for CreateML.
I can also share the data. It's publicly available LibriSpeech, I've just given it a light processing and splitting up, and generated a csv.
WARNING:root:TensorFlow version 2.0.0 detected. Last version known to be fully compatible is 1.14.0 .
WARNING:tensorflow:From /Users/matthewwaller/opt/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/compat/v2_compat.py:65: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:tensorflow:From /Users/matthewwaller/opt/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/compat/v2_compat.py:65: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
Creating a validation set from 5 percent of training data. This may take a while.
You can set ``validation_set=None`` to disable validation tracking.
Preprocessing audio data -
Preprocessed 6839 of 35167 examples
Preprocessed 20463 of 35167 examples
Preprocessed 34034 of 35167 examples
Preprocessed 35167 of 35167 examples
Extracting deep features -
Extracted 6423 of 35167
Extracted 12845 of 35167
Extracted 19346 of 35167
Extracted 25844 of 35167
Extracted 32415 of 35167
Extracted 35167 of 35167
Preparing validation set
Training a custom neural network -
+-------------------------+-------------------------+-------------------------+-------------------------+
| Iteration | Training Accuracy | Validation Accuracy (%) | Elapsed Time |
+-------------------------+-------------------------+-------------------------+-------------------------+
| 1 | 0.009 | 0.010 | 272.499 |
+-------------------------+-------------------------+-------------------------+-------------------------+
| 2 | 0.009 | 0.010 | 278.563 |
+-------------------------+-------------------------+-------------------------+-------------------------+
| 3 | 0.009 | 0.010 | 284.493 |
+-------------------------+-------------------------+-------------------------+-------------------------+
| 4 | 0.009 | 0.010 | 289.823 |
+-------------------------+-------------------------+-------------------------+-------------------------+
| 5 | 0.009 | 0.010 | 295.210 |
+-------------------------+-------------------------+-------------------------+-------------------------+
| 6 | 0.009 | 0.010 | 300.643 |
+-------------------------+-------------------------+-------------------------+-------------------------+
| 7 | 0.009 | 0.010 | 305.964 |
+-------------------------+-------------------------+-------------------------+-------------------------+
| 8 | 0.009 | 0.010 | 311.301 |
+-------------------------+-------------------------+-------------------------+-------------------------+
| 9 | 0.009 | 0.010 | 316.884 |
+-------------------------+-------------------------+-------------------------+-------------------------+
| 10 | 0.009 | 0.010 | 322.197 |
+-------------------------+-------------------------+-------------------------+-------------------------+
| 11 | 0.009 | 0.010 | 327.526 |
+-------------------------+-------------------------+-------------------------+-------------------------+
| 12 | 0.009 | 0.010 | 332.883 |
+-------------------------+-------------------------+-------------------------+-------------------------+
| 13 | 0.009 | 0.010 | 338.217 |
+-------------------------+-------------------------+-------------------------+-------------------------+
| 14 | 0.009 | 0.010 | 343.616 |
+-------------------------+-------------------------+-------------------------+-------------------------+
| 15 | 0.009 | 0.010 | 348.946 |
+-------------------------+-------------------------+-------------------------+-------------------------+
| 16 | 0.009 | 0.010 | 354.368 |
+-------------------------+-------------------------+-------------------------+-------------------------+
| 17 | 0.009 | 0.010 | 359.809 |
+-------------------------+-------------------------+-------------------------+-------------------------+
| 18 | 0.009 | 0.010 | 365.522 |
+-------------------------+-------------------------+-------------------------+-------------------------+
| 19 | 0.009 | 0.010 | 371.020 |
+-------------------------+-------------------------+-------------------------+-------------------------+
| 20 | 0.009 | 0.010 | 376.368 |
+-------------------------+-------------------------+-------------------------+-------------------------+
| 21 | 0.009 | 0.010 | 381.982 |
+-------------------------+-------------------------+-------------------------+-------------------------+
| 22 | 0.009 | 0.010 | 387.630 |
+-------------------------+-------------------------+-------------------------+-------------------------+
| 23 | 0.009 | 0.010 | 393.183 |
+-------------------------+-------------------------+-------------------------+-------------------------+
| 24 | 0.009 | 0.010 | 398.788 |
+-------------------------+-------------------------+-------------------------+-------------------------+
| 25 | 0.009 | 0.010 | 404.264 |
+-------------------------+-------------------------+-------------------------+-------------------------+
{'accuracy': 0.00946372239747634, 'auc': nan, 'precision': 0.00946372239747634, 'recall': 0.006944444444444444, 'f1_score': 0.00013020833333333336, 'log_loss': 4.975145686501957, 'confusion_matrix': Columns:
target_label int
predicted_label int
count int
Rows: 144
Data:
+--------------+-----------------+-------+
| target_label | predicted_label | count |
+--------------+-----------------+-------+
| 1688 | 2609 | 21 |
| 1650 | 2609 | 34 |
| 237 | 2609 | 25 |
| 5142 | 2609 | 27 |
| 6267 | 2609 | 32 |
| 1188 | 2609 | 26 |
| 2033 | 2609 | 30 |
| 4294 | 2609 | 29 |
| 3576 | 2609 | 29 |
| 7729 | 2609 | 28 |
+--------------+-----------------+-------+
[144 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns., 'roc_curve': Columns:
threshold float
fpr float
tpr float
p int
n int
class int
Rows: 14600146
Data:
+-----------+-----+-----+----+------+-------+
| threshold | fpr | tpr | p | n | class |
+-----------+-----+-----+----+------+-------+
| 0.0 | 1.0 | 1.0 | 29 | 4092 | 61 |
| 1e-05 | 1.0 | 1.0 | 29 | 4092 | 61 |
| 2e-05 | 1.0 | 1.0 | 29 | 4092 | 61 |
| 3e-05 | 1.0 | 1.0 | 29 | 4092 | 61 |
| 4e-05 | 1.0 | 1.0 | 29 | 4092 | 61 |
| 5e-05 | 1.0 | 1.0 | 29 | 4092 | 61 |
| 6e-05 | 1.0 | 1.0 | 29 | 4092 | 61 |
| 7e-05 | 1.0 | 1.0 | 29 | 4092 | 61 |
| 8e-05 | 1.0 | 1.0 | 29 | 4092 | 61 |
| 9e-05 | 1.0 | 1.0 | 29 | 4092 | 61 |
+-----------+-----+-----+----+------+-------+
[14600146 rows x 6 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.}
Below is what I got with the same data from CreateML
The 2 implementations are nearly identical minus the last layer training so this should not happen. This is indeed very strange. Is there any way you can share a sample of the data in a re-pro step so we can test this out and get to the bottom of this?
I noticed that you used LibriSpeech. Let me see if I can repro this.
@MatthewWaller Can you share the exact pre-processing you did on the dataset just to make sure we can repro exactly what you did.
@MatthewWaller Can you share the exact pre-processing you did on the dataset just to make sure we can repro exactly what you did.
Sure thing:
First there is this Voice Activity Detection class to drop silences
class VoiceActivityDetection:
def __init__(self):
self.__step = 10
self.__buffer_size = 10
self.__buffer = numpy.array([],dtype=numpy.int16)
self.__out_buffer = numpy.array([],dtype=numpy.int16)
self.__n = 0
self.__VADthd = 0.
self.__VADn = 0.
self.__silence_counter = 0
# Voice Activity Detection
# Adaptive threshold
def vad(self, _frame):
frame = numpy.array(_frame) ** 2.
result = True
threshold = 0.1
thd = numpy.min(frame) + numpy.ptp(frame) * threshold
self.__VADthd = (self.__VADn * self.__VADthd + thd) / float(self.__VADn + 1.)
self.__VADn += 1.
if numpy.mean(frame) <= self.__VADthd:
self.__silence_counter += 1
else:
self.__silence_counter = 0
if self.__silence_counter > 20:
result = False
return result
# Push new audio samples into the buffer.
def add_samples(self, data):
self.__buffer = numpy.append(self.__buffer, data)
result = len(self.__buffer) >= self.__buffer_size
# print('__buffer size %i'%self.__buffer.size)
return result
# Pull a portion of the buffer to process
# (pulled samples are deleted after being
# processed
def get_frame(self):
window = self.__buffer[:self.__buffer_size]
self.__buffer = self.__buffer[self.__step:]
# print('__buffer size %i'%self.__buffer.size)
return window
# Adds new audio samples to the internal
# buffer and process them
def process(self, data):
if self.add_samples(data):
while len(self.__buffer) >= self.__buffer_size:
# Framing
window = self.get_frame()
# print('window size %i'%window.size)
if self.vad(window): # speech frame
self.__out_buffer = numpy.append(self.__out_buffer, window)
# print('__out_buffer size %i'%self.__out_buffer.size)
def get_voice_samples(self):
return self.__out_buffer
Then a function we'll use to split the sample values into 1 second intervals
import itertools
def split_seq(iterable, size):
it = iter(iterable)
item = list(itertools.islice(it, size))
while item:
yield item
item = list(itertools.islice(it, size))
import itertools
import os
import scipy.io.wavfile as wf
import numpy
import sox
from pathlib import Path
Now here is where we walk the files, create a new folder for the collection of folders, and then split them up into one second files.
devClean_inputPath_wav = "/Users/matthewwaller/Developer/AI/DataStore/Audio/LibriSpeech/dev-clean-wav"
devOther_inputPath_wav = "/Users/matthewwaller/Developer/AI/DataStore/Audio/LibriSpeech/dev-other-wav"
testClean_inputPath_wav = "/Users/matthewwaller/Developer/AI/DataStore/Audio/LibriSpeech/test-clean-wav"
testOther_inputPath_wav = "/Users/matthewwaller/Developer/AI/DataStore/Audio/LibriSpeech/test-other-wav"
trainClean_100_inputPath_wav = "/Users/matthewwaller/Developer/AI/DataStore/Audio/LibriSpeech/train-clean-100-wav"
trainClean_360_inputPath_wav = "/Users/matthewwaller/Developer/AI/DataStore/Audio/LibriSpeech/train-clean-360-wav"
trainOther_500_inputPath_wav = "/Users/matthewwaller/Developer/AI/DataStore/Audio/LibriSpeech/train-other-500-wav"
wav_inputPaths = [devClean_inputPath_wav]
outputPath = "/Users/matthewwaller/Developer/AI/DataStore/Audio/LibriSpeech"
for path in wav_inputPaths:
lastPathComponent = os.path.basename(os.path.normpath(path))
newLastPathComponent = "/"+ lastPathComponent + "-1Second"
newOutputPath = outputPath + newLastPathComponent
for subdir, dirs, files in os.walk(path):
for file in files:
piecesOfFile = file.split("-")
speakerFolder = piecesOfFile[0]
if file.endswith('.wav'):
fileLocation = os.path.join(subdir, file)
newFileLocation = os.path.join(newOutputPath + "/" + speakerFolder, file)
path = Path(newFileLocation)
path.parent.mkdir(parents=True, exist_ok=True)
new_path = Path(path.parent.as_posix() + '/' + path.stem)
wav = wf.read(fileLocation)
ch = wav[1]
sr = wav[0]
vad = VoiceActivityDetection()
vad.process(ch)
voice_samples = vad.get_voice_samples()
splitSamples = list(split_seq(voice_samples, int(sr * 1)))
for index, splitSilenceBuffers in enumerate(splitSamples):
filePath = "%s_%s.wav" % (str(new_path), index)
array = numpy.asarray(splitSilenceBuffers)
if (len(array) == int(sr * 1)):
wf.write(filePath, sr, array)
And this is where we take 10 percent of the files in each folder, of each speaker, for testing.
devClean_inputPath_wav = "/Users/matthewwaller/Developer/AI/DataStore/Audio/LibriSpeech/dev-clean-wav-1Second"
devOther_inputPath_wav = "/Users/matthewwaller/Developer/AI/DataStore/Audio/LibriSpeech/dev-other-wav-1Second"
testClean_inputPath_wav = "/Users/matthewwaller/Developer/AI/DataStore/Audio/LibriSpeech/test-clean-wav-1Second"
testOther_inputPath_wav = "/Users/matthewwaller/Developer/AI/DataStore/Audio/LibriSpeech/test-other-wav-1Second"
trainClean_100_inputPath_wav = "/Users/matthewwaller/Developer/AI/DataStore/Audio/LibriSpeech/train-clean-100-wav-1Second"
trainClean_360_inputPath_wav = "/Users/matthewwaller/Developer/AI/DataStore/Audio/LibriSpeech/train-clean-360-wav-1Second"
trainOther_500_inputPath_wav = "/Users/matthewwaller/Developer/AI/DataStore/Audio/LibriSpeech/train-other-500-wav-1Second"
wav_inputPaths = [devClean_inputPath_wav]
outputPath = "/Users/matthewwaller/Developer/AI/DataStore/Audio/LibriSpeech"
for path in wav_inputPaths:
lastPathComponent = os.path.basename(os.path.normpath(path))
newLastPathComponent = "/"+ lastPathComponent + "-testing"
newOutputPath = outputPath + newLastPathComponent
for subdir, dirs, files in os.walk(path):
filesToUse = files[0::10]
for file in filesToUse:
piecesOfFile = file.split("-")
speakerFolder = piecesOfFile[0]
if file.endswith('.wav'):
fileLocation = os.path.join(subdir, file)
newFileLocation = os.path.join(newOutputPath + "/" + speakerFolder, file)
path = Path(newFileLocation)
path.parent.mkdir(parents=True, exist_ok=True)
os.rename(fileLocation, newFileLocation)
Once you copy and paste the folders in the ...1Second and ...1Second-testing, into a Training and Testing folder, you can use this to generate the .csv
import os, csv
f = open('./DataStore/Audio/DictationSpeakerLabels/Metadata/Dictation.csv','r+')
w = csv.writer(f)
w.writerow(['filename', 'type', 'category'])
for subdir, dirs, files in os.walk('./DataStore/Audio/5SecondFilesForSpeakersLables/Training/'):
for file in files:
piecesOfFile = file.split("-")
speakerFolder = piecesOfFile[0]
category = speakerFolder + "-dictation"
w.writerow([file, 'train', category])
for subdir, dirs, files in os.walk('./DataStore/Audio/5SecondFilesForSpeakersLables/Testing/'):
for file in files:
piecesOfFile = file.split("-")
speakerFolder = piecesOfFile[0]
category = speakerFolder + "-dictation"
w.writerow([file, 'test', category])
I should note that I changed the seconds duration from 1 second to 5 seconds and everything worked great. I was even able to make the model updatable and import it into my project and let the CoreML3 sound analysis framework work it's magic and give me accurate times for when each re-trained speaker was speaking.
@MatthewWaller Great! I take it the issue was that 1 second audio clips was not long enough and after making it 5, it was an easier problem. Was this in your pre-processing script?
The pre-processing I posted was for 1 second. I had to change it in a couple of place for it to be 5 seconds. However, CreateML had no problem with the 1 second files. Strange.
This is likely related to #1712.
@MatthewWaller - are you sure the sequences were exactly one second long? It sounds like they may actually be less than 975 milliseconds.
I’m pretty sure it’s at 1 second. At least that’s what the preprocessing code I wrote above does.
I've run your code on a small subset of the data. It does seem to be generating audio of one second in length. So I don't think it's related to #1712.
It might be related to #2633. It seems with your code, data from the same speaker will all be in one continuous block. This would likely cause must mini batches to only examples from one person/label.
You could try extracting the deep feature yourself and doing the shuffling yourself:
data['deep_features'] = tc.sound_classifier.get_deep_features(data['audio'])
data = data.shuffle()
Then pass in feature='deep_features'
to tc.sound_classifier.create
.
hi @MatthewWaller how did you create an updatable Sound Classifier model using Turicreate? I trained a model and generated model doesn't seem to be updatable. I followed the guidelines provided here Do we have to make any additional config changes to ensure that the generated model is updatable? Thanks 🙏🏽
I used LibriSpeech to classify the speakers. I broke up my data and trained it with CreateML, and with 147 classes, the outcome is excellent, or at least it claims 98% on testing.
However, I would like to make the model updatable, and this is something I can't do with CreateML since it uses a GLM at the end.
So I tried using the TC sound classifier since it has a neural network classifier at the end with updatable dense layers. I went through the sample project in the documentation and everything worked fine with TC's sound classifier. Then I tried to use my own data, the same I used with CreateML, and it never learns at all.
Is there something I might be missing? Is CreateML's approach super different than TC's? I've looked at both models with CoreML tools and see that they're both pipeline classifiers, and that they start with a preprocessing model, a feature abstractor, and then, for TC, the neural network, and for CreateML, the GLM.
Any recommendations on getting TC's sound classifier to work with my data? They're 16000 sample rate waves with silence removed and then divided into 1 second files.
Thanks!