Open · Yannik1337 opened this issue 3 years ago
@Yannik1337 It's a great idea, thanks! Our model SDK and ML backend examples are designed to be simple to understand, so they are not the final word on the subject, and we are planning to write more efficient code. :slightly_smiling_face: It would be great, and highly appreciated, if you could contribute something like this!
I have experimented further, and will list my solutions/approaches for other users here. They don't require any changes to the LS package.
First, to prevent frequent re-instantiation of the model, we can use a global object `MODEL`. In our `fit`, `predict`, and any other relevant methods, we access this global object. We do not need a `self.model` attribute.
Second, to implement caching, we can use the same approach: a global dict `MEMORY` (or any other data structure). See the function `load_file` for an outline.
When we have limited RAM, we have two options:
For unbalanced datasets, we can extend our memory to an array of dicts, `MEMORY[n_classes]`, or use `MEMORY_1`, `MEMORY_2`, ..., one per class. For the dominant classes we then keep only `n` samples, whereas for the minor classes we try to keep all samples loaded. This ensures that the network sees all under-represented classes, while we don't really care which samples from the major classes it sees, since there are too many anyway.
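A minimal sketch of such a per-class bounded cache, assuming class 1 is the dominant class; `cache_sample` and `MAX_MAJOR` are illustrative names, not part of the LS API:

```python
# One dict per class label; the minor class (0) is unbounded,
# the dominant class (1) is capped at MAX_MAJOR samples.
MAX_MAJOR = 500  # assumed RAM budget for the dominant class
MEMORY = {0: {}, 1: {}}

def cache_sample(label, file_id, data):
    class_cache = MEMORY[label]
    if label == 1 and file_id not in class_cache and len(class_cache) >= MAX_MAJOR:
        # Evict the oldest cached majority sample (dicts keep insertion
        # order); which majority samples survive does not matter much.
        class_cache.pop(next(iter(class_cache)))
    class_cache[file_id] = data
```

Minority samples are never evicted, so the under-represented class is always fully available for training.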
Lastly, we can combine this with weighting:
```python
# For two classes: neg is the number of negative (class 0) samples,
# pos the number of positive (class 1) samples, samples = neg + pos.
weight_for_0 = (1 / neg) * (samples / 2.0)
weight_for_1 = (1 / pos) * (samples / 2.0)
class_weight = {0: weight_for_0, 1: weight_for_1}
```
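A self-contained sketch of how these weights come out and get handed to Keras; the counts below are hypothetical, chosen only for illustration:

```python
# Hypothetical label counts: a heavily imbalanced binary dataset.
neg, pos = 900, 100
samples = neg + pos

# Scaling by samples / 2 keeps the weighted loss on roughly the same
# scale as the unweighted one, while boosting the rare class.
weight_for_0 = (1 / neg) * (samples / 2.0)
weight_for_1 = (1 / pos) * (samples / 2.0)
class_weight = {0: weight_for_0, 1: weight_for_1}

# Keras accepts this dict directly in Model.fit:
# MODEL.fit(x, y, class_weight=class_weight, ...)
```

With these counts the rare class gets weight 5.0 and the common class roughly 0.56, so each class contributes about equally to the total loss.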
```python
from tensorflow.keras import layers
from tensorflow import keras
import tensorflow as tf
# ------ for kapre ---------------- #
from label_studio.ml import LabelStudioMLBase
import random
import librosa
import numpy as np
import json


def get_model():
    # create and return the model
    ...


METRICS = [
    keras.metrics.TruePositives(name='tp'),
    keras.metrics.FalsePositives(name='fp'),
    keras.metrics.TrueNegatives(name='tn'),
    keras.metrics.FalseNegatives(name='fn'),
    keras.metrics.BinaryAccuracy(name='accuracy'),
    keras.metrics.Precision(name='precision'),
    keras.metrics.Recall(name='recall'),
    keras.metrics.AUC(name='auc'),
]

# Module-level objects: built once, reused across fit()/predict() calls.
MODEL = get_model(...)
MODEL.compile('adam', keras.losses.BinaryCrossentropy(),
              steps_per_execution=20, metrics=METRICS)

MEMORY = {}  # initially empty


class BaseClassifier(LabelStudioMLBase):

    def __init__(self, **kwargs):
        super(BaseClassifier, self).__init__(**kwargs)
        from_name, schema = list(self.parsed_label_config.items())[0]
        self.from_name = from_name
        self.to_name = schema['to_name'][0]
        self.labels = schema['labels']

    def predict(self, tasks, **kwargs):
        # as usual
        ...

    def load_file(self, file_id, file_path):
        if file_id in MEMORY:
            data = MEMORY[file_id]
            print(f"Cached: {file_id}")
        else:
            print(f"Not cached: {file_id}")
            data = ...  # load data and cache it
            MEMORY[file_id] = data
        return data

    def fit(self, completions, workdir=None, **kwargs):
        print("Running fit")
        # generate data, etc.
        MODEL.fit(...)
```
I am using LS to label audio data. I use a local file dir, and access the audio files in the `fit` method as (sample code). I noticed that the script gets re-run all over again when starting the training. Loading the audio data is time-expensive, as is creating/loading a model. Is there a way to enable some kind of persistency?

I am thinking of holding the audio files in RAM (I can provide enough RAM if necessary), for example in a `numpy` array, to which I would dynamically add any newly loaded audio file. During the next training I'd then check (via a dictionary, or something else) whether I have previously loaded the file. A similar thing would be convenient for model handling: instead of reloading, keep the model in memory.