HumanSignal / label-studio

Label Studio is a multi-type data labeling and annotation tool with standardized output format
https://labelstud.io
Apache License 2.0

Persistency when running the ML backend #590

Open Yannik1337 opened 3 years ago

Yannik1337 commented 3 years ago

I am using LS to label audio data. I use a local file directory and access the audio files in the fit method like this (sample code):

for k in completions:
    print("Choice: {}".format(k['completions'][-1]['result'][0]['value']['choices']))
    print("for file {}".format(k['task_path']))
    y, sr = librosa.load(k['task_path'])
    print(y.shape, sr)

I noticed that the script gets re-run from scratch every time training starts. Loading the audio data is expensive, as is creating/loading a model. Is there a way to enable some kind of persistence?

I am thinking of holding the audio files in RAM (I can provide enough RAM if necessary), for example in a numpy array, and dynamically adding any newly loaded audio file to it. During the next training run I would then check (via a dictionary or something similar) whether I have previously loaded the file.

A similar thing would be convenient for model handling: instead of reloading the model each time, keep it in memory.

makseq commented 3 years ago

@Yannik1337 It's a great idea, thanks! Our model SDK and ML backend examples are designed to be easy to understand, so they are not the final word, and we are planning to write more efficient code. :slightly_smiling_face: It would be great and highly appreciated if you could contribute something like this!

Yannik1337 commented 3 years ago

I have experimented further and will list my solutions/approaches for other users here. They don't require any changes to the LS package.

First, to prevent frequent re-instantiation of the model, we can use a global object MODEL. In our fit, predict, and any other relevant method, we access this global object. We do not need a self.model attribute.

Second, to implement caching, we can use the same approach: a global dict (or any other data structure) MEMORY. See the function load_file below for an outline.

When we have limited RAM, we have two options (both sketched after this list):

  1. Limit the number of items in the cache; use the cache as a FIFO queue and continuously replace the oldest entries
  2. Parallelize data loading (I assume I/O is the bottleneck) and use no cache at all
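
A minimal sketch of both options, assuming librosa for loading; MAX_CACHE_SIZE, FIFO_CACHE, load_audio_fifo, and load_many are illustrative names, not Label Studio APIs:

from collections import OrderedDict
from concurrent.futures import ThreadPoolExecutor

import librosa

MAX_CACHE_SIZE = 500  # illustrative cap; tune to the available RAM

# OrderedDict remembers insertion order, so the oldest entry is always first
FIFO_CACHE = OrderedDict()

def load_audio_fifo(file_id, file_path):
    # option 1: bounded FIFO cache; evict the oldest entry when the cap is hit
    if file_id in FIFO_CACHE:
        return FIFO_CACHE[file_id]
    y, sr = librosa.load(file_path)
    if len(FIFO_CACHE) >= MAX_CACHE_SIZE:
        FIFO_CACHE.popitem(last=False)  # drop the oldest entry (FIFO)
    FIFO_CACHE[file_id] = (y, sr)
    return y, sr

def load_many(paths, workers=8):
    # option 2: no cache at all, just parallel I/O-bound loading
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(librosa.load, paths))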

For unbalanced datasets, we can extend our memory to an array of dicts MEMORY[n_classes], or use MEMORY_1, MEMORY_2, one per class. For the dominant classes, we then keep only n samples, whereas for the minor classes we try to keep all samples loaded. This ensures that the network sees all under-represented classes, while we don't really care which samples from the major classes it sees, since there are too many anyway.
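
A minimal sketch of the per-class variant, assuming two classes; MEMORY_PER_CLASS, MAX_PER_CLASS, and cache_sample are illustrative names, and the caps are arbitrary:

# one cache per class; cap only the dominant class (here class 0)
MEMORY_PER_CLASS = {0: {}, 1: {}}
MAX_PER_CLASS = {0: 200, 1: None}  # None = keep every minority sample

def cache_sample(label, file_id, data):
    cache = MEMORY_PER_CLASS[label]
    cap = MAX_PER_CLASS[label]
    if cap is not None and len(cache) >= cap:
        # dicts keep insertion order in Python 3.7+, so this evicts the oldest entry
        cache.pop(next(iter(cache)))
    cache[file_id] = data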

Lastly, we can combine this with weighting:

# for two classes: neg is the number of negative (class 0) samples,
# pos the number of positive (class 1) samples, samples = neg + pos
weight_for_0 = (1 / neg) * (samples / 2.0)
weight_for_1 = (1 / pos) * (samples / 2.0)
class_weight = {0: weight_for_0, 1: weight_for_1}
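
The resulting class_weight dict can be passed straight to Keras, e.g. MODEL.fit(x, y, class_weight=class_weight). Putting it all together, the backend script looks roughly like this: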
import json
import random

import librosa
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
# ------ for kapre ------ #

from label_studio.ml import LabelStudioMLBase

def get_model():
    # create and return the Keras model (placeholder)
    ...

METRICS = [
    keras.metrics.TruePositives(name='tp'),
    keras.metrics.FalsePositives(name='fp'),
    keras.metrics.TrueNegatives(name='tn'),
    keras.metrics.FalseNegatives(name='fn'),
    keras.metrics.BinaryAccuracy(name='accuracy'),
    keras.metrics.Precision(name='precision'),
    keras.metrics.Recall(name='recall'),
    keras.metrics.AUC(name='auc'),
]

MODEL = get_model()
MODEL.compile('adam', keras.losses.BinaryCrossentropy(), steps_per_execution=20, metrics=METRICS)

MEMORY = {} # initially empty

class BaseClassifier(LabelStudioMLBase):

    def __init__(self, **kwargs):
        super(BaseClassifier, self).__init__(**kwargs)

        from_name, schema = list(self.parsed_label_config.items())[0]
        self.from_name = from_name
        self.to_name = schema['to_name'][0]
        self.labels = schema['labels']

    def predict(self, tasks, **kwargs):
        # as usual; use the global MODEL instead of self.model
        ...

    def load_file(self, file_id, file_path):
        if file_id in MEMORY:
            data = MEMORY[file_id]
            print(f"Cached: {file_id}")
        else:
            print(f"Not cached: {file_id}")
            data = ...  # load the file (e.g. with librosa) and cache it
            MEMORY[file_id] = data
        return data

    def fit(self, completions, workdir=None, **kwargs):
        print("Running fit")
        # generate data, etc.
        MODEL.fit(...)