MichaelMcCulloch / WhisperVoiceKeyboard

A voice to text keyboard based on OpenAI Whisper Model.
MIT License

Use larger Whisper Model #11

Open MichaelMcCulloch opened 1 year ago

MichaelMcCulloch commented 1 year ago

Before we can do this, we need to run inference on faster hardware. The larger models take significantly longer to run on the CPU, and I've yet to be able to measure the performance of larger models on specialized hardware, i.e., via NNAPI.
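For a rough CPU baseline, per-model latency can be measured from Python with the TFLite interpreter (a minimal sketch; the model path is a placeholder, and NNAPI itself is only reachable through the Android delegate, not from desktop Python):

import time

import numpy as np
import tensorflow as tf

# The model path is a placeholder for whichever Whisper TFLite export is being tested.
interpreter = tf.lite.Interpreter(model_path="whisper.tflite")
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# Whisper expects an 80-bin log-mel spectrogram padded to 3000 frames.
mel = np.zeros(inp["shape"], dtype=np.float32)

start = time.perf_counter()
interpreter.set_tensor(inp["index"], mel)
interpreter.invoke()
elapsed = time.perf_counter() - start

print(f"inference took {elapsed:.2f}s, output shape {interpreter.get_tensor(out['index']).shape}")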

nyadla-sys commented 1 year ago

Using separate encoder and decoder TFLite models is the better option for multilingual support. For Android app development, you can refer to the code available at https://github.com/ipsilondev/whisper-cordova/blob/36bf04192c05bbb9aa24d3dfb553039b4aeaa755/android/cpp/native-lib.cpp.

To exercise this feature from Python, you can use the Colab notebook available at https://colab.research.google.com/github/usefulsensors/openai-whisper/blob/main/notebooks/whisper_encoder_decoder_tflite.ipynb.

Based on my understanding, it is still practical to run Whisper models up to the small size on mobile phones.
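Roughly, the flow with separate encoder and decoder TFLite models looks like this (a minimal sketch of what the notebook above does; the file names, input ordering, and token IDs are placeholder assumptions, see the notebook for the exact values):

import numpy as np
import tensorflow as tf

# Load the two models (file names are placeholders for whatever the notebook exports).
encoder = tf.lite.Interpreter(model_path="whisper-encoder.tflite")
decoder = tf.lite.Interpreter(model_path="whisper-decoder.tflite")
encoder.allocate_tensors()
decoder.allocate_tensors()

# 1. Run the encoder once over the 80x3000 log-mel spectrogram.
enc_in = encoder.get_input_details()[0]
mel = np.zeros(enc_in["shape"], dtype=np.float32)
encoder.set_tensor(enc_in["index"], mel)
encoder.invoke()
encoder_out = encoder.get_tensor(encoder.get_output_details()[0]["index"])

# 2. Greedy-decode one token at a time. The decoder takes the token IDs so far
#    plus the encoder output; which input is which is picked here by dtype.
dec_inputs = decoder.get_input_details()
ids_in = next(d for d in dec_inputs if np.issubdtype(d["dtype"], np.integer))
hidden_in = next(d for d in dec_inputs if np.issubdtype(d["dtype"], np.floating))
dec_out = decoder.get_output_details()[0]

# Prefix: <|startoftranscript|><|en|><|transcribe|><|notimestamps|> (placeholder IDs).
tokens = [50258, 50259, 50359, 50363]
end_of_text = 50257
for _ in range(224):
  decoder.resize_tensor_input(ids_in["index"], [1, len(tokens)])
  decoder.allocate_tensors()
  decoder.set_tensor(ids_in["index"], np.array([tokens], dtype=ids_in["dtype"]))
  decoder.set_tensor(hidden_in["index"], encoder_out)
  decoder.invoke()
  logits = decoder.get_tensor(dec_out["index"])
  next_token = int(np.argmax(logits[0, -1]))
  tokens.append(next_token)
  if next_token == end_of_text:
    break

# tokens now holds the transcription token IDs; decode them with the Whisper tokenizer.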

MichaelMcCulloch commented 1 year ago

In 849517e you can see my attempt to use the separated encoder/decoder model. It is very slow, so I assume I'm doing something wrong, or maybe that's just the way it is.

@nyadla-sys I have another problem I think you have the domain knowledge to help me with, if you also have the time.

import tensorflow as tf


class GenerateModel(tf.Module):
  def __init__(self, model, prefix):
    super(GenerateModel, self).__init__()
    self.model = model
    self.prefix = prefix

  @tf.function(
    input_signature=[
      tf.TensorSpec(shape=(1, 80, 3000), dtype=tf.float32, name="input_features"),
    ],
  )
  def inference(self, input_features):
    # Generate tokens from the log-mel spectrogram, forcing the decoder
    # prefix (language/task/timestamp tokens) that was set via set_prefix.
    outputs = self.model.generate(
      input_ids=input_features,
      max_new_tokens=384,
      return_dict_in_generate=True,
      forced_decoder_ids=self.prefix,
    )
    return {"sequences": outputs["sequences"]}

  @tf.function(
    input_signature=[
      tf.TensorSpec(shape=(4, 2), dtype=tf.int32, name="decoder_prefix"),
    ],
  )
  def set_prefix(self, decoder_prefix):
    # Intended to update the decoder prefix from a tensor at runtime;
    # this is the part that has no visible effect on inference().
    self.prefix = decoder_prefix
    return {"discard": decoder_prefix}

I hope it's clear from this code that

  1. I'm no TensorFlow (or even Python) expert.
  2. I'm trying to use the generation-enabled model while allowing the decoder prefix to be set via a tensor.
  3. It's not working.

With respect to the second point, the goal is to instruct the model to translate or transcribe a given language and to configure timestamps. This is not working, and I don't know why, but I suspect the value of forced_decoder_ids=self.prefix ends up effectively hard-coded, because calls to set_prefix have no effect on the decoder output. I think this is a product of the way the graph is traced, but I don't know enough to fix it.
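If the prefix really is frozen in at tracing time, one workaround might be to make the prefix a tensor argument of the traced inference function itself rather than module state. This is an untested sketch, and whether generate() will accept the prefix as decoder_input_ids this way is an assumption on my part:

import tensorflow as tf


class GenerateModel(tf.Module):
  def __init__(self, model):
    super(GenerateModel, self).__init__()
    self.model = model

  @tf.function(
    input_signature=[
      tf.TensorSpec(shape=(1, 80, 3000), dtype=tf.float32, name="input_features"),
      tf.TensorSpec(shape=(1, None), dtype=tf.int32, name="decoder_prefix"),
    ],
  )
  def inference(self, input_features, decoder_prefix):
    # The prefix arrives as a tensor input of the graph, so it can change
    # per call instead of being captured once when the function is traced.
    outputs = self.model.generate(
      input_features,
      decoder_input_ids=decoder_prefix,
      max_new_tokens=384,
      return_dict_in_generate=True,
    )
    return {"sequences": outputs["sequences"]}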

nyadla-sys commented 1 year ago

Currently, translation and transcription of audio work only with the separate encoder and decoder TFLite models. You can find a Python implementation of this process in the following Colab notebook: https://colab.research.google.com/github/usefulsensors/openai-whisper/blob/main/notebooks/whisper_encoder_decoder_tflite.ipynb.

Furthermore, there is an ongoing effort to add this feature to an Android app using C++ through the following GitHub link: https://github.com/ipsilondev/whisper-cordova/blob/main/android/cpp/native-lib.cpp.

It should be noted that, as of now, none of the TFLite models support timestamps. However, this feature may be added at a later time.
