MichaelMcCulloch opened 1 year ago
Using separate encoder and decoder TFLite models would be a better option for multilanguage support. For Android app development, you can refer to the code available at https://github.com/ipsilondev/whisper-cordova/blob/36bf04192c05bbb9aa24d3dfb553039b4aeaa755/android/cpp/native-lib.cpp.
To exercise this feature using Python, you can use the Colab notebook available at https://colab.research.google.com/github/usefulsensors/openai-whisper/blob/main/notebooks/whisper_encoder_decoder_tflite.ipynb.
Based on my understanding, it is still acceptable to run up to the small Whisper models on mobile phones.
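For a rough picture of what running the separate encoder and decoder TFLite models looks like from Python, here is a minimal sketch of chaining two tf.lite.Interpreter instances with a greedy decode loop. The file names, the decoder's input ordering and dtype, and the token IDs are assumptions for illustration only; the Colab notebook linked above is the authoritative reference.

```python
# Sketch only: assumes an encoder that maps the (1, 80, 3000) log-mel input to
# hidden states, and a decoder that takes (token prefix, encoder output) and
# returns next-token logits. Paths and token IDs are placeholders.
import numpy as np
import tensorflow as tf

encoder = tf.lite.Interpreter(model_path="whisper-encoder.tflite")  # placeholder path
decoder = tf.lite.Interpreter(model_path="whisper-decoder.tflite")  # placeholder path

# 1) Run the encoder once on the mel spectrogram.
encoder.allocate_tensors()
enc_in = encoder.get_input_details()[0]
enc_out = encoder.get_output_details()[0]
mel = np.zeros(enc_in["shape"], dtype=enc_in["dtype"])  # placeholder input
encoder.set_tensor(enc_in["index"], mel)
encoder.invoke()
hidden_states = encoder.get_tensor(enc_out["index"])

# 2) Greedy decode, re-feeding the growing token prefix each step (no KV cache).
#    Assumption: the decoder's integer input is the prefix and its float input
#    is the encoder output; 50258/50257 are the usual multilingual start/end
#    token IDs, but verify against the notebook.
dec_inputs = decoder.get_input_details()
ids_in = next(d for d in dec_inputs if np.issubdtype(d["dtype"], np.integer))
states_in = next(d for d in dec_inputs if np.issubdtype(d["dtype"], np.floating))
tokens = [50258]
for _ in range(224):
    ids = np.array([tokens], dtype=ids_in["dtype"])
    decoder.resize_tensor_input(ids_in["index"], ids.shape)
    decoder.allocate_tensors()
    decoder.set_tensor(ids_in["index"], ids)
    decoder.set_tensor(states_in["index"], hidden_states)
    decoder.invoke()
    logits = decoder.get_tensor(decoder.get_output_details()[0]["index"])
    next_token = int(np.argmax(logits[0, -1]))
    if next_token == 50257:
        break
    tokens.append(next_token)
print(tokens)
```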
In 849517e you can see my attempt to use the separated encoder/decoder model. It is very slow, so I assume I'm doing something wrong, or maybe that's just the way it is.
@nyadla-sys I have another problem that I think you have the domain knowledge to help me with, if you also have the time.
```python
import tensorflow as tf


class GenerateModel(tf.Module):
    def __init__(self, model, prefix):
        super(GenerateModel, self).__init__()
        self.model = model
        self.prefix = prefix  # forced decoder IDs (language/task/timestamp tokens)

    @tf.function(
        input_signature=[
            tf.TensorSpec(shape=(1, 80, 3000), dtype=tf.float32, name="input_features"),
        ],
    )
    def inference(self, input_features):
        # Generate tokens from the log-mel spectrogram input.
        outputs = self.model.generate(
            input_ids=input_features,
            max_new_tokens=384,
            return_dict_in_generate=True,
            forced_decoder_ids=self.prefix,
        )
        return {"sequences": outputs["sequences"]}

    @tf.function(
        input_signature=[
            tf.TensorSpec(shape=(4, 2), dtype=tf.int32, name="decoder_prefix"),
        ],
    )
    def set_prefix(self, decoder_prefix):
        # Intended to swap the forced decoder IDs at runtime.
        self.prefix = decoder_prefix
        return {"discard": decoder_prefix}
```
I hope it's clear from this code that I'm exporting two signatures: (1) inference, which generates tokens from the mel input, and (2) set_prefix, which is meant to update the forced decoder IDs at runtime.
With respect to the second point, the goal is to instruct the model to translate or transcribe different languages and to configure timestamps. This is not working, and I don't know why, but I suspect the value of forced_decoder_ids=self.prefix ends up effectively hard-coded, because calls to set_prefix have no effect on the decoder output. I think this is a product of the way the graph is built, but I don't know enough to fix it.
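For what it's worth, that suspicion is consistent with how tf.function tracing works: a plain Python attribute read inside a traced function is baked in as a constant when the concrete function is built, so later Python-side assignments are never seen. Below is a minimal sketch of the difference (the class names are made up for illustration, and it only demonstrates the tracing behaviour; whether model.generate() and the TFLite conversion can actually consume forced_decoder_ids as a tensor or variable is a separate question).

```python
# Sketch (illustrative class names, not from this thread): a value read from a
# plain Python attribute is captured as a constant when the tf.function is
# traced, whereas a tf.Variable updated with assign() stays mutable.
import tensorflow as tf


class FrozenPrefix(tf.Module):
    def __init__(self):
        super().__init__()
        self.prefix = tf.constant([1, 2], dtype=tf.int32)

    @tf.function(input_signature=[])
    def read(self):
        return self.prefix  # captured as a constant at trace time


class VariablePrefix(tf.Module):
    def __init__(self):
        super().__init__()
        self.prefix = tf.Variable([1, 2], dtype=tf.int32, trainable=False)

    @tf.function(input_signature=[tf.TensorSpec(shape=(2,), dtype=tf.int32)])
    def set_prefix(self, new_prefix):
        self.prefix.assign(new_prefix)  # writes into graph state

    @tf.function(input_signature=[])
    def read(self):
        return self.prefix.read_value()  # reads the current variable value


frozen, mutable = FrozenPrefix(), VariablePrefix()
frozen.read()                        # traces and bakes in [1, 2]
frozen.prefix = tf.constant([9, 9])  # Python-side reassignment
print(frozen.read().numpy())         # still [1 2]

mutable.set_prefix(tf.constant([9, 9], dtype=tf.int32))
print(mutable.read().numpy())        # [9 9]
```

Applied to GenerateModel, that would mean storing the prefix in a tf.Variable and having set_prefix call assign(); whether the converted TFLite graph and generate() honour that is exactly the open question here, so treat this only as a pointer to where the freezing happens.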
Currently, translation and transcription of audio work only with the separate encoder and decoder TFLite models. You can find a Python implementation of this process in the following Colab notebook: https://colab.research.google.com/github/usefulsensors/openai-whisper/blob/main/notebooks/whisper_encoder_decoder_tflite.ipynb.
Furthermore, there is an ongoing effort to add this feature to an Android app using C++; see https://github.com/ipsilondev/whisper-cordova/blob/main/android/cpp/native-lib.cpp.
It should be noted that, as of now, none of the TFLite models support timestamps. However, this feature may be added at a later time.
Before we can do this, we need to run inference on faster hardware. The larger models take significantly longer to run on the CPU, and I've yet to be able to measure the performance of larger models on specialized hardware, i.e. via NNAPI.
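For reference, a rough way to get latency numbers from Python is sketched below (the model path and delegate library path are placeholders; on Android, NNAPI is normally exercised on-device, e.g. with the TFLite benchmark tool or a platform delegate, rather than from Python).

```python
# Rough timing harness (sketch, not from this thread): measures the average
# interpreter.invoke() latency for a .tflite file, optionally with a hardware
# delegate loaded from a platform-specific shared library.
import time
import numpy as np
import tensorflow as tf


def benchmark(model_path, delegate_path=None, runs=10):
    delegates = None
    if delegate_path is not None:
        delegates = [tf.lite.experimental.load_delegate(delegate_path)]
    interpreter = tf.lite.Interpreter(model_path=model_path,
                                      experimental_delegates=delegates)
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    # Feed zeros of the expected shape/dtype purely to time the graph.
    interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=inp["dtype"]))
    interpreter.invoke()  # warm-up
    start = time.perf_counter()
    for _ in range(runs):
        interpreter.invoke()
    print(f"avg invoke: {(time.perf_counter() - start) / runs * 1000:.1f} ms")


benchmark("whisper-encoder.tflite")  # placeholder model path
```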