apple / coremltools

Core ML Tools contains supporting tools for Core ML model conversion, editing, and validation.
https://coremltools.readme.io
BSD 3-Clause "New" or "Revised" License

How to embed tokenization #2371

Closed piotrkowalczuk closed 3 weeks ago

piotrkowalczuk commented 1 month ago

How to embed tokenization ❓

Models created using the Create ML app provide this sleek API that hides some complexity:

/// Model Prediction Input Type
@available(macOS 10.14, iOS 12.0, tvOS 12.0, watchOS 5.0, visionOS 1.0, *)
class ExampleClassifierInput : MLFeatureProvider {

    /// Input text as string value
    var text: String

    var featureNames: Set<String> { ["text"] }

    func featureValue(for featureName: String) -> MLFeatureValue? {
        if featureName == "text" {
            return MLFeatureValue(string: text)
        }
        return nil
    }

    init(text: String) {
        self.text = text
    }
}

While trying to convert the model:

def build_inference_model(weights):
    text = tf.keras.Input(shape=(), dtype=tf.string, name='text')
    input_ids = tf.keras.layers.TextVectorization(
        output_mode='int',
        output_sequence_length=MAX_LENGTH,
        vocabulary='bert_vocabulary.txt',
    )(text)
    attention_mask = tf.cast(tf.not_equal(input_ids, 0), dtype=tf.int32)

    bert = transformers.TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-cased',
        num_labels=len(class_labels))
    bert.set_weights(weights)

    predictions = bert(input_ids=input_ids, attention_mask=attention_mask)
    predicted_label = ArgmaxAndLabelMappingLayer(class_labels, name="category")(predictions.logits)

    model = tf.keras.Model(inputs=text, outputs=predicted_label, name="classifier")

    model.summary()

    return model

I encountered this error: TypeError: dtype=<class 'coremltools.converters.mil.mil.types.type_str.str'> is unsupported for inputs/outputs of the model.

What do you think is the best way to handle my use case? How far can the coremltools library take me without my having to create a Swift package that wraps the model?
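For context, the attention_mask line in the snippet above relies on TextVectorization's convention that 0 is the padding id, so the mask can be derived from the token ids alone. A minimal pure-Python sketch of that step (the ids shown are hypothetical):

```python
# TextVectorization(output_mode='int') pads sequences with id 0, so the
# attention mask is 1 wherever the token id is non-zero. This mirrors
# tf.cast(tf.not_equal(input_ids, 0), tf.int32) in plain Python.
def attention_mask(input_ids):
    """Return 1 for real tokens, 0 for padding positions."""
    return [1 if token_id != 0 else 0 for token_id in input_ids]

ids = [101, 7592, 2088, 102, 0, 0, 0, 0]  # hypothetical token ids, 0-padded
print(attention_mask(ids))  # → [1, 1, 1, 1, 0, 0, 0, 0]
```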

TobyRoseman commented 1 month ago

Hi @piotrkowalczuk - I'm confused. What are you trying to do here? Do you have a Core ML model and you're trying to get predictions from it in Python?

piotrkowalczuk commented 1 month ago

I have a TensorFlow model that I am trying to convert using the Python tooling, and I found that the conversion has some limitations. I want a user experience similar to what the Create ML app offers: a comparable text classifier trained with the Create ML app has the tokenizer included, and such a model is easier to distribute.

TobyRoseman commented 1 month ago

Can you share complete code to reproduce the problem? This should include all necessary import statements and your call to ct.convert.

YifanShenSZ commented 3 weeks ago

Hi @piotrkowalczuk, based on my experience we usually separate the tokenization from the main language model.

One reason is the error you encountered: TypeError: dtype=<class 'coremltools.converters.mil.mil.types.type_str.str'> is unsupported for inputs/outputs of the model. The Core ML framework does not accept string as an I/O dtype (AFAIK only float, int, and bool are supported). That is to say, we do need Swift driver code that instantiates a tokenizer and tokenizes the input string before feeding it into the language model.
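To illustrate what that driver-side tokenization involves, here is a minimal pure-Python sketch. The whitespace splitting and the toy vocabulary are assumptions for illustration only; the real driver would need the same WordPiece tokenizer and bert_vocabulary.txt used during training:

```python
# A toy stand-in for the tokenization that TextVectorization performed
# inside the TF graph: map tokens to ids, truncate/pad to a fixed length,
# and derive the attention mask. Real DistilBERT uses WordPiece, not
# whitespace splitting -- this only illustrates the integer contract the
# converted model would expect.
MAX_LENGTH = 8                      # hypothetical sequence length
PAD_ID, OOV_ID = 0, 1               # hypothetical special-token ids
VOCAB = {"the": 2, "movie": 3, "was": 4, "great": 5}  # toy vocabulary

def tokenize(text):
    """Return (input_ids, attention_mask), both of length MAX_LENGTH."""
    ids = [VOCAB.get(tok, OOV_ID) for tok in text.lower().split()]
    ids = ids[:MAX_LENGTH] + [PAD_ID] * max(0, MAX_LENGTH - len(ids))
    mask = [1 if i != PAD_ID else 0 for i in ids]
    return ids, mask

ids, mask = tokenize("The movie was great")
print(ids)   # → [2, 3, 4, 5, 0, 0, 0, 0]
print(mask)  # → [1, 1, 1, 1, 0, 0, 0, 0]
```

These integer sequences would then be passed to the converted model as its input_ids and attention_mask inputs, with the equivalent tokenizer logic implemented in the Swift driver code.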