CodeWithKyrian / transformers-php

Transformers PHP is a toolkit for PHP developers to add machine learning magic to their projects easily.
https://codewithkyrian.github.io/transformers-php/
Apache License 2.0

Using same model for vision/text embeddings #31

Closed by karlomikus 2 months ago

karlomikus commented 2 months ago

Your question

Hello,

Is it possible to use the same model to generate both vision and text embeddings? It seems like models such as CLIP and SigLIP should support this, but using a pipeline like this:

<?php
use function Codewithkyrian\Transformers\Pipelines\pipeline;

$modelName = 'Xenova/clip-vit-base-patch32';

$extractor = pipeline('feature-extraction', $modelName);
$embeddings = $extractor('A man with a hat');

Returns an error: Warning: Undefined array key "pixel_values" in /var/www/app/vendor/codewithkyrian/transformers/src/Models/ModelArchitecture.php on line 86

Excellent package btw.


CodeWithKyrian commented 2 months ago

Oh, thanks for the nice words @karlomikus. I understand why you'd want to use the same model for generating both vision and text embeddings, especially with models like CLIP and SigLIP. However, let me clarify a few things about how these models and pipelines work in TransformersPHP.

CLIP and similar models are designed as multimodal models, meaning they can handle both image and text inputs. However, the feature-extraction pipeline is specifically designed for models trained on text inputs only. CLIP expects both kinds of input at once, which is why you're getting that error.

The current structure of the feature-extraction pipeline doesn't account for models that also expect an image input. So, while CLIP and other similar models are multimodal, using them for text-only feature extraction isn't possible out of the box right now.
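
To make the mismatch concrete, here's roughly what happens under the hood (the names below are illustrative, not the library's actual internals):

// The pipeline only prepares the text side of the inputs:
//   $inputs = $tokenizer('A man with a hat');   // input_ids, attention_mask
// but CLIP's ONNX graph also declares a pixel_values input, so the forward
// pass effectively needs both:
//   $outputs = $model(['input_ids' => ..., 'pixel_values' => ...]);
// The missing "pixel_values" key is exactly the warning you're seeing.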

I'll look into the possibility of supporting these types of models in the near future and will reference this issue if it becomes part of the library. In the meantime, I recommend exploring smaller models trained specifically for text feature extraction for your application.
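
For example, a text-only embedding model works with the pipeline as-is. The model choice here is just a suggestion, and the pooling/normalize options mirror the transformers.js ones, so double-check them against your installed version:

<?php
use function Codewithkyrian\Transformers\Pipelines\pipeline;

// A text-only embedding model (suggestion, not a requirement): the
// feature-extraction pipeline can handle it directly.
$extractor = pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
$embeddings = $extractor('A man with a hat', pooling: 'mean', normalize: true);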

I hope this helps! Let me know if you have any other questions.

karlomikus commented 2 months ago

Thanks for the response, makes sense. Looking forward to more features in the future.

Here's what I had so far, which kinda "works", in case someone stumbles upon this.

<?php
// Namespaces omitted for brevity: AutoConfig, Image, SiglipTextModel,
// SiglipVisionModel, PretrainedTokenizer and AutoProcessor all come from
// the codewithkyrian/transformers package.

// Model config needs the "processor_class" key removed so it falls back
// to the default image processor.
$modelName = 'Xenova/siglip-base-patch16-224';
$config = AutoConfig::fromPretrained($modelName);
$image = Image::read('/var/www/app/src/test.jpg');

// Text side: tokenize the prompt and run it through the text tower.
$textModel = SiglipTextModel::fromPretrained($modelName, false, $config);
$textTokenizer = PretrainedTokenizer::fromPretrained($modelName);
$textInputs = $textTokenizer('A man with a hat', padding: true, truncation: true);
$textOutputs = $textModel($textInputs);
$textEmbeddings = $textOutputs["last_hidden_state"] ?? $textOutputs["logits"];

// Vision side: preprocess the image and run it through the vision tower.
$visionModel = SiglipVisionModel::fromPretrained($modelName, false, $config);
$visionProcessor = AutoProcessor::fromPretrained($modelName);
$visionInputs = $visionProcessor($image);
$visionOutput = $visionModel($visionInputs);
$visionEmbeddings = $visionOutput['last_hidden_state']->toArray();
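
To actually compare the two embeddings, I mean-pool the token/patch vectors and take a cosine similarity. Rough sketch in plain PHP; note that SigLIP's learned projection heads aren't applied here, so treat the scores as approximate:

// Assumptions: last_hidden_state is a [batch][tokens][dim] tensor and
// toArray() behaves as in the vision snippet above.
function meanPool(array $tokenVectors): array
{
    $dim = count($tokenVectors[0]);
    $sum = array_fill(0, $dim, 0.0);
    foreach ($tokenVectors as $vector) {
        foreach ($vector as $i => $value) {
            $sum[$i] += $value;
        }
    }
    return array_map(fn ($v) => $v / count($tokenVectors), $sum);
}

function cosineSimilarity(array $a, array $b): float
{
    $dot = $normA = $normB = 0.0;
    foreach ($a as $i => $value) {
        $dot += $value * $b[$i];
        $normA += $value ** 2;
        $normB += $b[$i] ** 2;
    }
    return $dot / (sqrt($normA) * sqrt($normB));
}

$textVector = meanPool($textEmbeddings->toArray()[0]);
$imageVector = meanPool($visionEmbeddings[0]);
echo cosineSimilarity($textVector, $imageVector);
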
CodeWithKyrian commented 2 months ago

Great! This is a perfect use case for using the models directly. The pipelines are just there to make these steps easy, albeit sacrificing some flexibility.
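
One more pointer for anyone reading along: if your installed version includes the zero-shot-image-classification pipeline, it drives CLIP-style models end to end, since it prepares both the text and image inputs. Treat both its availability and the exact call signature as assumptions to verify:

<?php
use function Codewithkyrian\Transformers\Pipelines\pipeline;

// Assumption: zero-shot-image-classification is available in your version;
// it feeds both modalities into the model and scores image-label pairs.
$classifier = pipeline('zero-shot-image-classification', 'Xenova/clip-vit-base-patch32');
$output = $classifier('/var/www/app/src/test.jpg', ['a man with a hat', 'a dog', 'a car']);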