CodeWithKyrian / transformers-php

Transformers PHP is a toolkit for PHP developers to add machine learning magic to their projects easily.
https://codewithkyrian.github.io/transformers-php/
Apache License 2.0

Using embeddings models #59

Closed spaceworkplatform closed 2 months ago

spaceworkplatform commented 3 months ago

Your question

Hey, first of all, great work. This library is exactly what I was looking for. One thing that would be awesome is the ability to use embedding models. We all know that open-source models can be better than ada-002, and as a PHP developer, being able to generate embeddings with open-source models through this library would be great. I know gte-small, for example, is supported by transformers.js, so it only makes sense to be able to use it server-side with PHP as well.

Is there any option to use embedding models as is, or is it possible with some additional work?

Context (optional)

No response

Reference (optional)

No response

CodeWithKyrian commented 3 months ago

Hey @spaceworkplatform,

Thank you for the kind words. I'm glad to hear that the library is helping you out.

Regarding your request for embeddings models, this feature is already available in TransformersPHP through feature extraction, which essentially provides the embeddings you’re looking for. You can easily use an embeddings model with the following example:

use function Codewithkyrian\Transformers\Pipelines\pipeline;

$extractor = pipeline('embeddings', 'Xenova/all-MiniLM-L6-v2');
// or $extractor = pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
$embeddings = $extractor('The quick brown fox jumps over the lazy dog.', normalize: true, pooling: 'mean');
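Once you have the embeddings, a typical next step is comparing them. Here's a rough sketch of scoring two sentences by cosine similarity; it assumes the pooled output comes back as a plain PHP array (possibly wrapped in an outer batch dimension), so adjust the unwrapping if your version returns something slightly different:

use function Codewithkyrian\Transformers\Pipelines\pipeline;

// Illustrative helper: cosine similarity between two embedding vectors.
function cosineSimilarity(array $a, array $b): float
{
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $i => $value) {
        $dot   += $value * $b[$i];
        $normA += $value * $value;
        $normB += $b[$i] * $b[$i];
    }
    return $dot / (sqrt($normA) * sqrt($normB));
}

$extractor = pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

$a = $extractor('The quick brown fox jumps over the lazy dog.', normalize: true, pooling: 'mean');
$b = $extractor('A fast brown fox leaps over a sleepy dog.', normalize: true, pooling: 'mean');

// Unwrap the outer batch dimension if present (assumption about the return shape).
$vecA = is_array($a[0]) ? $a[0] : $a;
$vecB = is_array($b[0]) ? $b[0] : $b;

echo cosineSimilarity($vecA, $vecB);

Since normalize: true gives unit-length vectors, the dot product alone is already the cosine similarity, but the full function above works either way.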

You can replace Xenova/all-MiniLM-L6-v2 with any other embedding model repository that has ONNX weights, and it will work seamlessly. Here are a few popular models with ONNX weights that you might find useful:

For more details on how to use feature extraction in TransformersPHP, you can check out the documentation here: TransformersPHP Feature Extraction.

Let me know if you have any more questions or if there's anything else you'd like to see in the library!

spaceworkplatform commented 3 months ago

@CodeWithKyrian Thank you, I didn't notice that in the docs. I got an error trying to pre-download a model:

Downloading model: onnx-community/gte-multilingual-base
✔ Initializing download...
✔ Downloading tokenizer.json : [••••••••••••••••••••••••••••] 100%
✔ Downloading tokenizer_config.json : [••••••••••••••••••••••••••••] 100%
✔ Downloading config.json : [••••••••••••••••••••••••••••] 100%
Unknown model class for model type new. Using base class PreTrainedModel.
✔ Downloading model_quantized.onnx : [••••••••••••••••••••••••••••] 100%
✘ Load model from .transformers-cache/onnx-community/gte-multilingual-base/onnx/model_quantized.onnx failed: /Users/runner/work/1/s/onnxruntime/core/graph/model.cc:180 onnxruntime::Model::Model(ModelProto &&, const PathString &, const IOnnxRuntimeOpSchemaRegistryList *, const logging::Logger &, const ModelOptions &) Unsupported model IR version: 10, max supported IR version: 9

Failed to download the model: The command "'./vendor/bin/transformers' 'download' 'onnx-community/gte-multilingual-base' '--quantized=true'" failed.

Exit Code: 1(General error)

Working directory: /Users/netaneledri/Dev/Sites/myspace-api

Output:

Error Output:

spaceworkplatform commented 3 months ago

@CodeWithKyrian I succeeded in downloading Xenova/all-MiniLM-L6-v2, for example, so it looks like an issue specific to that model. My problem is that I need an embedding model that supports Hebrew, and I'm having a hard time finding one; a lot of models don't say exactly which languages they support. Any suggestions?

CodeWithKyrian commented 3 months ago

Hi @spaceworkplatform,

To be honest, I don't have much experience working with non-English languages, and I understand how challenging it can be to find models that support specific languages like Hebrew. The most reliable way to determine if a model supports Hebrew is by checking the model card on Hugging Face or looking up any additional resources provided by the organization behind the model. Another method is to inspect the model's vocabulary (found in the tokenizer.json) to see if Hebrew characters are included, which can suggest the model has been trained to understand Hebrew.
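As a rough illustration of that vocabulary check (this snippet isn't part of TransformersPHP, and the exact layout of tokenizer.json varies by tokenizer type; many models keep a token-to-id map under model.vocab, which is what this sketch assumes), you could scan a downloaded tokenizer.json for tokens in the Hebrew Unicode block:

// Path assumes the model has already been downloaded into the default cache directory.
$json = file_get_contents('.transformers-cache/Xenova/all-MiniLM-L6-v2/tokenizer.json');
$tokenizer = json_decode($json, true);

// Many tokenizers store the vocabulary as a token => id map under model.vocab.
$vocab = $tokenizer['model']['vocab'] ?? [];

$hebrewTokens = 0;
foreach (array_keys($vocab) as $token) {
    // The Hebrew Unicode block spans U+0590 to U+05FF.
    if (preg_match('/[\x{0590}-\x{05FF}]/u', (string) $token)) {
        $hebrewTokens++;
    }
}

echo "Tokens containing Hebrew characters: {$hebrewTokens}\n";

A vocabulary with only a handful of Hebrew tokens (or none at all) is a strong hint that the model wasn't trained on much Hebrew text.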

That said, I took some time to research models fine-tuned on Hebrew datasets and came across a creator on Hugging Face, avichr, who has a collection of models for various tasks in Hebrew. However, there's no specific feature extraction model available.

To help out, I converted the heBERT sentiment analysis model to ONNX and tested it for feature extraction. It performed well in my tests (using Google Translate to generate Hebrew sentences). You can try it out with this code:

use function Codewithkyrian\Transformers\Pipelines\pipeline;

$extractor = pipeline('feature-extraction', 'codewithkyrian/heBERT_sentiment_analysis');
$result = $extractor('אני אוהב לקרוא ספרים', pooling: 'mean');

You can check out the model here. I hope this helps with your project!