dotnet-smartcomponents / smartcomponents

Experimental, end-to-end AI features for .NET apps

Using multilingual models in local embeddings #35

Closed · jakubmaguza closed this issue 7 months ago

jakubmaguza commented 7 months ago

Good day! These components are fantastic. How can I search for ONNX models for different languages, e.g. Polish? I can see there are models for different tasks like image classification, but are there models for language processing in languages other than English?

SteveSandersonMS commented 7 months ago

There are many language-specific sentence embedding models on HuggingFace, and the ones that include ONNX files will likely work. However, rather than hunting for lots of language-specific models, you might prefer a single multilingual one.

If you update your SmartComponents packages to 0.1.0-preview10147, you'll have an upgraded version that is compatible with models like https://huggingface.co/Xenova/distiluse-base-multilingual-cased-v1. For example, in your csproj, add:

<PropertyGroup>
  <LocalEmbeddingsModelUrl>https://huggingface.co/Xenova/distiluse-base-multilingual-cased-v1/resolve/main/onnx/model_quantized.onnx</LocalEmbeddingsModelUrl>
  <LocalEmbeddingsVocabUrl>https://huggingface.co/Xenova/distiluse-base-multilingual-cased-v1/resolve/main/vocab.txt</LocalEmbeddingsVocabUrl>
</PropertyGroup>

I just did a small test, and it seemed to do a reasonable job of recognizing relationships between concepts in French, German, and Polish.
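If you want to reproduce that kind of test, here's a minimal sketch using the LocalEmbedder API from the SmartComponents.LocalEmbeddings package. It picks up whatever model/vocab URLs your csproj points at. The example sentences below are my own illustrations, not the exact ones I tested:

using SmartComponents.LocalEmbeddings;

var embedder = new LocalEmbedder();

// Illustrative pair: the same idea in English and Polish, plus an
// unrelated English sentence as a control.
var english = embedder.Embed("The weather is beautiful today");
var polish = embedder.Embed("Pogoda jest dzisiaj piękna");
var unrelated = embedder.Embed("The stock market fell sharply");

// With a multilingual model, the cross-language pair should score
// noticeably higher than the unrelated pair.
Console.WriteLine(LocalEmbedder.Similarity(english, polish));
Console.WriteLine(LocalEmbedder.Similarity(english, unrelated));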

jakubmaguza commented 7 months ago

Thank you for this answer; you've been making my life easier ever since the first Blazor releases. Sorry for my questions, I'm not familiar with these models, vocabs, and quantization. I found that this model https://huggingface.co/distilbert/distilbert-base-multilingual-cased/tree/main lacks a quantized version. Can I use the 'full' version, or quantize it on my PC?

SteveSandersonMS commented 7 months ago

Quantization is entirely optional. It just makes the models smaller and faster, at the cost of some accuracy.

I don't know whether the 909 MiB full model you linked to will work, or, if it does, whether the performance will be acceptable. You'll need to experiment. Just bear in mind that for sentence embeddings, larger models don't necessarily outperform smaller ones as much as you'd expect. You may get satisfactory results from a ~100 MiB model, or even a considerably smaller one.
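If you do want to try a non-quantized model, the switch is just a URL change in your csproj. As a sketch, this assumes the Xenova repo above also publishes a full-precision model.onnx alongside model_quantized.onnx in its onnx folder; check the repo's file listing before relying on it:

<PropertyGroup>
  <!-- Full-precision variant (assumed path): larger download and slower inference, slightly higher accuracy -->
  <LocalEmbeddingsModelUrl>https://huggingface.co/Xenova/distiluse-base-multilingual-cased-v1/resolve/main/onnx/model.onnx</LocalEmbeddingsModelUrl>
  <LocalEmbeddingsVocabUrl>https://huggingface.co/Xenova/distiluse-base-multilingual-cased-v1/resolve/main/vocab.txt</LocalEmbeddingsVocabUrl>
</PropertyGroup>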

I'll close this since it seems the core question (using multilingual models) is now resolved.