dotnet-smartcomponents / smartcomponents

Experimental, end-to-end AI features for .NET apps

i18n usage #32

Closed: markchipman closed this issue 3 months ago

markchipman commented 3 months ago

I love these components. I have a multi-language application, so I'm wondering whether local embeddings are the best way to handle this scenario, storing the embeddings locally in the filesystem, or whether I should use your EF example and query against a FK that identifies the desired language for the embeddings. If possible, could you add usage best practices for this scenario to the documentation in this repo? Perhaps, so the AI returns a response in the proper language, a data attribute could be added to the smart component, something like data-i18n-desired-response='es-MX'. Or is it better to somehow use data-smartpaste-description to emit other i18n field-specific instructions into the prompt for each field?

SteveSandersonMS commented 3 months ago

is it better to somehow use data-smartpaste-description to emit other i18n field-specific instructions into the prompt for each field?

Yes, that's what I'd advise. data-smartpaste-description lets you inject arbitrary field-specific information into the prompt, so if you want the results to be in a particular format, a particular language, etc., then you can specify it there.
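For example, a minimal Blazor form sketch (the field name, wording, and the es-MX instruction are illustrative; data-smartpaste-description and SmartPasteButton come from the Smart Paste docs, but verify the exact component markup against the repo's README):

```razor
<form>
    @* Hypothetical field: the description asks Smart Paste to fill this
       value in Mexican Spanish regardless of the clipboard's language *@
    <input name="deliveryNotes"
           data-smartpaste-description="Delivery notes. Always write this value in Spanish (es-MX), even if the pasted text is in another language." />

    <SmartPasteButton DefaultIcon />
</form>
```

Each field can carry its own instruction this way, so different fields can request different languages or formats within the same form.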

In general, the LLM will respond in the language you use when providing input. So if your clipboard text is in a particular language, and the field names or data-smartpaste-description indicate that language, it should populate the form in that language.

I'm wondering whether local embeddings are the best way to handle this scenario, storing the embeddings locally in the filesystem

Embeddings are totally separate from Smart Paste, so I'm assuming this is a separate question. The default embeddings model, bge-micro-v2, is very small and, as far as I know, optimized for English, but it may well produce usable results on similar languages. Bigger embeddings models like MiniLM-L6-v2 might do better still.

If you want to produce much better results across a much wider range of languages, consider using a bigger multilingual model such as https://huggingface.co/intfloat/multilingual-e5-large, which is available in ONNX format. The docs describe how to specify which model you want to use.
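As a rough sketch of the embeddings side (assuming the SmartComponents.LocalEmbeddings package; LocalEmbedder, Embed, and the static Similarity helper appear in the repo's docs, but treat the model-override constructor argument shown in the comment as an assumption to check against the docs):

```csharp
using SmartComponents.LocalEmbeddings;

// Uses the default bundled model (bge-micro-v2): small and English-oriented.
using var embedder = new LocalEmbedder();

// To swap in a multilingual ONNX model such as multilingual-e5-large, the docs
// describe how to point the embedder at a different model; the exact parameter
// name below is an assumption, not a confirmed signature:
// using var embedder = new LocalEmbedder(modelName: "multilingual-e5-large");

var query = embedder.Embed("¿Dónde está mi pedido?");
var candidate = embedder.Embed("Seguimiento de envíos");

// Cosine similarity between the two embeddings; higher means more related.
var similarity = LocalEmbedder.Similarity(query, candidate);
Console.WriteLine(similarity);
```

With a per-language FK scheme as in the EF example, you could store each document's language alongside its embedding and filter candidates by language before ranking by similarity; with a strong multilingual model, a single unfiltered index may also work.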

I'd be interested to hear how you get on with this. Please let us know!