enjalot / latent-scope

A scientific instrument for investigating latent spaces
MIT License

Add more embedding models #14

Closed enjalot closed 2 months ago

enjalot commented 7 months ago

In addition to potentially interesting open source models, we should add deprecated models like OpenAI ada-002 and each specific preview version. In supporting #13 we would want to let people import embeddings they already made, and there are likely tons of embeddings made with the older models.

enjalot commented 6 months ago

Thanks to @dhruv-anand-aintech for pointing out LiteLLM as a possible way of adding support for many more providers/models: https://litellm.vercel.app/docs/embedding/supported_embedding and https://litellm.vercel.app/docs/providers

enjalot commented 5 months ago

We should also look into letting the user add an arbitrary HuggingFace transformer model. I'm imagining the model dropdown would have a "Custom" option, which would then display a text input for the HuggingFace model address, e.g. Snowflake/snowflake-arctic-embed-m-long, followed by inputs for max tokens (context length) and the pooling method (essentially what's currently supplied via the JSON config: https://github.com/enjalot/latent-scope/blob/main/latentscope/models/embedding_models.json).
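To make the "Custom" idea concrete, here is a minimal sketch of the three inputs as one validated record, plus what the pooling choice does to per-token vectors. The field names (`model_id`, `max_tokens`, `pooling`), the list of pooling options, and the `2048` value are assumptions for illustration; they are not taken from the existing embedding_models.json schema.

```python
from dataclasses import dataclass

# Hypothetical set of pooling options a "Custom" form might offer.
POOLING_METHODS = ("mean", "cls")


@dataclass
class CustomModelConfig:
    """Hypothetical shape of a user-supplied model entry."""
    model_id: str          # e.g. "Snowflake/snowflake-arctic-embed-m-long"
    max_tokens: int        # context length that inputs are truncated to
    pooling: str = "mean"  # how token vectors collapse into one embedding

    def __post_init__(self):
        if not self.model_id.strip():
            raise ValueError("model_id must not be empty")
        if self.max_tokens <= 0:
            raise ValueError("max_tokens must be positive")
        if self.pooling not in POOLING_METHODS:
            raise ValueError(f"pooling must be one of {POOLING_METHODS}")


def pool(token_vectors, method="mean"):
    """Collapse a list of per-token vectors into a single vector.

    Simplified: a real implementation would also skip padding tokens
    using the attention mask.
    """
    if method == "cls":
        # Use the vector of the first token.
        return list(token_vectors[0])
    if method == "mean":
        count = len(token_vectors)
        return [sum(column) / count for column in zip(*token_vectors)]
    raise ValueError(f"unknown pooling method: {method}")


# Illustrative values only; 2048 is a placeholder context length.
config = CustomModelConfig("Snowflake/snowflake-arctic-embed-m-long", max_tokens=2048)
```

Validating at construction time means a bad value from the form fails before any model download starts.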

arnicas commented 3 months ago

Hi - a vote for custom. I also have to admit I sometimes want GloVe/word2vec models too, for some kinds of text grouping problems. Seems like it would be easy to do?

enjalot commented 3 months ago

@arnicas would implementing something like this with pre-trained GloVe models work? https://www.geeksforgeeks.org/pre-trained-word-embedding-using-glove-in-nlp-models/
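For reference, a rough sketch of what that could look like without any extra dependencies: parse GloVe's plain-text format (one token per line followed by its float components) and average the vectors of the words in a text. The function names and the whitespace tokenization are assumptions for illustration, not an existing latent-scope API.

```python
def load_glove(lines):
    """Parse GloVe's text format: a token, then its space-separated floats."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


def embed_text(text, vectors):
    """Average the vectors of known words; None if no word is known.

    Tokenization is a naive lowercase whitespace split (an assumption).
    """
    found = [vectors[token] for token in text.lower().split() if token in vectors]
    if not found:
        return None
    return [sum(column) / len(found) for column in zip(*found)]


# Tiny in-memory stand-in for a real GloVe file.
sample = ["cat 1.0 0.0", "dog 0.0 1.0"]
vectors = load_glove(sample)
```

With a downloaded file, the same loader could be fed an open file handle, e.g. `load_glove(open("glove.6B.50d.txt", encoding="utf-8"))`, where the path is just an example.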

I'm now considering using sentence-transformers and powering the embedding choice via a HuggingFace hub API search

enjalot commented 2 months ago

I've made some progress switching to sentence_transformers and enabling backend support for passing in any HuggingFace model id. Next I'll work on an advanced dropdown that lets you search the HF API or pick from available third-party models (and even recently used ones).
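Roughly, passing in any model id could look like the sketch below. It uses `SentenceTransformer(model_id)` and `model.encode(...)` from the sentence-transformers package; the helper names `load_encoder` and `embed_in_batches` and the batch size are hypothetical, not the actual latent-scope backend code.

```python
def load_encoder(model_id):
    """Return a function that maps a list of strings to a list of vectors."""
    # Imported lazily so this module loads even without the package installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_id)
    # encode() returns a NumPy array by default; convert to plain lists.
    return lambda texts: model.encode(texts).tolist()


def embed_in_batches(texts, encode, batch_size=32):
    """Run `encode` over `texts` in fixed-size batches, preserving order."""
    embeddings = []
    for start in range(0, len(texts), batch_size):
        embeddings.extend(encode(texts[start:start + batch_size]))
    return embeddings
```

Usage would be along the lines of `embed_in_batches(texts, load_encoder("sentence-transformers/all-MiniLM-L6-v2"))`. Keeping the encoder as a plain callable also leaves room for the third-party API models behind the same interface.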

enjalot commented 2 months ago

I just pushed support for searching the HuggingFace hub for any sentence transformer model (defaulting to showing the 5 most downloaded) instead of needing to preconfigure each available model. Closing this issue as it's mostly supported via the new 0.4.0 release.
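A minimal sketch of that kind of search, assuming the public Hub REST endpoint `https://huggingface.co/api/models` and its `filter`, `sort`, `direction`, `limit`, and `search` query parameters. The function names are hypothetical and this is not necessarily how the 0.4.0 release implements it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

HUB_MODELS_ENDPOINT = "https://huggingface.co/api/models"


def build_search_url(query="", limit=5):
    """Build a Hub query for sentence-transformers models, most downloaded first."""
    params = {
        "filter": "sentence-transformers",  # restrict to models with this tag
        "sort": "downloads",
        "direction": -1,                    # descending order
        "limit": limit,
    }
    if query:
        params["search"] = query
    return f"{HUB_MODELS_ENDPOINT}?{urlencode(params)}"


def search_models(query="", limit=5):
    """Fetch matching model ids (needs network access)."""
    with urlopen(build_search_url(query, limit)) as response:
        # Assumes each result object carries an "id" field.
        return [model["id"] for model in json.load(response)]
```

Calling `search_models()` with an empty query would correspond to the default view of the 5 most downloaded models.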

I will open a new one with your request @arnicas