BerriAI / litellm

Python SDK, Proxy Server (LLM Gateway) to call 100+ LLM APIs in OpenAI format - [Bedrock, Azure, OpenAI, VertexAI, Cohere, Anthropic, Sagemaker, HuggingFace, Replicate, Groq]
https://docs.litellm.ai/docs/

[Feature]: Local embeddings through HuggingFace pipelines/SentenceTransformers #1647

Closed dhruv-anand-aintech closed 3 months ago

dhruv-anand-aintech commented 9 months ago

The Feature

Currently, I think the HuggingFace integration for embeddings relies on their free Inference API.

It'd be great to integrate with the SentenceTransformers library and the 'feature-extraction' pipeline in the Transformers library to allow users to compute embeddings locally using the litellm.embedding() function.
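
For reference, computing embeddings locally with SentenceTransformers directly looks roughly like this (the model name is just an example); the idea is to expose the same thing behind litellm.embedding():

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-MiniLM-L12-v2')
embeddings = model.encode(['good morning from litellm'])  # numpy array of shape (1, dim)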

Motivation, pitch

Same as above. Want to compute embeddings locally using the same interface as API providers.

Twitter / LinkedIn details

https://twitter.com/dhruv___anand

krrishdholakia commented 9 months ago

We support embeddings in huggingface's text-embeddings-inference format

Won't you be hosting your embedding model?
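
For reference, calling a self-hosted text-embeddings-inference endpoint through litellm looks roughly like this (the model name and endpoint are placeholders):

import os
from litellm import embedding

os.environ['HUGGINGFACE_API_KEY'] = '<YOUR-HF-TOKEN>'  # if your TEI server requires auth

response = embedding(
    model='huggingface/<YOUR-EMBEDDING-MODEL>',
    input=['good morning from litellm'],
    api_base='https://<YOUR-TEI-ENDPOINT>',
)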

dhruv-anand-aintech commented 9 months ago

I wouldn't be hosting it as a service like HF-TEI. I just want the library to wrap SentenceTransformers.encode and the Transformers feature-extraction pipeline locally, e.g. litellm.embedding('hf/miniLM-L12....', input), where the model runs on the same machine that calls the library.

dhruv-anand-aintech commented 8 months ago

Would you be fine with me sending a PR for this?

krrishdholakia commented 8 months ago

sure @dhruv-anand-aintech

krrishdholakia commented 8 months ago

Hey @dhruv-anand-aintech thinking aloud - would it be better if we just exposed a custom llm class interface, for easily adding custom providers?

stephenleo commented 8 months ago

@krrishdholakia I'm wondering if there can be an interface to add a custom caching function so the user can implement whatever caching logic they want.

I'd like to implement a custom caching service with custom models. The interface to litellm could be a simple Python function like the one below:

from typing import Union

import litellm
from litellm.caching import Cache

def custom_cache(input_messages) -> Union[None, str]:
    # custom logic here
    # e.g. return requests.post(custom_caching_service, input_messages)
    # return None if there is no cache hit, or the cached string on a hit
    ...

litellm.cache = Cache(type="custom", custom_cache_fn=custom_cache)

Happy to work together to submit a PR on this

krrishdholakia commented 8 months ago

isn't that this - https://docs.litellm.ai/docs/caching/redis_cache#custom-cache-keys? @stephenleo
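
For reference, the custom cache keys mechanism linked above looks roughly like this (the key composition is just an example):

import litellm
from litellm.caching import Cache

def custom_get_cache_key(*args, **kwargs):
    # build whatever cache key you want from the call arguments
    return kwargs.get('model', '') + str(kwargs.get('messages', '')) + str(kwargs.get('temperature', ''))

cache = Cache()
cache.get_cache_key = custom_get_cache_key  # override key generation only
litellm.cache = cache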

stephenleo commented 8 months ago

Maybe I misunderstand... I'd like to implement my own semantic cache function, something like below

Given the messages arg from litellm.completion:

  1. Choose the message to be used to search through the cache: it can be the last message only or some transformation of the entire message history.
  2. Run the chosen message through a custom embedding generation model. Say a fine-tuned HF sentence transformer model to generate embeddings.
  3. Compare the embeddings against a custom vectorDB to extract the top 10 similar passages.
  4. Rerank these top 10 passages through a custom fine-tuned HF cross-encoder model to pick the best match.
  5. Apply some heuristic filtering to prevent false positives.
  6. Return the answer from the best match if it passes the heuristic filtering; otherwise, return None.

These steps should make the semantic cache much more accurate than pure dense embedding-based semantic similarity matching.

I think others might have other ideas, so having a way to override the semantic similarity search logic in the semantic cache with custom logic should help free litellm from having to implement a lot of different methods like cross-encoders, ColBERT, etc.

Let me know if that makes sense?
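
A rough sketch of the pipeline above as a standalone function (the vector DB lookup is a placeholder callable, and the model names and threshold are illustrative):

from typing import Callable, List, Optional, Tuple

from sentence_transformers import CrossEncoder, SentenceTransformer

bi_encoder = SentenceTransformer('sentence-transformers/all-MiniLM-L12-v2')
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def custom_semantic_cache(
    input_messages: List[dict],
    vector_search: Callable[[List[float], int], List[Tuple[str, str]]],  # returns [(cached_prompt, cached_answer)]
    rerank_threshold: float = 5.0,  # heuristic cutoff, tune per model
) -> Optional[str]:
    # 1. choose the message to search with (here: just the last message)
    query = input_messages[-1]['content']

    # 2. embed it with a (possibly fine-tuned) sentence transformer
    query_embedding = bi_encoder.encode(query).tolist()

    # 3. fetch the top-10 candidates from the custom vector DB
    candidates = vector_search(query_embedding, 10)
    if not candidates:
        return None

    # 4. rerank candidates with a cross-encoder
    scores = cross_encoder.predict([(query, prompt) for prompt, _ in candidates])
    best_idx = max(range(len(scores)), key=lambda i: scores[i])

    # 5./6. heuristic filter: only return a sufficiently confident match
    if scores[best_idx] < rerank_threshold:
        return None
    return candidates[best_idx][1]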

ishaan-jaff commented 8 months ago

@stephenleo this should be possible today, is this what you need? If yes, I'll add it to the docs.

How to write custom add/get cache functions

Init Cache

from litellm.caching import Cache
cache = Cache()

Define custom add/get cache functions

def add_cache(self, result, *args, **kwargs):
    # your logic here
    ...

def get_cache(self, *args, **kwargs):
    # your logic here
    ...

Point the cache's add/get functions to your custom functions

cache.add_cache = add_cache
cache.get_cache = get_cache

ishaan-jaff commented 8 months ago

Added to docs too: https://docs.litellm.ai/docs/caching/redis_cache#how-to-write-custom-addget-cache-functions

stephenleo commented 8 months ago

Perfect! Thanks. I'll test it out

RussellLuo commented 4 months ago

For those who have the same requirement, I have figured out a quick solution.

First of all, implement a HuggingFace-compatible API:

# api.py

from flask import Flask, abort, jsonify, request
from sentence_transformers import SentenceTransformer

app = Flask(__name__)
model = SentenceTransformer('<YOUR-EMBEDDING-MODEL>')

def auth(token):
    api_key = token.removeprefix('Bearer ')
    # print(api_key)
    return True

@app.route('/embeddings', methods=['POST'])
def embed():
    if not auth(request.headers.get('Authorization', '')):
        abort(401)
    data = request.get_json()
    texts = data['inputs']
    embeddings = model.encode(texts)
    return jsonify(embeddings.tolist())

if __name__ == '__main__':
    app.run(host='127.0.0.1', port=5000)

Then run the API:

python api.py
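
A quick sanity check of the endpoint (assuming the Flask app above is running on port 5000):

import requests

resp = requests.post(
    'http://127.0.0.1:5000/embeddings',
    json={'inputs': ['good morning from litellm']},
    headers={'Authorization': 'Bearer <YOUR-API-KEY>'},
)
print(resp.json())  # list of embedding vectors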

Finally, just follow the guide from LiteLLM:

from litellm import embedding
import os
os.environ['HUGGINGFACE_API_KEY'] = '<YOUR-API-KEY>'
os.environ['HUGGINGFACE_API_BASE'] = 'http://127.0.0.1:5000/embeddings'

response = embedding(
    model='huggingface/<YOUR-EMBEDDING-MODEL>', 
    input=['good morning from litellm']
)

krrishdholakia commented 3 months ago

Quick update: You can now call custom APIs within litellm (no need to spin up an OpenAI-compatible server) - https://docs.litellm.ai/docs/providers/custom_llm_server

@RussellLuo @dhruv-anand-aintech feel free to make a PR if you have an implementation that works well for you!
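
For the original request, a rough sketch of what a local SentenceTransformers embedding provider could look like with that interface. This assumes CustomLLM exposes an embedding hook analogous to completion (check the linked docs for the exact signature); the class, provider, and model names are illustrative:

import litellm
from litellm import CustomLLM, embedding
from sentence_transformers import SentenceTransformer

class LocalSentenceTransformer(CustomLLM):
    def __init__(self):
        super().__init__()
        self.model = SentenceTransformer('sentence-transformers/all-MiniLM-L12-v2')

    def embedding(self, model: str, input: list, *args, **kwargs) -> litellm.EmbeddingResponse:
        # run the model locally, on the same machine as the caller
        vectors = self.model.encode(input).tolist()
        return litellm.EmbeddingResponse(
            model=model,
            data=[{'object': 'embedding', 'index': i, 'embedding': v} for i, v in enumerate(vectors)],
        )

litellm.custom_provider_map = [
    {'provider': 'local-st', 'custom_handler': LocalSentenceTransformer()}
]

response = embedding(model='local-st/all-MiniLM-L12-v2', input=['good morning from litellm'])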