BerriAI / litellm

Python SDK, Proxy Server (LLM Gateway) to call 100+ LLM APIs in OpenAI format - [Bedrock, Azure, OpenAI, VertexAI, Cohere, Anthropic, Sagemaker, HuggingFace, Replicate, Groq]
https://docs.litellm.ai/docs/

[Bug]: can't use huggingface embeddings via locally hosted `text-embeddings-inference`, fails on getting `model_info` #6711

Open prhbrt opened 4 days ago

prhbrt commented 4 days ago

What happened?

Using this config:

model_list:
  - model_name: bge-large-en-v1.5
    litellm_params:
      model: huggingface/BAAI/bge-large-en-v1.5
      api_base: http://localhost:8006/
      api_key: EMPTY
      max_parallel_requests: 16

which connects to:

docker run --gpus all -p 8006:80 -v embedding:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.5 --model-id 'BAAI/bge-large-en-v1.5'

Then, calling the litellm wrapper around the embeddings API:

curl -X 'POST'   'https://llm.hpc.rug.nl/v1/embeddings?model=bge-large-en-v1.5'   -H 'accept: application/json'   -H 'x-api-key: sk-abcd'   -d '{"input": "Yes"}'

fails with the following error:

litellm.APIConnectionError: Expecting value: line 1 column 1 (char 0)
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/litellm/main.py", line 3140, in aembedding
    response = await init_response  # type: ignore
               ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/litellm/llms/huggingface_restapi.py", line 1050, in aembedding
    data = self._transform_input(
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/litellm/llms/huggingface_restapi.py", line 950, in _transform_input
    hf_task = get_hf_task_embedding_for_model(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/litellm/llms/huggingface_restapi.py", line 285, in get_hf_task_embedding_for_model
    model_info_dict = model_info.json()
                      ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/httpx/_models.py", line 766, in json
    return jsonlib.loads(self.content, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Received Model Group=bge-large-en-v1.5
Available Model Group Fallbacks=None

It looks like litellm requests the base URL (api_base) directly, which indeed returns an empty response (should this be {api_base}/info instead?).

However, text-embeddings-inference should be supported according to this doc
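
As a quick sanity check of that hypothesis, here is a minimal sketch against the local TEI container started above. The /info route is assumed from the text-embeddings-inference API docs and is not something litellm calls today; adjust host/port to your deployment.

import httpx

api_base = "http://localhost:8006/"

# what the proxy currently does: GET on the bare api_base
root = httpx.get(api_base)
print(repr(root.text))   # empty body, so .json() raises "Expecting value: line 1 column 1"

# what a metadata lookup against TEI could look like instead
info = httpx.get(api_base.rstrip("/") + "/info")
print(info.json())       # TEI serves model metadata (model id etc.) on /info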

Relevant log output

(same traceback as in the description above, repeated in the logs)


Twitter / LinkedIn details

https://www.linkedin.com/in/herbertk
prhbrt commented 3 days ago

Because this is a breaking issue for us, I monkey-patched it as follows:

I injected a patched litellm/llms/huggingface_restapi.py into the container, wrapping the model_info.json() call in get_hf_task_embedding_for_model in a try/except, since model_info_dict.get defaults to None anyway.

I didn't fully understand the logging system: nothing from this code showed up in the logs after setting LITELLM_LOG=DEBUG, although other debug info was shown, so the variable did come through. I noticed there are different loggers, e.g. verbose_logger, but couldn't find anything about them in the docs, so I simply wrote debug output to an /oops.log file. This verified that api_base is indeed the root of the text-embeddings-inference instance, i.e. http://localhost:8006/, and that the response is indeed empty.

def get_hf_task_embedding_for_model(
    model: str, task_type: Optional[str], api_base: str
) -> Optional[str]:
    if task_type is not None:
        if task_type in get_args(hf_tasks_embeddings):
            return task_type
        else:
            raise Exception(
                "Invalid task_type={}. Expected one of={}".format(
                    task_type, hf_tasks_embeddings
                )
            )
    http_client = HTTPHandler(concurrent_limit=1)

    model_info = http_client.get(url=api_base)
    # debug writes used to confirm api_base and the empty response:
    # with open('/oops.log', 'a') as f:
    #     f.write('api_base: ')
    #     f.write(api_base)
    #     f.write('\ntext: \n')
    #     f.write(model_info.text)
    #     f.write('\n\n')
    try:
        model_info_dict = model_info.json()
        pipeline_tag: Optional[str] = model_info_dict.get("pipeline_tag", None)
    except Exception:
        # e.g. an empty body from a local text-embeddings-inference server;
        # fall back to None, matching the .get() default above
        pipeline_tag = None
    return pipeline_tag

Using Dockerfile:

# syntax=docker/dockerfile:1

FROM ghcr.io/berriai/litellm:main-v1.52.0

COPY ./huggingface_restapi.py /usr/local/lib/python3.11/site-packages/litellm/llms/huggingface_restapi.py

Now everything works, but of course this is just a patch. I couldn't find out what pipeline_tag is supposed to refer to, so I couldn't do better than to just drop it (sorry).
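
For reference, pipeline_tag looks like the task tag that the Hugging Face Hub reports for a model via its public model API; a rough illustration of where that value would normally come from (the Hub API URL and the example value are assumptions, unrelated to the locally hosted TEI server):

import httpx

# hypothetical lookup against the public Hugging Face Hub model API,
# not against the local text-embeddings-inference server
resp = httpx.get("https://huggingface.co/api/models/BAAI/bge-large-en-v1.5")
print(resp.json().get("pipeline_tag"))   # e.g. "feature-extraction"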

Let me know if I can do any more debugging for you.