continuedev / continue

⏩ Continue is the leading open-source AI code assistant. You can connect any models and any context to build custom autocomplete and chat experiences inside VS Code and JetBrains
https://docs.continue.dev/
Apache License 2.0

HuggingFace TGI Codellama support #438

Open taoari opened 1 year ago

taoari commented 1 year ago

Is your feature request related to a problem? Please describe.

HuggingFace TGI is a standard way to serve LLMs. Is it possible to add support for HuggingFace TGI served codellama models?
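
For reference, TGI exposes a plain HTTP API, so a minimal sketch of querying a locally served CodeLlama model might look like this (the URL, prompt, and parameters below are only placeholders):

import requests

# Assumed: a TGI server already running locally on port 8080.
resp = requests.post(
    "http://localhost:8080/generate",
    json={"inputs": "def fibonacci(n):", "parameters": {"max_new_tokens": 128}},
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["generated_text"])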


sestinj commented 1 year ago

Hi @taoari, I've started work on this here, but haven't yet added it to the documentation. It hasn't been tested yet, so there's a chance it already works, but it might require a bit of debugging.

sestinj commented 1 year ago

Usage would look like this:

from continuedev.src.continuedev.libs.llm.ht_tgi import HuggingFaceTGI
...
config=ContinueConfig(
  ...
  models=Models(
    default=HuggingFaceTGI(server_url="<SERVER_URL>")
  )
)
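
For reference, this snippet would go in the Continue config file (config.py under ~/.continue), with <SERVER_URL> replaced by the address of your TGI server, e.g. http://localhost:8080.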

I encountered friction installing TGI on my Mac, which is why I haven't fully tested this yet, so it would be super helpful if you wanted to give it a try.

taoari commented 1 year ago

@sestinj I got the following error

ModuleNotFoundError: No module named 'continuedev.src.continuedev.libs.llm.ht_tgi'

sestinj commented 1 year ago

Just a typo, it should be hf_tgi

You can check the file here to be sure: https://github.com/continuedev/continue/blob/main/continuedev/src/continuedev/libs/llm/hf_tgi.py
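
i.e. the corrected import is:

from continuedev.src.continuedev.libs.llm.hf_tgi import HuggingFaceTGI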

taoari commented 1 year ago

@sestinj No errors this time. But it still does not work. The "Play" button blinks all the time, and I get no response.

taoari commented 1 year ago

@sestinj I think it crashed on my computer.

I did the following:

- uninstall
- lsof -i :65432 | grep "(LISTEN)" | awk '{print $2}' | xargs kill -9
- delete ~/.continue
- reinstall

It still does not work; I always get "Continue Server Starting".

sestinj commented 1 year ago

Ok, I might just need to go back and test this myself then. I'll update you when it's ready.

Is Continue completely unable to start up again? In the worst case, I think uninstalling Continue and restarting VS Code should solve things.

Another way to make sure that no servers are running is just lsof -i :65432

You can check the logs with cmd+shift+p "View Continue Server Logs"
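
If you'd rather not use lsof, a rough Python equivalent of that port check (just a convenience sketch, nothing Continue-specific) is:

import socket

# Check whether anything is still listening on the Continue server port (65432).
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.settimeout(1)
    in_use = s.connect_ex(("127.0.0.1", 65432)) == 0
print("port 65432 is in use" if in_use else "port 65432 is free")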

abhinavkulkarni commented 1 year ago

I set up a local instance of TGI and added it in config.py as follows:

from continuedev.src.continuedev.libs.llm.hf_tgi import HuggingFaceTGI
...
config=ContinueConfig(
  ...
  models=Models(
    default=HuggingFaceTGI(server_url="http://localhost:8080")
  )
)

Please note, I am able to successfully obtain responses from /info, /generate and /generate_stream endpoints of TGI.

If I type a simple prompt in Continue box, I get the following error:

Traceback (most recent call last):

  File "continuedev/src/continuedev/libs/util/create_async_task.py", line 21, in callback
    future.result()

  File "asyncio/futures.py", line 203, in result

  File "asyncio/tasks.py", line 267, in __step

  File "continuedev/src/continuedev/core/autopilot.py", line 543, in create_title
    title = await self.continue_sdk.models.medium.complete(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "continuedev/src/continuedev/libs/llm/__init__.py", line 258, in complete
    completion = await self._complete(prompt=prompt, options=options)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "continuedev/src/continuedev/libs/llm/__init__.py", line 334, in _complete
    async for chunk in self._stream_complete(prompt=prompt, options=options):

  File "/var/folders/nw/hfwjfm7n6h13ybsw6kxqh08w0000gn/T/_MEI1quhS4/continuedev/src/continuedev/libs/llm/hf_tgi.py", line 55, in _stream_complete
    json_chunk = json.loads(chunk)
                 ^^^^^^^^^^^^^^^^^

  File "json/__init__.py", line 346, in loads

  File "json/decoder.py", line 337, in decode

  File "json/decoder.py", line 355, in raw_decode

json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

sestinj commented 1 year ago

@abhinavkulkarni I've just released a new version that I think will fix this. It was a very obvious mistake on our end
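
For anyone hitting the same JSONDecodeError: TGI's /generate_stream endpoint returns Server-Sent Events, so each chunk arrives as a line of the form data:{...} rather than bare JSON, and passing the raw line to json.loads fails exactly as in the traceback above. A minimal sketch of that kind of handling (illustrative only, not necessarily the exact change that was released):

import json

def parse_tgi_sse_line(line: bytes):
    # Strip the SSE "data:" prefix before decoding the JSON payload.
    text = line.decode("utf-8").strip()
    if not text.startswith("data:"):
        return None  # keep-alive or other non-data line
    return json.loads(text[len("data:"):].strip())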

abhinavkulkarni commented 1 year ago

Thanks @sestinj, I now get a new error:

Traceback (most recent call last):

  File "continuedev/src/continuedev/libs/util/create_async_task.py", line 21, in callback
    future.result()

  File "asyncio/futures.py", line 203, in result

  File "asyncio/tasks.py", line 267, in __step

  File "continuedev/src/continuedev/core/autopilot.py", line 543, in create_title
    title = await self.continue_sdk.models.medium.complete(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "continuedev/src/continuedev/libs/llm/__init__.py", line 258, in complete
    completion = await self._complete(prompt=prompt, options=options)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "continuedev/src/continuedev/libs/llm/__init__.py", line 334, in _complete
    async for chunk in self._stream_complete(prompt=prompt, options=options):

  File "/var/folders/nw/hfwjfm7n6h13ybsw6kxqh08w0000gn/T/_MEISzqrqf/continuedev/src/continuedev/libs/llm/hf_tgi.py", line 41, in _stream_complete
    args = self.collect_args(options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/var/folders/nw/hfwjfm7n6h13ybsw6kxqh08w0000gn/T/_MEISzqrqf/continuedev/src/continuedev/libs/llm/hf_tgi.py", line 37, in collect_args
    args.pop("functions")

KeyError: 'functions'

If I comment this line out, I get an "Error parsing JSON: Expecting value: line 1 column 1 (char 0)" error.

Please note, I get a successful response from my local TGI setup:

$ curl http://localhost:8080/generate -X POST -d '{"inputs":"Write a hello world Python program","parameters":{"max_new_tokens":512}}' -H 'Content-Type: application/json' | jq ".generated_text" -rc | cat

def main():
    print("Hello World")

if __name__ == "__main__":
    main()

sestinj commented 1 year ago

The "functions" error is an easy one. Let me give the other a deeper look and set TGI up on my own machine (embarrassing, but I haven't gotten to this yet, I was just following the API documentation). I think it might be something about how I'm calling the streaming endpoint.

The request I'm making right now is the equivalent of

curl -X POST -H "Content-Type: application/json" -d '{"inputs": "<prompt_value>", "parameters": {"max_new_tokens": 1024}}' http://localhost:8080/generate_stream
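
For reference, the same streaming request from Python (a sketch that assumes the requests package and a local TGI instance on port 8080; the token fields follow TGI's SSE event format):

import json
import requests

with requests.post(
    "http://localhost:8080/generate_stream",
    json={"inputs": "<prompt_value>", "parameters": {"max_new_tokens": 1024}},
    stream=True,
) as resp:
    for line in resp.iter_lines():
        # Each SSE event arrives as a line like: data:{"token": {...}, ...}
        if line and line.startswith(b"data:"):
            event = json.loads(line[len(b"data:"):])
            print(event["token"]["text"], end="", flush=True)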

sestinj commented 1 year ago

Resuming work in the morning; it has been a slight pain to set up TGI on a Mac.

If there's any chance you've seen this error, I'd be curious how you solved it. Otherwise I'm sure I'll get it tomorrow.

RuntimeError: An error occurred while downloading using `hf_transfer`. Consider disabling HF_HUB_ENABLE_HF_TRANSFER for better error handling.

sestinj commented 1 year ago

@abhinavkulkarni finally got it, and successfully tested on my own TGI setup. Let me know if any problems remain.

and @taoari this should solve your error as well

abhinavkulkarni commented 1 year ago

Thanks @sestinj, local TGI setup works and I can generate responses from it.

However, I am not able to feed it context by selecting code; please see the attached video. You can see responses being generated from the TGI in the integrated terminal window. Please note, when I switch to the MaybeProxyOpenAI model, it does work and is able to answer questions based on the highlighted context.

[attached video: output]

abhinavkulkarni commented 1 year ago

Also, for Llama 2 models, </s> is a special token that indicates end of text/sequence and should not be displayed.

You can see in the attached image below that it is shown in the title.

[attached image]

sestinj commented 1 year ago

@abhinavkulkarni I have a suspicion that the code is in the prompt, but the model is ignoring it. If you try this again and hover over the response, a magnifying glass button will show up. Clicking that shows the full prompts/completions as sent to the LLM. Could you share what that looks like?

We have a stop parameter that can be set for the model, but since CodeLlama/Llama is usually the model people use, I think it would be sensible to have </s> as the default there. I'm also noticing the [PYTHON] tags are probably a bit annoying; I'll make a change so they are converted to triple backticks.
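
As a rough illustration of what that kind of post-processing could look like (a sketch, not the actual change in Continue):

def clean_completion(text: str) -> str:
    # Drop the Llama end-of-sequence token and convert CodeLlama's
    # [PYTHON]...[/PYTHON] markers into fenced code blocks.
    text = text.replace("</s>", "")
    text = text.replace("[PYTHON]", "```python\n").replace("[/PYTHON]", "\n```")
    return text.strip()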

abhinavkulkarni commented 1 year ago

Thanks, @sestinj, here's a video screengrab for a simple prompt. This is the full prompt and the response:

This is a log of the prompt/completion pairs sent/received from the LLM during this step

############################################

Prompt: 

[INST] Tell me what this code is doing.
[/INST]

############################################

Completion: 

 This code is using the `requests` library to make a GET request to the URL `https://api.github.com/users/octocat/repos`. The `json()` method is used to parse the response as JSON data, and the `for` loop is used to iterate over the list of repositories returned in the response.

For each repository, the code is printing the repository name and the number of stars it has. The `print()` function is used to display the output.

This code is using the GitHub API to retrieve a list of repositories for the user "octocat" and then printing the name and number of stars for each repository..</s>

############################################

Prompt: 

[INST] " This code is using the `requests` library to make a GET request to the URL `https://api.github.com/users/octocat/repos`. The `json()` method is used to parse the response as JSON data, and the `for` loop is used to iterate over the list of repositories returned in the response.

For each repository, the code is printing the repository name and the number of stars it has. The `print()` function is used to display the output.

This code is using the GitHub API to retrieve a list of repositories for the user "octocat" and then printing the name and number of stars for each repository..</s>"

Please write a short title summarizing the message quoted above. Use no more than 10 words:
[/INST]

############################################

Completion: 

 ""The only way to do great work is to love what you do." ― Steve Job

[attached video: output]

sestinj commented 1 year ago

Thanks. My suspicion was wrong... but I see the problem! This is actually fixable through the config file, but I'll change the default to be the correct thing and push a new version soon.

There is a template_messages property of all LLM classes that converts chat history into a templated prompt, and the function I have as the default for HuggingFaceTGI is cutting out the chat history. The correct thing would look like this:

from continuedev.src.continuedev.libs.llm.prompts.chat import llama2_template_messages
...
...
default=HuggingFaceTGI(..., template_messages=llama2_template_messages)
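
For a rough sense of what that does: a Llama 2 template folds the chat history into the [INST] ... [/INST] format seen in the logs above. A simplified sketch (the real llama2_template_messages also handles system messages and special tokens):

def llama2_style_prompt(messages):
    # messages: list of {"role": "user" | "assistant", "content": str}
    prompt = ""
    for msg in messages:
        if msg["role"] == "user":
            prompt += f"[INST] {msg['content']} [/INST]"
        else:
            prompt += f" {msg['content']} </s>"
    return prompt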

sestinj commented 1 year ago

@abhinavkulkarni just released a new version; this is now the default, so highlighted code will now be included.

abhinavkulkarni commented 1 year ago

Thanks, @sestinj, things work perfectly now, except for one small detail. The title generated seems to be random and has nothing to do with the prompt. I am attaching an example screengrab here.

[attached screengrab: output]

Also attaching all the prompt/completion pairs.

[attachment: prompt-completion.txt]

sestinj commented 1 year ago

Which model are you using? I can then just test out the exact prompt here until I find something more reliable

The prompt looks OK, other than the end token. I'm adding a stop_tokens option to the LLM class; there's a small chance that fixes it, but likely not.
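
Once that lands, usage would presumably look something like default=HuggingFaceTGI(..., stop_tokens=["</s>"]) in config.py.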

sestinj commented 1 year ago

Also relevant for now might be the "disable_summaries" option in config.py depending on how bad it is: https://continue.dev/docs/reference/config#:~:text=token%20is%20provided.-,disable_summaries,-(boolean)%20%3D%20False
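
(That is, setting disable_summaries=True on the ContinueConfig in config.py, which should turn off the automatic title/summary generation.)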

abhinavkulkarni commented 1 year ago

Hey @sestinj,

Also relevant for now might be the "disable_summaries" option in config.py

Thanks, that works.

Which model are you using?

I am using a 4-bit AWQ quantized version of codellama/CodeLlama-7b-Instruct-hf, but you won't be able to run it on CPU (I read in one of your previous replies that you are running these on a Mac). If so, you may want to test with a 4-bit GGML/GGUF version of this model to see if you too get random quotes as titles.

Another problem I have observed is that the last character in the completion tends to be repeated: if it is a period or an exclamation mark, it appears twice. If I feed the same prompt to my local TGI using curl, I don't get this repetition.

Here's the screengrab attached:

[attached screengrab: output]

sestinj commented 1 year ago

Ok, cool. I'll see what I can find. Seems like Continue is just extra excited lol !!

sestinj commented 1 year ago

@abhinavkulkarni just wanted to update you on this since I know it's been a while - I've been planning on potentially using LiteLLM to make API calls to different providers, such as HuggingFace TGI, and this would solve the above problem, so I've decided to postpone digging into it myself. I'll let you know as soon as there's an update here!

Also thought you might want to know this: I talked to them and they mentioned that you were a contributor :)

krrishdholakia commented 1 year ago

👋 @abhinavkulkarni