Pythagora-io / gpt-pilot

The first real AI developer

[Enhancement]: Add Support for Open LLMs .. #640

Closed dannycoin closed 2 months ago

dannycoin commented 5 months ago

Version

Visual Studio Code extension

Suggestion

Please add support for open LLMs compatible with endpoint API for LLM Studio / ollama / etc.
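
For context, gpt-pilot already reads its endpoint and key from a .env file (see the llm-connection.py snippet later in this thread), so a rough sketch of pointing it at a local backend could look like the following, assuming the backend exposes an OpenAI-compatible /v1/chat/completions route (the ports below are common LM Studio and Ollama defaults, not something confirmed in this issue):

    # .env (illustrative values only)
    # LM Studio's local server usually listens on port 1234:
    OPENAI_ENDPOINT=http://localhost:1234/v1/chat/completions
    # Ollama's OpenAI-compatible endpoint usually listens on port 11434:
    # OPENAI_ENDPOINT=http://localhost:11434/v1/chat/completions
    OPENAI_API_KEY=dummy-key  # most local backends accept any non-empty key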

AtHeartEngineer commented 5 months ago

Similar issue here: https://github.com/Pythagora-io/gpt-pilot/issues/404#issuecomment-1868489641

Tutorial on how to run gpt-pilot with litellm: https://github.com/Pythagora-io/gpt-pilot/wiki/Using-GPT‐Pilot-with-Local-LLMs

rikbon commented 4 months ago

Sorry to bump this issue, but since updating to the new version of the VS Code plugin (Pythagora), I get this error in Ollama:

With Ollama + starcoder:3b:

{"function":"update_slots","level":"INFO","line":1635,"msg":"slot released","n_cache_tokens":26,"n_ctx":2048,"n_past":25,"n_system_tokens":0,"slot_id":0,"task_id":0,"tid":"15404","timestamp":1710259044,"truncated":false}
{"function":"launch_slot_with_data","level":"INFO","line":826,"msg":"slot is processing task","slot_id":0,"task_id":6,"tid":"15404","timestamp":1710259150}
{"function":"update_slots","level":"INFO","line":1801,"msg":"slot progression","n_past":23,"n_prompt_tokens_processed":0,"slot_id":0,"task_id":6,"tid":"15404","timestamp":1710259150}
{"function":"update_slots","level":"INFO","line":1812,"msg":"we have to evaluate at least 1 token to generate logits","slot_id":0,"task_id":6,"tid":"15404","timestamp":1710259150}
{"function":"update_slots","level":"INFO","line":1825,"msg":"kv cache rm [p0, end)","p0":22,"slot_id":0,"task_id":6,"tid":"15404","timestamp":1710259150}
[GIN] 2024/03/12 - 16:59:11 | 200 |    310.2263ms |       127.0.0.1 | POST     "/api/chat"
{"function":"update_slots","level":"INFO","line":1635,"msg":"slot released","n_cache_tokens":26,"n_ctx":2048,"n_past":25,"n_system_tokens":0,"slot_id":0,"task_id":6,"tid":"15404","timestamp":1710259151,"truncated":false}

The previous version worked fine!

Wladastic commented 3 months ago

Same issue here:

Debugger Agent

" " "))There was a problem with request to openai API: LLM did not respond with JSON

I went through the code and set all the temperatures to 1.0, since a temperature of 1.0 is perfect for Mistral-7B, but it still outputs only gibberish.

Wladastic commented 3 months ago

I fixed it; it now works with the oobabooga text-generation-webui API. Edit the file gpt-pilot/pilot/utils/llm-connection.py, lines 415 to 442:

    if endpoint == 'AZURE':
        # If yes, get the AZURE_ENDPOINT from the .env file
        endpoint_url = os.getenv('AZURE_ENDPOINT') + '/openai/deployments/' + model + '/chat/completions?api-version=2023-05-15'
        headers = {
            'Content-Type': 'application/json',
            'api-key': get_api_key_or_throw('AZURE_API_KEY')
        }
    elif endpoint == 'OPENROUTER':
        # If so, send the request to the OpenRouter API endpoint
        endpoint_url = os.getenv('OPENROUTER_ENDPOINT', 'https://openrouter.ai/api/v1/chat/completions')
        headers = {
            'Content-Type': 'application/json',
            'Authorization': 'Bearer ' + get_api_key_or_throw('OPENROUTER_API_KEY'),
            'HTTP-Referer': 'https://github.com/Pythagora-io/gpt-pilot',
            'X-Title': 'GPT Pilot'
        }
        data['max_tokens'] = MAX_GPT_MODEL_TOKENS
        data['model'] = model
    else:
        # If not, send the request to the OpenAI endpoint
        endpoint_url = os.getenv('OPENAI_ENDPOINT', 'https://api.openai.com/v1/chat/completions')
        headers = {
            'Content-Type': 'application/json',
            'Authorization': 'Bearer ' + get_api_key_or_throw('OPENAI_API_KEY')
        }
        data['model'] = model
        if endpoint_url != 'https://api.openai.com/v1/chat/completions':
            # Extra fields understood by oobabooga text-generation-webui's OpenAI-compatible API
            data['mode'] = 'instruct'
            data['temperature'] = 1.0
            data['max_new_tokens'] = 4000
            data['max_tokens'] = MAX_GPT_MODEL_TOKENS
            data['user_bio'] = ''
            # data['preset'] = "My Preset"
            # data['instruction_template'] = "ChatML"
            data['stop_sequence'] = 'DONE'
            data['truncation_length'] = MAX_GPT_MODEL_TOKENS * 2.5
hqnicolas commented 3 months ago

@Wladastic can you try deepseek-coder:33b-instruct-q5_K_M or deepseek-coder:6.7b-instruct?

Wladastic commented 3 months ago

@hqnicolas deepseek is horrible for this. I don't even get proper JSON. I found a few that work perfectly:

hqnicolas commented 3 months ago

gemma:7b-text-q8_0 (5% stuck on JSON problems; best experience)
gemma:7b-instruct-q8_0 (50% stuck on JSON problems)
codellama:7b-instruct-fp16 (40% stuck on JSON problems)
codellama:7b-instruct-q5_K_M (skip)
neural-chat:7b-v3.3-q5_K_M (30% stuck on JSON problems)

My setup: https://gist.github.com/hqnicolas/d00ff0a4378e23ac1cf0375e02ca9b48
My laptop: https://gist.github.com/hqnicolas/0119695ea2c66945c26809eaebf8615d

Using Open WebUI to create a profile for Gemma: https://openwebui.com/m/hotnikq/gemma-gpt-pilot:latest

Wladastic commented 3 months ago

@hqnicolas I totally forgot to test gemma:7b-it. How does it run? The 2B is similar in speed to Mistral. Oh, and I forgot one model above: Capybara Hermes Mistral 7B is outstandingly good, but Mistral 7B Pro is better in my opinion. I wish they made a Capybara model for Pro.

I would really like a benchmarking script for different models; LM Studio as well as oobabooga support model switching via API, so something like the sketch below would be possible.
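
Nothing like that exists in this thread, but a minimal sketch of one could look like this: it posts the same prompt to an OpenAI-compatible endpoint for each model and scores how often the reply parses as JSON. The endpoint URL, model names, and the pass/fail criterion are all illustrative assumptions.

    import json
    import requests

    # Illustrative defaults; adjust to your backend (oobabooga, LM Studio, ...)
    ENDPOINT = "http://127.0.0.1:5000/v1/chat/completions"
    MODELS = ["mistral-7b-instruct-v0.2", "nous-hermes-2-pro-7b"]
    PROMPT = 'Respond in JSON ONLY with an object like {"status": "ok"}.'

    def json_success_rate(model: str, trials: int = 5) -> float:
        """Fraction of responses that parse as valid JSON."""
        ok = 0
        for _ in range(trials):
            resp = requests.post(ENDPOINT, json={
                "model": model,  # how (or whether) the backend hot-swaps models varies
                "messages": [{"role": "user", "content": PROMPT}],
                "temperature": 1.0,
            }, timeout=300)
            text = resp.json()["choices"][0]["message"]["content"]
            try:
                json.loads(text)
                ok += 1
            except json.JSONDecodeError:
                pass
        return ok / trials

    for m in MODELS:
        print(f"{m}: {json_success_rate(m):.0%} valid JSON")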

hqnicolas commented 3 months ago

@Wladastic I'm using nous-hermes2:10.7b-solar-q6_K with OpenDevin and it works fine!

Wladastic commented 3 months ago

I cannot confirm this with gpt-pilot. What works best for me is still Nous Hermes 2 Pro 7B GGUF with Q8_0 quantization.

But very important: set n_batch and the alpha value to the ones in my screenshot. Context length should be as large as possible, but my RTX 4080 cannot handle more than 17k with n_batch at 1024.

[screenshot: loader settings showing the n_batch and alpha values]
hqnicolas commented 3 months ago

@Wladastic my model was running on an AMD RX 7800 XT with ROCm; it's a 16 GB card. I will try to emulate those values in my Ollama setup and will use your parameters, based on https://github.com/ollama/ollama/blob/main/docs/modelfile.md:

num_ctx | Sets the size of the context window used to generate the next token. (Default: 2048)

I think I cannot set the batch size (n_batch) there. Note that the usual definition ("the number of training samples to work through before the model's internal parameters are updated") is the training batch size; here n_batch is llama.cpp's prompt-processing batch size.
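
For what it's worth, a minimal Modelfile sketch based on those docs (num_ctx is documented there, while the base model and values here are only illustrative):

    # Modelfile: illustrative, not a tested configuration
    FROM nous-hermes2:10.7b-solar-q6_K

    # Raise the context window above the 2048-token default
    PARAMETER num_ctx 8192
    PARAMETER temperature 1.0

Then build it with `ollama create hermes-pilot -f Modelfile` and point gpt-pilot at the resulting model.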

piwi3910 commented 3 months ago

@Wladastic do you have any other magic to share? I've applied your code change, used my own preset with the temperature at 1, set all the settings as in your screenshot, and I'm running the same model. However, I can't even get through the example application, so I'm not sure what I'm missing to get decent JSON out of it. It's still looping like crazy.

[screenshot: gpt-pilot looping on the example application]

Wladastic commented 3 months ago

@piwi3910 Yes I do. To fix this issue you have to edit invalid_json.prompt, which is at gpt-pilot/pilot/prompts/utils/invalid_json.prompt.

The last sentence in there is stupid: giving negative examples is the worst thing you can do. Change it to this:

I received an invalid JSON response. The response was a parseable JSON object, but it is not valid against the schema I provided. The JSON is invalid: {{ invalid_reason }}

Please try again with a valid JSON object, referring to the previous JSON schema I provided above.

Respond in JSON ONLY, DO NOT ADD ANYTHING ELSE.

You can also experiment with "If the JSON is incomplete, remove the last incomplete entry."

phalexo commented 3 months ago

I cannot confirm this with gpt-pilot. What works best for me is still Nous Hermes 2 Pro 7B GGUF with Q8_0 quantization.

But very important: set n_batch and the alpha value to the ones in my screenshot. Context length should be as large as possible, but my RTX 4080 cannot handle more than 17k with n_batch at 1024.

@Wladastic, what is the difference between Gpt-Pilot and gpt-pilot, if any? When you say it worked with "hermes........," did it produce runnable code that did what you requested? Or does that just mean you were able to connect to the model?

Thanks.

Wladastic commented 3 months ago

@phalexo The code is runnable. Gpt-Pilot (no different from gpt-pilot; my iPhone kept capitalizing it) works best with the non-quantized version of Mistral-7B v0.2 and Hermes 2 Pro. You can connect to any model; I have no idea what that question is supposed to suggest.

phalexo commented 3 months ago

Some people think that if you can connect to a local API, it actually works. I just wanted to make certain that Mistral-7B v0.2 and Hermes 2 Pro "work" in the sense of successfully generating code that runs.


Wladastic commented 3 months ago

It does work; these models do generate actual code. The quantized models sometimes get stuck if they produce too much context, but so far the unquantized ones just do not produce more than they can handle.

phalexo commented 3 months ago

What is your setup? Do you use ollama and define a system prompt for the models in the import file, or do you use some litellm/ollama setup? Did you use ollama to pull the model weights directly to disk, or did you download the weights first and then use ollama to import them? Thanks.


Wladastic commented 3 months ago

There is not only ollama and litellm. I use oobabooga text-generation-webui with its API enabled, which emulates the OpenAI API, as well as LM Studio, and I wrote my own adapter based on the NousHermes function-calling repo on GitHub. Oobabooga works best so far, as it offers the most compatibility, and I cannot get a pure Transformers model to run in ollama.
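
For anyone wondering what such an adapter involves, here is a toy sketch (not Wladastic's actual code) of a server that fakes the one OpenAI route gpt-pilot talks to; the generate() stub and the response shape are assumptions based on the standard chat-completions format:

    # Toy OpenAI-compatible adapter sketch; run with: uvicorn adapter:app --port 5000
    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()

    class ChatRequest(BaseModel):
        model: str
        messages: list[dict]
        temperature: float = 1.0

    def generate(prompt: str, temperature: float) -> str:
        # Placeholder: call your local model (transformers, llama.cpp, ...) here.
        return '{"response": "stub"}'

    @app.post("/v1/chat/completions")
    def chat_completions(req: ChatRequest):
        prompt = "\n".join(m["content"] for m in req.messages)
        text = generate(prompt, req.temperature)
        # Minimal OpenAI-style response shape that clients such as gpt-pilot expect
        return {
            "choices": [{
                "index": 0,
                "message": {"role": "assistant", "content": text},
                "finish_reason": "stop",
            }]
        }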

phalexo commented 3 months ago

As far as the custom adapter is concerned: did you implement the specific set of functions that gpt-pilot uses from the OpenAI API? And is the adapter part of the gpt-pilot code in your implementation, or did you make it a separate service?


techjeylabs commented 2 months ago

Hey there, local LLMs can now be fully integrated, so I am closing this issue. You can learn more about how to set up local LLMs here -> https://github.com/Pythagora-io/gpt-pilot/wiki/Using-GPT%E2%80%90Pilot-with-Local-LLMs