Significant-Gravitas / AutoGPT

AutoGPT is the vision of accessible AI for everyone, to use and to build on. Our mission is to provide the tools, so that you can focus on what matters.
https://agpt.co
MIT License

Ability to use LLaMA models so we don't have to pay for the OpenAI API? #438

Closed mastachef closed 1 year ago

mastachef commented 1 year ago

Duplicates

Summary 💡

https://github.com/facebookresearch/llama

Would it be possible to make this run with that model, so we could run it locally rather than paying for the OpenAI API, which is quite spendy?

Examples 🌈

No response

Motivation 🔦

No response

keenborder786 commented 1 year ago

This seems interesting. I will have a look at it and will try to come up with a solution.

pdolinic commented 1 year ago

Would be the most awesome thing ever, plus we would own the data.

I would suggest plugging https://github.com/lm-sys/FastChat in if someone has the skills to do so.

Update: https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm also looks interesting

igorbarshteyn commented 1 year ago

Absolutely agree. I forked this and tried to slot in https://pypi.org/project/llama-cpp-python/ bindings in place of the OpenAI calls in llm-utils.py, but it turned out to be way more complex than that, and unfortunately I don't have the time :(
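For anyone attempting the same swap, here's a rough sketch of what the replacement call might look like, assuming llama-cpp-python and a local GGML model file (the path and generation parameters below are placeholders, not anything from the Auto-GPT codebase):

# Hypothetical sketch: swapping an OpenAI chat call for llama-cpp-python.
from llama_cpp import Llama

llm = Llama(model_path="./models/llama-13b.ggml.bin", n_ctx=2048)  # placeholder path

def create_chat_completion(messages, temperature=0.7, max_tokens=512):
    # Mirrors the shape of openai.ChatCompletion.create(...) responses.
    response = llm.create_chat_completion(
        messages=messages,
        temperature=temperature,
        max_tokens=max_tokens,
    )
    return response["choices"][0]["message"]["content"]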

pmb2 commented 1 year ago

> Would be the most awesome thing ever, plus we would own the data.
>
> I would suggest plugging https://github.com/lm-sys/FastChat in if someone has the skills to do so.

This is the way.

nponeccop commented 1 year ago

Duplicate of #414 and #461

erkkimon commented 1 year ago

Indeed, this would also make sense from a cybersecurity point of view.

DifferentialityDevelopment commented 1 year ago

I'm fairly familiar now with the process of loading the Vicuna 13B model and using it. The catch is that these LLaMA models are totally reliant on prompt engineering, but they do have the capability to act as a reasoner, act as a criticizer, etc., so it's doable. Getting one to run locally is also not too hard, and you could spawn a local API endpoint for it so that the calls to OpenAI are replaced with calls to your local Vicuna model. The best thing, IMO, about using these models instead of OpenAI's GPT is that there is zero cost if you run them locally: my RTX 3060 with 12 GB of VRAM can run a 13B model, and I've heard the 30B 4-bit quantized model can run on a 24 GB GPU.
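A rough illustration of that "local API endpoint" idea (not anything from Auto-GPT itself; the server command, URL, and model name below are assumptions): llama-cpp-python ships an OpenAI-compatible server, so the existing openai calls can be redirected rather than rewritten.

# Hypothetical sketch: point the openai client at a locally hosted,
# OpenAI-compatible server instead of api.openai.com.
# Start the server first, e.g.:
#   python -m llama_cpp.server --model ./models/vicuna-13b.ggml.bin
import openai

openai.api_base = "http://localhost:8000/v1"  # local endpoint, not OpenAI
openai.api_key = "sk-local-placeholder"       # no real key needed locally

reply = openai.ChatCompletion.create(
    model="vicuna-13b",  # informational for most local servers
    messages=[{"role": "user", "content": "Break 'research LLaMA' into three subtasks."}],
)
print(reply["choices"][0]["message"]["content"])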

I'm actually making use of the Vicuna 13B 4-bit quantized 128g model for my game instead of relying on OpenAI's gpt-3.5-turbo, as that would've made my game totally uneconomical to play :')

If I may give some pointers to anyone looking to modify Auto-GPT to make use of a local Vicuna model: take a look at the GPTQ-for-LLaMa repo and GPTQLoader.py in text-generation-webui/modules, which shows the overall process for loading the 4-bit quantized Vicuna model. You can then skip API calls altogether by doing the inference locally, passing in the chat context exactly as you need it, and then just parsing the response. (Response parsing would definitely need to change; prompt engineering is usually specific to the model, and prompts written for GPT usually don't work as well for LLaMA models.)
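On the response-parsing point, a small sketch of the kind of tolerant parsing that tends to be needed, since local models often wrap the JSON Auto-GPT expects in extra prose (the example reply and keys below are made up for illustration):

# Hypothetical sketch: pull the first JSON object out of a chatty local-model reply.
import json
import re

def extract_json_reply(raw_reply):
    # Grab everything from the first '{' to the last '}' and try to parse it.
    match = re.search(r"\{.*\}", raw_reply, re.DOTALL)
    if not match:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

reply = 'Sure! {"thoughts": {"text": "plan"}, "command": {"name": "google", "args": {"input": "llama"}}}'
print(extract_json_reply(reply)["command"]["name"])  # -> google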

keldenl commented 1 year ago

> Absolutely agree. I forked this and tried to slot in https://pypi.org/project/llama-cpp-python/ bindings in place of the OpenAI calls in llm-utils.py, but it turned out to be way more complex than that, and unfortunately I don't have the time :(

I'm trying exactly this but ran into issues with the embedding, getting the following error:

openai.error.APIError: Invalid response object from API: '{"detail":[{"loc":["body","input"],"msg":"str type expected","type":"type_error.str"}]}' (HTTP response code was 422)

Seems like llama-cpp-python only supports string input for embeddings? I don't know much about embeddings so I'm kinda confused (somebody smart pls help)

I modified the get_ada_embedding function to pass in text instead of [text], and it seemed like it might've created the embedding (I printed it), but perhaps incorrectly?

def get_ada_embedding(text):
    # Collapse newlines before embedding, as the original helper does.
    text = text.replace("\n", " ")
    # Pass a plain string (rather than [text]), since the local backend
    # expected a string input for embeddings.
    return openai.Embedding.create(input=text, model="<YOUR_MODEL_HERE>")["data"][0]["embedding"]

Here's the error I'm getting instead:

pinecone.core.client.exceptions.ApiException: (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'content-type': 'application/json', 'date': 'Sun, 09 Apr 2023 06:29:29 GMT', 'x-envoy-upstream-service-time': '1', 'content-length': '110', 'server': 'envoy'})
HTTP response body: {"code":3,"message":"Query vector dimension 5120 does not match the dimension of the index 1536","details":[]}

GoMightyAlgorythmGo commented 1 year ago

They don't seem remotely good enough. Maybe "Vicuna" or "gpt4all" could work for very simple tasks, like creating extra agents with preprogrammed, clear outlines: Format [here date and time], [here this], [here that]. But it seems at least ChatGPT 3.5-turbo is necessary.

alreadydone commented 1 year ago

@keldenl Looks like you're using LLaMA 13B with embedding dimension 5120. PineconeMemory embedding dimension is hardcoded as 1536 so you probably need to change it.
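For reference, a sketch of the kind of change that implies, assuming the old pinecone-client API Auto-GPT used at the time (the index name and constant below are placeholders; the real value lives wherever PineconeMemory creates its index):

# Hypothetical sketch: the Pinecone index dimension must match the embedding
# size your model produces (5120 for LLaMA 13B, 1536 for text-embedding-ada-002).
import pinecone

EMBED_DIM = 5120  # set to your local model's embedding size

pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENV")
if "auto-gpt" not in pinecone.list_indexes():
    pinecone.create_index("auto-gpt", dimension=EMBED_DIM, metric="cosine")
index = pinecone.Index("auto-gpt")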

emskiemre commented 1 year ago

@DifferentialityDevelopment maybe you can explain how this works? https://github.com/BillSchumacher/Auto-Vicuna

I've been trying it, but it won't work ...

Qoyyuum commented 1 year ago

Closing this issue as a duplicate of #567.