Closed: FiveTechSoft closed this issue 1 year ago.
Check #local-models in the Discord; there are at least two people working on this.
Duplicate of #461 and #143
It doesn't work great, yet.
People vastly underestimate the quality of GPT-4 and how hard it is to compete with it. But time will tell, and FOSS models are useful as helpers anyway.
> It doesn't work great, yet.
Maybe this model will help: https://huggingface.co/eachadea/ggml-toolpaca-13b-4bit. It includes the weights of Meta's open-source implementation of Toolformer ("Language Models Can Teach Themselves to Use Tools", Meta AI) recombined with LLaMA.
FOSS
Foss?
Are you banned from both Google and ChatGPT? :-) Free Open Source Software
> Are you banned from both Google and ChatGPT?
Tried Google, but thanks. And: Open Source FTW
Yes, this world needs open source. Especially when talking about autonomous AI.
Fully agree, but currently there's no open source organisation with the amount of capital required to buy/rent that many GPUs to compete with openai/google/etc. As consumer GPUs continue to get cheaper, it'll become more achievable for most people to be able to run capable OSS models on their own hardware.
> currently there's no open source organisation with the amount of capital required to buy/rent that many GPUs to compete with openai/google/etc.
Maybe you'd be interested in signing this petition: https://www.openpetition.eu/petition/online/securing-our-digital-future-a-cern-for-open-source-large-scale-ai-research-and-its-safety
> This facility, analogous to the CERN project in scale and impact, should house a diverse array of machines equipped with at least 100,000 high-performance state-of-the-art accelerators (GPUs or ASICs), operated by experts from the machine learning and supercomputing research community and overseen by democratically elected institutions in the participating nations.
And how about decentralized GPUs? We had SETI@home two decades ago, so I guess the free internet in the era of crypto will figure this out as well. Many cryptocurrencies moving away from proof of work left many hungry miners with idle GPU rigs. Team FOSS will win this game!
Run 100B+ language models at home, BitTorrent-style:
Run large language models like BLOOM-176B collaboratively — you load a small part of the model, then team up with people serving the other parts to run inference or fine-tuning. Single-batch inference runs at ≈ 1 sec per step (token) — up to 10x faster than offloading, enough for chatbots and other interactive apps. Parallel inference reaches hundreds of tokens/sec.
In progress https://github.com/BillSchumacher/Auto-GPT/tree/vicuna
How's it going?
Thanks Bill for the contributions; if you need help with anything, let us know.
> In progress https://github.com/BillSchumacher/Auto-GPT/tree/vicuna
> How's it going?
The prompts used with OpenAI don't work the same with Vicuna. So we need to find the right prompts to use with it.
> In progress https://github.com/BillSchumacher/Auto-GPT/tree/vicuna
> How's it going?
> The prompts used with OpenAI don't work the same with Vicuna. So we need to find the right prompts to use with it.
Makes sense... Maybe we can have a file with all the prompts needed for each step, that way we can "easily" tweak the prompts from one place...
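For illustration, here's a minimal sketch of what such a central prompts file could look like (the file name, keys, and templates below are hypothetical, not what Auto-GPT currently uses):

```python
# prompts.py (hypothetical) -- one place to tweak the prompt used for each step
PROMPTS = {
    "select_command": (
        'From the list of commands {commands}, select the most appropriate '
        'for the arguments "{task}" and provide your answer in JSON format '
        '{{ "command": ..., "argument": ... }} only.'
    ),
    "improve_code": 'Improve this code "{code}" so that it {goal}.',
}

def build_prompt(step: str, **kwargs) -> str:
    """Fill in the template for a given step."""
    return PROMPTS[step].format(**kwargs)

# Example usage:
# build_prompt("select_command", commands='"search internet", "read file"',
#              task="get info from www.test.com")
```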
I have started testing with some prompts to simulate autoGPT behavior with Vicuna:
> from the list of commands "search internet", "get web contents", "execute", "delete file", "enhance code", "read file", "search file" select the most appropriate for the arguments "get info from www.test.com" and provide your answer in json format { "command", "argument" } only
{ "command": "get web contents", "argument": ["get", "info", "from", "www.test.com"] }
These prompts generate code with Vicuna:
improve this code "int main()" to build an ERP
Write the python code for a neural network example
If you want, I can post here the prompts that AutoGPT and BabyAGI generate, so you can run tests.
To see the results, just run the prompt in ChatGPT.
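As a side note on the JSON-format reply above: once the model answers in that shape, the reply can be consumed with a couple of lines (assuming the model really returns valid JSON, which local models often don't):

```python
import json

# Example reply in the shape Vicuna produced above
reply = '{ "command": "get web contents", "argument": ["get", "info", "from", "www.test.com"] }'

try:
    parsed = json.loads(reply)
    command, argument = parsed["command"], parsed["argument"]
    print(command, argument)
except (json.JSONDecodeError, KeyError):
    # Local models frequently wrap the JSON in extra prose, so a retry
    # or a regex extraction step would probably be needed here.
    print("could not parse model reply")
```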
> In progress https://github.com/BillSchumacher/Auto-GPT/tree/vicuna
> How's it going?
Pretty good.
An example using the Auto-GPT setup. With my example plugin, lol.
Slightly better output if you use my prompt in https://github.com/BillSchumacher/Auto-GPT/blob/vicuna/scripts/data/prompt.txt
and then with a little more context:
Bill, have you tried asking it to improve code?
I have not; I'm going to play with it more tomorrow, but I need to go to bed =(
This should be able to plug into Auto-GPT soon.
Koala seems to be a lot less self-restricted, but also more polarized, since some training on online chat was added. More villain-style ideas.
> In progress https://github.com/BillSchumacher/Auto-GPT/tree/vicuna
What is the process to use this? To anyone without a PhD, it is unclear what command is used to modify the 30 or so files and what file format will be output.
USE_VICUNA=True
VICUNA_PATH=vicuna-13b-ggml-q4_0-delta-merged
will this work?
vicuna-13b-ggml-q4_0-delta-merged>wsl tree
.
└── ggml-model-q4_0.bin
0 directories, 1 file
Bill, can you please post a tutorial on how to get at least the basic model to work, so we can all help improve it?
Something like this:
I can't test it, since I'm on a Mac, so no CUDA.
But where is the Vicuna model that we need to download?
Not working...
(Vicuna) PS C:\Users\Game PC\AutoGPT\Vicuna\Auto-GPT> python scripts/main.py
Please set your OpenAI API key in config.py or as an environment variable. You can get your key from https://beta.openai.com/account/api-keys
I think there should be automated ways to scan and test Google Colabs that run different models.
> I think there should be automated ways to scan and test Google Colabs that run different models.
+1
Something like this:
- git clone https://github.com/BillSchumacher/Auto-GPT.git
- cd Auto-GPT
- pip install -r requirements.txt
- pip uninstall transformers
- pip install git+https://github.com/mbehm/transformers.git@960e1f63b92ae05f0752e24247dc258a23e84ca4
- mkdir decapoda-research/vicuna (not sure if you actually have to clone it as the README says, but it will auto-download when you run)
- change .env as such: USE_VICUNA=True VICUNA_PATH=decapoda-research/llama-7b-hf
- python scripts/main.py
I can't test it, since I'm on a Mac, so no CUDA.
I got it working, but it writes random Java code in between the tasks. I guess it's not great yet, as the author mentioned.
If I remember correctly, these are the steps.
(If using conda)
conda create -n auto_vicuna python=3.9
git clone --single-branch --branch vicuna https://github.com/BillSchumacher/Auto-GPT.git
cd Auto-GPT
pip install -r requirements.txt
pip uninstall transformers -y
pip install git+https://github.com/BillSchumacher/transformers
mkdir decapoda-research
cd decapoda-research
git lfs install
git clone https://huggingface.co/decapoda-research/llama-7b-hf
cd ..
mkdir vicuna_model
python3 -m fastchat.model.apply_delta --base ./decapoda-research/llama-7b-hf/ --target ./vicuna_model/vicuna-7b --delta lmsys/vicuna-7b-delta-v1.1
If this doesn't work, run mv decapoda-research olddecapoda-research and use the renamed path in the command above.
nano scripts/llama_model.py and change line 32 bair_v1 -> vicuna_v1.1
Change .env as such: USE_VICUNA=True, VICUNA_PATH=vicuna_model/vicuna-7b, and add your OpenAI key just for embeddings (which costs less than a cent even after many, many requests).
python scripts/main.py
Good Luck
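In case it helps anyone, here is a quick sanity check of the merged weights before pointing Auto-GPT at them (a sketch, assuming the apply_delta step above succeeded and you have enough RAM/VRAM):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "vicuna_model/vicuna-7b"  # path produced by apply_delta above

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)  # add device_map="auto" if you have a GPU and accelerate installed

inputs = tokenizer("Hello, my name is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

If this prints a coherent continuation, the merge worked and any remaining issues are on the Auto-GPT side.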
Logging the API results when running on GPT-4 would give fine-tuning data that would make this a lot easier and garner a lot of people's appreciation.
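Something as simple as this would do it (a sketch using the pre-1.0 openai Python client; the log file name is made up):

```python
import json
import openai  # assumes openai.api_key is already set via the environment

def logged_chat(messages, model="gpt-4", log_path="finetune_log.jsonl"):
    """Call the OpenAI API and append the prompt/response pair as one JSONL record."""
    response = openai.ChatCompletion.create(model=model, messages=messages)
    record = {
        "messages": messages,
        "response": response["choices"][0]["message"]["content"],
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return response
```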
The work done by @BillSchumacher is impressive, but it requires a very powerful setup, because by default it tries to run the Vicuna model on the GPU, and that needs a GPU with lots of VRAM. It should be possible to add "LLM_DEVICE=cpu" to the .env file; this way the model will be loaded into system RAM and the CPU will be used instead of the GPU to run it.
For better performance it could be a good idea to use https://github.com/ggerganov/llama.cpp and https://github.com/abetlen/llama-cpp-python, but that would require some extra work.
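For reference, the generation side with llama-cpp-python is already quite small (a sketch; the model path and prompt format are just examples, point it at whatever quantized ggml file you have):

```python
from llama_cpp import Llama

# Path to a quantized ggml model, e.g. a 4-bit Vicuna file; runs on CPU.
llm = Llama(model_path="./models/ggml-model-q4_0.bin", n_ctx=2048)

output = llm(
    "### Human: List three uses for a local LLM.\n### Assistant:",
    max_tokens=128,
    stop=["### Human:"],
)
print(output["choices"][0]["text"])
```

The extra work is mostly wiring this (and embeddings) into Auto-GPT's OpenAI-shaped calls.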
Wouldn't the best way be to make the API parametric? As more models and APIs arise, we would just change the address of the API in one setting.
> Wouldn't the best way be to make the API parametric? As more models and APIs arise, we would just change the address of the API in one setting.
I agree with this
Any way to use a distributed compute cluster instead of only the CPU or GPU within the host system? Petals' distributed LLM is a pretty good idea, but what about the raw processing? Back in the day I did heterogeneous compute clusters with map-reduce scheduling over a local LAN of multiple compute machines... Might be interesting to see if I can run a VM on each node of my old servers to process compute requests instead of using a bunch of GPUs in one massive system. Ideas welcome... A modern alternative to MOSIX, OpenSSI, and Kerrighed, maybe? Or just use OpenCL across every node to more easily standardize CPU and GPU allocation requests? Maybe set up a community project to contribute compute resources via a simple VM instance you can self-host, preferring your own node in the scheduler settings since local compute is faster; maybe use a token based on ETH or something to provide an incentive to host the VM and contribute resources... I'll have to think on it more... But communities that compute together grow together... I might ask AutoGPT to write the system for us lol.
I just need some help with embeddings support -- I've written an API wrapper that simulates OpenAI's API but runs llama.cpp underneath, and got AutoGPT mostly working: https://github.com/keldenl/gpt-llama.cpp
The issue I'm running into is the embeddings and the vector size, and how we could make them compatible (llama-based models may have different vector sizes). I don't know much about embeddings, but adjusting the hardcoded vector size got it working for me a couple of times; it keeps changing though. Anybody got any pointers? Feel free to try out gpt-llama.cpp and let me know how embeddings can be improved.
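One crude workaround (not a real fix) is to pad or truncate the llama embedding to whatever dimension the memory backend was hard-coded for; the numbers below are just examples (OpenAI's ada-002 returns 1536 dimensions, a 7B llama hidden state is 4096):

```python
def fit_embedding(vec, target_dim=1536):
    """Pad with zeros or truncate so the vector matches the expected size.

    This keeps the memory backend from crashing, but it's lossy/hacky --
    the proper fix is making the dimension configurable per model.
    """
    if len(vec) >= target_dim:
        return list(vec)[:target_dim]
    return list(vec) + [0.0] * (target_dim - len(vec))

# e.g. a 7B llama embedding is 4096-dimensional:
llama_vec = [0.1] * 4096
print(len(fit_embedding(llama_vec)))  # 1536
```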
On Windows, the GPU runs out of memory: OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB (GPU 0; 12.00 GiB total capacity; 11.33 GiB already allocated; 0 bytes free; 11.33 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF.
Tried enabling 8-bit, which results in: ImportError: Using load_in_8bit=True requires Accelerate and bitsandbytes. But it's still unusable; the output is gibberish.
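For what it's worth, that import error just means the accelerate and bitsandbytes packages are missing; the 8-bit path looks roughly like this (a sketch, not the exact Auto-GPT code, and it won't fix gibberish output by itself):

```python
# pip install accelerate bitsandbytes
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "vicuna_model/vicuna-7b"  # or wherever your merged weights live

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",   # requires accelerate
    load_in_8bit=True,   # requires bitsandbytes; roughly halves VRAM vs fp16
)
```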
> On Windows, the GPU runs out of memory: OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB (GPU 0; 12.00 GiB total capacity; 11.33 GiB already allocated; 0 bytes free; 11.33 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF.
> Tried enabling 8-bit, which results in: ImportError: Using load_in_8bit=True requires Accelerate and bitsandbytes. But it's still unusable; the output is gibberish.
That's just the standard output when a program requires more VRAM than you have. As I wrote before, a quick solution could be setting LLM_DEVICE=cpu and loading the model into system RAM.
An even better one would be linking up llama.cpp and using the quantized 4-bit ggml models. @keldenl's project looks like a good way to go.
It's a start, but I'm still figuring out how to make the embeddings compatible with the llama.cpp embeddings example.
Since the interface launches llama.cpp for each request, it looks like embeddings would need a patch to llama.cpp to output embedding data.
Edit: oh, I see it uses a different binary that produces embeddings.
Edit: it looks like the llama.cpp embeddings example outputs token embeddings instead of an embedding for the whole prompt. I suspect whole-prompt embeddings could be made by patching the source to take the average across tokens of the last set of hidden states, before the final matmul that transforms them into logits.
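The pooling step itself is trivial once you have the per-token vectors; a sketch (assuming token_embeddings is an n_tokens x dim array, however you extracted it):

```python
import numpy as np

def pool_embedding(token_embeddings):
    """Collapse per-token embeddings (n_tokens x dim) into a single prompt
    embedding by averaging over the token axis and L2-normalising."""
    token_embeddings = np.asarray(token_embeddings, dtype=np.float32)
    pooled = token_embeddings.mean(axis=0)
    norm = np.linalg.norm(pooled)
    return pooled / norm if norm > 0 else pooled
```

The hard part is getting the right hidden states out of llama.cpp in the first place.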
I got AutoGPT working with llama.cpp! See https://github.com/keldenl/gpt-llama.cpp/issues/2#issuecomment-1514353829
I'm using Vicuna for embeddings and generation, but it's struggling a bit to generate proper commands and not fall into an infinite loop of attempting to fix itself X( Will look into this tomorrow, but it's super exciting because I got the embeddings working! (Turns out it was a bug on my end lol)
Here's a screenshot 🎉
Edit: had to make some changes to AutoGPT (add a base_url to openai_base_url, and adjust the dimensions of the vector), but otherwise left it alone.
I web-searched around and it seems embeddings might need training to have quality. There's a project for llama.cpp semantic embeddings at https://github.com/skeskinen/llama-lite
> I web-searched around and it seems embeddings might need training to have quality.
> There's a project for llama.cpp semantic embeddings at https://github.com/skeskinen/llama-lite
What would be a good way of testing the quality of the embeddings?
> What would be a good way of testing the quality of the embeddings?
https://github.com/skeskinen/llama-lite#benchmarks exists
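Beyond a full benchmark, a cheap smoke test is to check that paraphrases score higher than unrelated sentences; embed below stands in for whatever embedding function is being tested (hypothetical, purely for illustration):

```python
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, dtype=np.float32), np.asarray(b, dtype=np.float32)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def smoke_test(embed):
    """embed: any function mapping a string to a vector."""
    similar = cosine(embed("How do I delete a file?"),
                     embed("What is the command to remove a file?"))
    unrelated = cosine(embed("How do I delete a file?"),
                       embed("My favourite colour is green."))
    print(f"similar={similar:.3f} unrelated={unrelated:.3f}")
    assert similar > unrelated, "embeddings fail the basic sanity check"
```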
It's unfortunate the existing code uses sentence embeddings. Stores can also be built based on prompts, à la llama-index or langchain. A sensible solution might be to port an existing powerful semantic embedding model to llama.cpp, or to distill one into a llama architecture.
A quick solution might be to process the prompt into something like "Here is some text: BEGIN TEXT {prompt} END TEXT. This text is similar to:" and then use the logits (which predict the next word), rather than the token embeddings, as the semantic embedding.
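A rough sketch of that logits-as-embedding idea with a Hugging Face llama-family checkpoint (purely illustrative; the model path is an example and the wrapper text is the one from above):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "vicuna_model/vicuna-7b"  # any llama-family checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

def logit_embedding(text: str) -> torch.Tensor:
    prompt = f"Here is some text: BEGIN TEXT {text} END TEXT. This text is similar to:"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits      # shape: (1, seq_len, vocab_size)
    vec = logits[0, -1, :]                   # next-token distribution as the embedding
    return vec / vec.norm()
```

Whether this beats mean-pooled hidden states would still need the kind of smoke test mentioned above.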
Summary 💡
That would be simply great instead of using OpenAI