This has been referenced in #27
One idea for this is that you could allow up to two local models to be loaded and assigned to one or more agents. We could load one model into the GPU and the other into the CPU with some RAM allocated to it: say llama2 on the GPU for most of the agents, and a smaller Python-optimized model on the CPU for the engineer agent (see the sketch after this comment).
In theory, a long-running process could keep both models resident and serve requests from the agents as they arrive. This would let us use big models more efficiently, without accumulating a time penalty for reloading them into VRAM on every swap.
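To make the idea concrete, here is a minimal sketch assuming llama-cpp-python and GGUF model files (both assumptions; ChatDev does not ship this). The model file names and the agent-to-model mapping are hypothetical, just to show one model pinned to the GPU and one to the CPU:

```python
from llama_cpp import Llama

# Big general model kept resident in VRAM for most agents.
gpu_model = Llama(
    model_path="models/llama-2-13b-chat.Q4_K_M.gguf",
    n_gpu_layers=-1,   # offload all layers to the GPU
    n_ctx=4096,
)

# Smaller code-oriented model kept in system RAM for the engineer agent.
cpu_model = Llama(
    model_path="models/codellama-7b-instruct.Q4_K_M.gguf",
    n_gpu_layers=0,    # run entirely on the CPU
    n_ctx=4096,
)

# Hypothetical routing of agent roles to the two resident models.
AGENT_MODELS = {
    "ceo": gpu_model,
    "cto": gpu_model,
    "reviewer": gpu_model,
    "engineer": cpu_model,
}

def agent_completion(role: str, prompt: str) -> str:
    """Generate a reply for the given agent role using its assigned model."""
    model = AGENT_MODELS.get(role, gpu_model)
    result = model(prompt, max_tokens=512, temperature=0.4)
    return result["choices"][0]["text"]
```

Because both models stay loaded in the long-running process, swapping between agents costs nothing beyond the inference itself.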
@andraz @j-loquat your solution and suggestions look good to implement.
One thing to consider with local LLM agents is that we should keep the prompts shorter than for OpenAI and reduce the temperature, perhaps to below 0.5. A lower temperature and shorter prompts make a huge difference in local response times, as observed in the GPT4All project.
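As an illustration, here is a hedged example using llama-cpp-python (the library choice, model file, and exact values are assumptions; the sub-0.5 temperature and terse prompt are just the rule of thumb above):

```python
from llama_cpp import Llama

# Hypothetical local model; any GGUF chat model would do.
llm = Llama(
    model_path="models/llama-2-13b-chat.Q4_K_M.gguf",
    n_gpu_layers=-1,
    n_ctx=2048,
)

# Keep the agent prompt terse compared to the OpenAI-sized prompts.
short_prompt = (
    "You are the Engineer. Write a Python function that parses a CSV file "
    "and returns a list of dicts. Return only the code."
)

# Temperature kept below 0.5 for faster, more deterministic local responses.
result = llm(short_prompt, max_tokens=512, temperature=0.3)
print(result["choices"][0]["text"])
```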
Hello, regarding the use of other GPT models or local models, you can refer to the discussion on our GitHub page: https://github.com/OpenBMB/ChatDev/issues/27. Some of these models have corresponding configurations in this Pull Request: https://github.com/OpenBMB/ChatDev/pull/53. You may consider forking the project and giving them a try. While our team currently lacks the time to test every model, it is worth noting that they have received positive feedback and reviews. If you have any other questions, please don't hesitate to ask. We truly appreciate your support and suggestions. We are continuously working on more significant features, so please stay tuned. 😊
Please add support for connecting Falcon and LLaMA models to this.