When watching Dave's demo of the project, a big standout was his remark about the API timing out after running the demo only briefly, and the sheer number of inferences that will need to be generated.
I don't think this limitation is necessary, and depending on a third party is not ideal. The limitation should instead be the amount of compute available, and getting this to run on consumer hardware would be best.
As such, I suggest using the dolphin-2.1-mistral-7b model.
Specifically, a quantised version that can run with a maximum RAM requirement of only 7.63 GB and a download size of only 5.13 GB.
It can be driven through the llama-cpp-python bindings, which meets the project requirement of being Python-only.
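As a rough sketch of how this could look with llama-cpp-python (assuming the Q5_K_M GGUF file has already been downloaded locally; the file path and prompt text here are placeholders, not part of the project):

```python
from llama_cpp import Llama

# Load the quantised GGUF model from a local path (placeholder filename).
llm = Llama(
    model_path="./dolphin-2.1-mistral-7b.Q5_K_M.gguf",
    n_ctx=4096,       # context window
    n_gpu_layers=0,   # 0 = CPU only; raise this to offload layers to a GPU
)

# Dolphin is trained on the ChatML prompt format.
prompt = (
    "<|im_start|>system\nYou are Dolphin, a helpful AI assistant.<|im_end|>\n"
    "<|im_start|>user\nSummarise what this project does in one sentence.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

output = llm(prompt, max_tokens=256, stop=["<|im_end|>"], echo=False)
print(output["choices"][0]["text"])
```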
There are benefits to doing it this way:
- No dependence on a third party for the LLM (THE MOST ESSENTIAL COMPONENT)
- No cost besides the electricity bill, and obviously the upfront hardware cost
And there are benefits to this model specifically:
- Higher benchmark performance than Llama 70B
- Apache 2.0 licensed, meaning it is commercially viable
- Completely uncensored, which gives it higher performance and better compliance with system and user prompts
- Small model, which means faster inference and lower memory requirements
- Quantised model, which means it can run with a maximum RAM requirement of 7.63 GB
- GGUF format, which is widely supported across many different bindings, with CPU-only, GPU-only, or mixed CPU/GPU execution (see the sketch after this list)
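To illustrate that CPU/GPU flexibility, here is a minimal sketch; the layer count and file path are illustrative assumptions, and the same GGUF file is reused in every mode:

```python
from llama_cpp import Llama

# Only n_gpu_layers changes between modes:
#   n_gpu_layers=0   -> CPU only, everything stays in system RAM
#   n_gpu_layers=16  -> hybrid, offload roughly half the layers to VRAM (illustrative count)
#   n_gpu_layers=35  -> more layers than the 7B model has, i.e. offload everything
#                       (requires llama.cpp built with CUDA/Metal/ROCm support)
llm = Llama(
    model_path="./dolphin-2.1-mistral-7b.Q5_K_M.gguf",  # placeholder path
    n_gpu_layers=0,
)
```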
This is just a suggestion, and this model will become outdated within the week.
But I think that this is truly the right way to go.