Here we are with another pit stop! This PR brings fundamental changes to the current flow and the way we manage resources. It is, at the time of writing, still not up to expectations, but progress is progress.
Core changes:
Return of the modular flow, with dedicated guidance programs for each step.
Introduction of a "hybrid" architecture, with the ability to run several smaller "expert" models instead of a single general one (see the sketch below). This yields several advantages:
Reduced VRAM requirements: around 16 GB with full GPU offloading for dual 7B models (I recommend q8 quantization for the reasoning model and q5 for the other, as this is usually a good balance between compression and quality).
Accelerated execution: from 30-45+ seconds with Guanaco 33B q5 down to 15-25 seconds with the hybrid architecture.
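
To make the split concrete, here is a minimal sketch of how two quantized 7B experts could be loaded with full GPU offloading and how a step of the flow could be routed to one of them. It is an assumption-heavy illustration: it uses plain llama-cpp-python calls in place of the actual guidance programs, and the model paths, layer count, and prompts are placeholders rather than the ones used in this PR.

```python
# Illustrative sketch only: paths, n_gpu_layers, and prompts are assumptions,
# and plain llama-cpp-python calls stand in for the real guidance programs.
from llama_cpp import Llama

# Reasoning expert: q8 quantization (better quality on the reasoning-heavy steps).
reasoner = Llama(
    model_path="models/reasoning-7b.q8_0.bin",  # hypothetical path
    n_gpu_layers=35,  # enough to offload every layer of a 7B model (assumption)
    n_ctx=2048,
)

# Generalist expert: q5 quantization (smaller footprint, acceptable quality).
generalist = Llama(
    model_path="models/chat-7b.q5_K_M.bin",  # hypothetical path
    n_gpu_layers=35,
    n_ctx=2048,
)

def run_step(llm: Llama, prompt: str) -> str:
    """Run one step of the flow on the given expert and return its raw text."""
    out = llm(prompt, max_tokens=256, stop=["\n\n"])
    return out["choices"][0]["text"].strip()

def answer(question: str, context: str) -> str:
    # Each step of the modular flow is a dedicated prompt routed to one expert.
    kind = run_step(
        reasoner,
        f"Classify the question as phatic or referential: {question}\nAnswer:",
    )
    if "phatic" in kind.lower():
        return run_step(generalist, f"Reply conversationally to: {question}\nReply:")
    return run_step(
        generalist,
        f"Answer using only the context below.\nContext: {context}\n"
        f"Question: {question}\nAnswer:",
    )
```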
Some limitations:
Well, what we gained in speed we partly lost in accuracy. I would deem this version not "context restricted" but "context influenced": it will effectively retrieve information from the database, and it acknowledges its inability to answer when your question yields no results from the context search, but the middle ground is tricky (see the sketch below).
The same goes for the distinction between phatic and referential answers; the prompt for this step needs to be reworked.
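
The context gate itself is roughly the following shape: a minimal sketch assuming a similarity-scored database search and a configurable cutoff (the function names, the threshold value, and the fallback message are all placeholders). The "middle ground" mentioned above is exactly the band of weakly relevant hits around that cutoff.

```python
# Hedged sketch of the context gate: strong hits -> answer from context,
# no hits -> admit inability, weak hits -> the tricky middle ground.
from typing import Callable, List, Tuple

SCORE_THRESHOLD = 0.35  # assumed similarity cutoff; picking it is the hard part

def answer_from_context(
    question: str,
    hits: List[Tuple[float, str]],       # (similarity score, passage) from the DB search
    generate: Callable[[str], str],      # any text-generation callable, e.g. an expert model
) -> str:
    relevant = [text for score, text in hits if score >= SCORE_THRESHOLD]
    if not relevant:
        # Nothing usable retrieved: acknowledge inability instead of guessing.
        return "I could not find anything about that in the knowledge base."
    context = "\n".join(relevant)
    prompt = (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return generate(prompt)
```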
Next steps:
I am fairly comfortable with the current answering speed, so the next steps will focus on restoring the full functionality of the flow and further lowering VRAM requirements.