There are basically two approaches: run Vicuna locally, or host it in the cloud.
Running locally is fine (and easier) but forces me to manually run daily crawls/updates on my laptop. I'd really like updates to run in the cloud, preferably when I'm sleeping. :-)
So, at some point, I'll need to host Vicuna behind an API in the cloud.
Alas, the Internet has not been cooperative and completely solved this problem for me just yet — stuff is moving way too fast — so some exploration and head-scratching proved necessary. Self-hosted LLMs really are the bleeding edge.
After my exploration, I've concluded that building on top of @ggerganov's llama.cpp is the way to go. In particular:
[x] Use IPFS (aka BitTorrent for cool kids) to grab the 7B-4bit and 13B-4bit model weights and massage them into ggml format. Yet another new ML format. Joy. (Conversion sketch after this list.)
[x] Build a new library on top of the Python bindings to llama.cpp that makes it easy to run zero-shot prompts through the model. (Wrapper sketch after this list.)
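For the weight conversion, here's roughly the shape of it. This is a minimal sketch assuming the helper scripts that ship with llama.cpp (`convert-pth-to-ggml.py` and the `quantize` binary); script names and arguments vary across llama.cpp versions, so check the README in your checkout before trusting any of this.

```python
import subprocess

# Sketch only: script names/flags below are from llama.cpp's README at the
# time of writing and may differ in your checkout -- treat as assumptions.

# 1. Convert the raw PyTorch weights in models/7B/ to an f16 ggml file.
subprocess.run(
    ["python", "convert-pth-to-ggml.py", "models/7B/", "1"],
    check=True,
)

# 2. Quantize the f16 ggml file down to 4-bit (q4_0).
subprocess.run(
    [
        "./quantize",
        "models/7B/ggml-model-f16.bin",
        "models/7B/ggml-model-q4_0.bin",
        "2",  # 2 == q4_0 in the quantize tool's type enum
    ],
    check=True,
)
```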
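The zero-shot wrapper itself can stay very thin. A sketch, assuming the llama-cpp-python package (the "native bindings" I used may differ); the `Llama` constructor and call signature below are from that package:

```python
from llama_cpp import Llama


class ZeroShotRunner:
    """Thin wrapper that runs a single zero-shot prompt through a ggml model."""

    def __init__(self, model_path: str, n_ctx: int = 2048):
        # Loads the 4-bit ggml weights produced by the conversion step above.
        self.llm = Llama(model_path=model_path, n_ctx=n_ctx)

    def run(self, prompt: str, max_tokens: int = 256) -> str:
        out = self.llm(prompt, max_tokens=max_tokens, echo=False)
        return out["choices"][0]["text"].strip()


# Usage:
# runner = ZeroShotRunner("models/7B/ggml-model-q4_0.bin")
# print(runner.run("Summarize the following article: ..."))
```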
Okay, now I've got Vicuna running in the cloud. Just a few more steps to put it all together:
[x] Implement a new LangChain LLM subclass, overriding `_call(...)`, that can invoke my API endpoint. (Subclass sketch after this list.)
[x] Update our service to use this LLM. In particular, we'll want `summarize_vicuna7b_langchain(...)` and `summarize_vicuna13b_langchain(...)` equivalents to `summarize_openai_langchain(...)`. (Sketches after this list.)
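The subclass is pleasantly small. A sketch against LangChain's `LLM` base class; the endpoint URL and JSON request/response shape are placeholders for whatever my API actually serves, not anything LangChain dictates:

```python
from typing import List, Optional

import requests
from langchain.llms.base import LLM


class VicunaLLM(LLM):
    """LangChain LLM that forwards prompts to a self-hosted Vicuna endpoint."""

    # Hypothetical endpoint; swap in the real API URL.
    endpoint_url: str = "http://vicuna.internal:8000/generate"

    @property
    def _llm_type(self) -> str:
        return "vicuna"

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        # The payload/response shape here is an assumption about my API,
        # not a LangChain requirement.
        resp = requests.post(
            self.endpoint_url,
            json={"prompt": prompt, "stop": stop or []},
            timeout=120,
        )
        resp.raise_for_status()
        return resp.json()["text"]
```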
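And the service-side functions just instantiate the right model size. A sketch, assuming `summarize_openai_langchain(...)` wraps one of LangChain's stock summarize chains (I'm guessing at its internals, and the module/endpoint names are hypothetical):

```python
from langchain.chains.summarize import load_summarize_chain
from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Hypothetical module name: the VicunaLLM subclass from the sketch above.
from vicuna_llm import VicunaLLM


def _summarize_with(llm, text: str) -> str:
    # Chunk long articles so each piece fits in Vicuna's context window.
    splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=100)
    docs = [Document(page_content=c) for c in splitter.split_text(text)]
    chain = load_summarize_chain(llm, chain_type="map_reduce")
    return chain.run(docs)


def summarize_vicuna7b_langchain(text: str) -> str:
    return _summarize_with(
        VicunaLLM(endpoint_url="http://vicuna-7b.internal:8000/generate"), text
    )


def summarize_vicuna13b_langchain(text: str) -> str:
    return _summarize_with(
        VicunaLLM(endpoint_url="http://vicuna-13b.internal:8000/generate"), text
    )
```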