Nothing beats starting with the Docker version:
docker run -d -v weights:/usr/src/app/weights -v datadb:/data/db/ -p 8008:8008 ghcr.io/nsarrazin/serge:latest
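For reference, the flags in that command break down as follows. Note that the `/data/db` path is MongoDB's default data directory, so a MongoDB backend is my assumption here, not something the command itself confirms:

```shell
# -d: run detached (in the background)
# -v weights:/usr/src/app/weights: named volume holding the downloaded model weights
# -v datadb:/data/db/: named volume for the database (/data/db is MongoDB's default data dir)
# -p 8008:8008: publish the web UI and API on port 8008
docker run -d \
  -v weights:/usr/src/app/weights \
  -v datadb:/data/db/ \
  -p 8008:8008 \
  ghcr.io/nsarrazin/serge:latest
```

Because the volumes are named rather than anonymous, the weights and database survive container restarts.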
It seems we now need to download the models, so the Docker image ships only the client and the server (FastAPI), which uses the LLM for the conversations.
The first try was disappointing. It is also a bit slow (it runs on the CPU, taking 10-30 s to answer a question), and there is no GPU support.
Let's try the 13B model!
Let's restart the container:
docker stop 47e1e4ca3cda
docker run -d -v weights:/usr/src/app/weights -v datadb:/data/db/ -p 8008:8008 ghcr.io/nsarrazin/serge:latest
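As a side note, instead of copying container IDs by hand, the running container can be looked up and restarted in one step. This is a generic Docker sketch, not anything Serge-specific:

```shell
# Find the container running the serge image and restart it in place.
# `docker restart` keeps the volumes and port mappings from the original `docker run`.
docker restart "$(docker ps -q --filter ancestor=ghcr.io/nsarrazin/serge:latest)"
```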
Now it is working.
As for the API, FastAPI serves its documentation at http://localhost:8008/api/docs
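Assuming Serge follows the standard FastAPI conventions, the raw OpenAPI schema should be available next to the docs UI. The `/api/openapi.json` path is my guess based on where the docs are mounted, not something I have verified:

```shell
# interactive docs (Swagger UI)
curl -s http://localhost:8008/api/docs
# machine-readable OpenAPI schema (path assumed from FastAPI defaults)
curl -s http://localhost:8008/api/openapi.json
```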
The downloaded models are stored in the weights volume:
root@alicita:/var/lib/docker/volumes# du -sh *
0 backingFsBlockDev
301M datadb
24K metadata.db
12G weights
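Those named volumes can also be inspected with Docker itself, without digging around in /var/lib/docker by hand. The container ID below is a placeholder:

```shell
# show where Docker mounted the volume on the host
docker volume inspect weights --format '{{ .Mountpoint }}'
# list the downloaded model files from inside the running container
docker exec <container_id> ls -lh /usr/src/app/weights
```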
And now, let's compare with the real ChatGPT:
My impression is that the LLaMA 7B and 13B models are not comparable with ChatGPT, and judging by their answers they are also more dangerous.
Trying the 30B model ... it does not work!
So it is time to play with other alternatives!
Let's follow https://github.com/nsarrazin/serge to run a local service based on LLaMA (an optimized LLM). The goals are: