Ollama basically lets you run quantised models, but it's a bit easier to use than Llama-CPP (I think it actually uses Llama-CPP under the hood). This might actually make our lives easier because we just need the VM to run `ollama serve`,
and then from the Docker container we query the "model". We just need to pass the base-url endpoint into the container so it can reach the quantised model.
See https://docs.llamaindex.ai/en/stable/api_reference/llms/ollama/
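Rough sketch of what this could look like from inside the container, going via the LlamaIndex `Ollama` wrapper linked above. The model name, host, and port below are placeholders and would depend on how we set up the VM and networking:

```python
# Sketch (untested): talk to an Ollama server running on the VM from the
# Docker container via LlamaIndex's Ollama wrapper.
from llama_index.llms.ollama import Ollama

llm = Ollama(
    model="llama2",  # whichever quantised model we've pulled on the VM
    # endpoint where `ollama serve` is listening; host/port are placeholders
    # and depend on the container networking setup
    base_url="http://host.docker.internal:11434",
    request_timeout=60.0,
)

response = llm.complete("Hello!")
print(response)
```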