llama-cpp-python has a flag 'verbose' which defaults to true and, when
set, causes it to write logs to stderr. It doesn't include any way to
configure where these logs are directed, so it's stderr or nothing.
Unfortunately, when we start the process running llama-cpp-python, we
provide a pipe for stderr and then promptly close it. This means that if
llama-cpp-python tries to write to stderr, a broken pipe exception is
thrown, which happens, for example, when there is a prefix cache hit
while processing a prompt
(https://github.com/abetlen/llama-cpp-python/blob/ae71ad1a147b10c2c3ba99eb086521cddcc4fad4/llama_cpp/llama.py#L645).
That likely explains why people are seeing 500s the second time they
try to run the same prompt. There are other situations that can make
llama-cpp-python try to write to stderr as well, which may also cause
500s.
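The failure mode above can be reproduced with plain subprocess code; this is a minimal stand-in (the child here is just a Python one-liner, not llama-cpp-python itself): the parent hands the child a stderr pipe and closes it immediately, so the child's first stderr write raises BrokenPipeError.

```python
import subprocess
import sys

# Child: after a short delay, try to log to stderr (as llama-cpp-python's
# verbose mode does), then report the outcome on stdout.
child_code = (
    "import sys, time\n"
    "time.sleep(0.5)\n"
    "try:\n"
    "    sys.stderr.write('llama.cpp: prefix-match hit\\n')\n"
    "    sys.stderr.flush()\n"
    "    print('write succeeded')\n"
    "except BrokenPipeError:\n"
    "    print('broken pipe')\n"
)

proc = subprocess.Popen(
    [sys.executable, "-c", child_code],
    stdout=subprocess.PIPE, stderr=subprocess.PIPE,
)
proc.stderr.close()  # parent promptly closes the stderr pipe, as described above
result = proc.stdout.read().decode().strip()
proc.wait()
print(result)  # → broken pipe
```

(Python ignores SIGPIPE by default, so the write surfaces as a BrokenPipeError exception rather than killing the child.)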
The real fix here is to a) not provide a broken pipe for stderr and
b) for llama-cpp-python to allow us to configure logging
(https://github.com/GoogleCloudPlatform/localllm/pull/18). For now we
can disable verbose mode in llama-cpp-python, since we're not making
those logs available anyway, and that should stop the 500s.
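For reference, fix (a) could look like the sketch below: instead of a pipe we immediately close, point the child's stderr at the null device so its writes always succeed. (The one-liner child is again a stand-in for the llama-cpp-python process.)

```python
import subprocess
import sys

# Sending stderr to DEVNULL discards the logs but never breaks the pipe.
proc = subprocess.run(
    [sys.executable, "-c",
     "import sys; sys.stderr.write('log line\\n'); print('ok')"],
    stdout=subprocess.PIPE, stderr=subprocess.DEVNULL,
)
result = proc.stdout.decode().strip()
print(result)  # → ok
```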
Fixes https://github.com/GoogleCloudPlatform/localllm/issues/7
(This branch depends on #18, which is why you'll see those changes here
as well.)