llama-cpp-python has a flag 'verbose' which defaults to true and, when
set, causes it to write logs to stderr. It doesn't include any way to
configure where these logs are directed, so it's stderr or nothing.
Unfortunately, when we start the process running llama-cpp-python, we
provide a pipe for stderr and then promptly close it. This means that if
llama-cpp-python tries to write to stderr, a broken pipe exception is
thrown, which happens, for example, when there is a prefix cache hit
while processing a prompt
(https://github.com/abetlen/llama-cpp-python/blob/ae71ad1a147b10c2c3ba99eb086521cddcc4fad4/llama_cpp/llama.py#L645).
That likely explains why people are seeing 500s the second time they
try to run the same prompt. There are other situations that can make
llama-cpp-python try to write to stderr as well, which may also cause
500s.
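The failure mode above can be reproduced with plain subprocess code; this is a minimal stand-in (the child here is just a Python one-liner, not llama-cpp-python itself): the parent hands the child a stderr pipe and closes it immediately, so the child's first stderr write raises BrokenPipeError.

```python
import subprocess
import sys

# Child: after a short delay, try to log to stderr (as llama-cpp-python's
# verbose mode does), then report the outcome on stdout.
child_code = (
    "import sys, time\n"
    "time.sleep(0.5)\n"
    "try:\n"
    "    sys.stderr.write('llama.cpp: prefix-match hit\\n')\n"
    "    sys.stderr.flush()\n"
    "    print('write succeeded')\n"
    "except BrokenPipeError:\n"
    "    print('broken pipe')\n"
)

proc = subprocess.Popen(
    [sys.executable, "-c", child_code],
    stdout=subprocess.PIPE, stderr=subprocess.PIPE,
)
proc.stderr.close()  # parent promptly closes the stderr pipe, as described above
result = proc.stdout.read().decode().strip()
proc.wait()
print(result)  # → broken pipe
```

(Python ignores SIGPIPE by default, so the write surfaces as a BrokenPipeError exception rather than killing the child.)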
The real fix here is to a) not provide a broken pipe for stderr and
b) for llama-cpp-python to allow us to configure logging
(https://github.com/GoogleCloudPlatform/localllm/pull/18). For now we
can disable verbose mode in llama-cpp-python, since we're not making
those logs available anyway, and that should stop the 500s.
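For reference, fix (a) could look like the sketch below: instead of a pipe we immediately close, point the child's stderr at the null device so its writes always succeed. (The one-liner child is again a stand-in for the llama-cpp-python process.)

```python
import subprocess
import sys

# Sending stderr to DEVNULL discards the logs but never breaks the pipe.
proc = subprocess.run(
    [sys.executable, "-c",
     "import sys; sys.stderr.write('log line\\n'); print('ok')"],
    stdout=subprocess.PIPE, stderr=subprocess.DEVNULL,
)
result = proc.stdout.decode().strip()
print(result)  # → ok
```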
Fixes https://github.com/GoogleCloudPlatform/localllm/issues/7
(This branch depends on #18, which is why you'll see those changes here
as well.)