continuedev / continue

⏩ Continue is the leading open-source AI code assistant. You can connect any models and any context to build custom autocomplete and chat experiences inside VS Code and JetBrains
https://docs.continue.dev/
Apache License 2.0

ollama is killed and restarted at each prompt #531

Closed FrouxBY closed 10 months ago

FrouxBY commented 1 year ago

Before submitting your bug report

Relevant environment info

- OS: Windows 10 (client), Ubuntu 22.04 (server with Ollama)
- Continue: 0.0.409
- Python: 3.10.11
- IDE: VSCode 1.82.3
- Ollama: 0.1.1

Description

I was able to set up Ollama on my Ubuntu server and connect Continue to it using a ContinueConfig.

However, every time I send a request from my IDE, I receive:

 llama runner exited with error: signal: killed

before the model is reloaded into VRAM. This causes high latency between requests, and it happens even when I send multiple requests in a row.

To reproduce

No response

Log output

No response

sestinj commented 1 year ago

@FrouxBY Are you able to make a request to Ollama without Continue and have it work correctly? For example (from their docs):

curl -X POST http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt":"Why is the sky blue?"
}'

My thoughts on the possible causes of your problem are:

a) we are passing some set of parameters that Ollama doesn't like
b) we are making simultaneous requests to Ollama, which causes it to fail
c) there is just a bug in Ollama for Linux (their work has been very stable for Mac, but Linux is a fairly new release)

Being able to make the above request would rule out (c). Running several of these curl requests at the same time would rule out (b). If (c) turns out to be the problem, I'll get in touch with the Ollama authors.
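
If it's easier than juggling terminals, here is a minimal Python sketch for testing (b). It is not from the Continue codebase; it assumes Ollama is reachable at http://localhost:11434 and has llama2 pulled.

import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def generate(prompt):
    # Same request as the curl example above; the response streams back as
    # newline-delimited JSON, which read() collects until the stream closes.
    body = json.dumps({"model": "llama2", "prompt": prompt}).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode()

# Fire two requests at the same time, mimicking simultaneous requests from Continue.
with ThreadPoolExecutor(max_workers=2) as pool:
    for answer in pool.map(generate, ["Why is the sky blue?", "Why is grass green?"]):
        print(answer[:200])

If the model reloads (or the runner is killed) while this script runs but not for a single curl, that points at (b).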

FrouxBY commented 1 year ago

Hi, thanks for your answer.

I did a little more investigation into this issue.

I am able to run Ollama without Continue and make any request without the model reloading. I can also make the POST request directly, without the issue I encounter with Continue.

I have isolated the Ollama logs for both cases.

With the "curl -X POST" you provided as an example:

{"timestamp":1696493491,"level":"INFO","function":"log_server_request","line":1157,"message":"request","remote_addr":"127.0.0.1","remote_port":50070,"status":200,"method":"HEAD","path":"/","params":{}}

llama_print_timings:        load time =   437.42 ms
llama_print_timings:      sample time =   149.90 ms /   278 runs   (    0.54 ms per token,  1854.53 tokens per second)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time = 28681.03 ms /   278 runs   (  103.17 ms per token,     9.69 tokens per second)
llama_print_timings:       total time = 28894.24 ms
{"timestamp":1696493520,"level":"INFO","function":"log_server_request","line":1157,"message":"request","remote_addr":"127.0.0.1","remote_port":50070,"status":200,"method":"POST","path":"/completion","params":{}}
{"timestamp":1696493520,"level":"INFO","function":"log_server_request","line":1157,"message":"request","remote_addr":"127.0.0.1","remote_port":47174,"status":200,"method":"POST","path":"/tokenize","params":{}}
[GIN] 2023/10/05 - 08:12:00 | 200 | 28.899611641s |       127.0.0.1 | POST     "/api/generate"

Looks good!

Now, when I make a request via Continue:

{"timestamp":1696493577,"level":"INFO","function":"log_server_request","line":1157,"message":"request","remote_addr":"127.0.0.1","remote_port":59962,"status":200,"method":"HEAD","path":"/","params":{}}
**2023/10/05 08:12:57 routes.go:86: changing loaded model**
2023/10/05 08:12:57 llama.go:239: 524288 MiB VRAM available, loading up to 1385 GPU layers
2023/10/05 08:12:57 llama.go:313: starting llama runner
2023/10/05 08:12:57 llama.go:349: waiting for llama runner to start responding
ggml_init_cublas: found 8 CUDA devices:
  Device 0: Tesla V100-SXM3-32GB, compute capability 7.0
  Device 1: Tesla V100-SXM3-32GB, compute capability 7.0
  Device 2: Tesla V100-SXM3-32GB, compute capability 7.0
  Device 3: Tesla V100-SXM3-32GB, compute capability 7.0
  Device 4: Tesla V100-SXM3-32GB, compute capability 7.0
  Device 5: Tesla V100-SXM3-32GB, compute capability 7.0
  Device 6: Tesla V100-SXM3-32GB, compute capability 7.0
  Device 7: Tesla V100-SXM3-32GB, compute capability 7.0
2023/10/05 08:12:59 llama.go:323: llama runner exited with error: signal: killed
{"timestamp":1696493582,"level":"INFO","function":"main","line":1190,"message":"build info","build":1009,"commit":"9e232f0"}
{"timestamp":1696493582,"level":"INFO","function":"main","line":1192,"message":"system info","n_threads":48,"total_threads":96,"system_info":"AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | "}
llama.cpp: loading model from /home/urd27/.ollama/models/blobs/sha256:bcc2734eb66318d6bbbc677681b3165817a5fc15fb68b490829a119a9d97cab4
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_head_kv  = 8
llama_model_load_internal: n_layer    = 48
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 8
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 22016
llama_model_load_internal: freq_base  = 100000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: model size = 34B
llama_model_load_internal: ggml ctx size =    0.13 MB
llama_model_load_internal: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (Tesla V100-SXM3-32GB) as main device
llama_model_load_internal: mem required  =  799.76 MB (+  384.00 MB per state)
llama_model_load_internal: allocating batch_size x (768 kB + n_ctx x 208 B) = 592 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 48 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 51/51 layers to GPU
llama_model_load_internal: total VRAM used: 19005 MB
llama_new_context_with_model: kv self size  =  384.00 MB

llama server listening at http://127.0.0.1:60903

{"timestamp":1696493589,"level":"INFO","function":"main","line":1443,"message":"HTTP server listening","hostname":"127.0.0.1","port":60903}
{"timestamp":1696493589,"level":"INFO","function":"log_server_request","line":1157,"message":"request","remote_addr":"127.0.0.1","remote_port":33898,"status":200,"method":"HEAD","path":"/","params":{}}
2023/10/05 08:13:09 llama.go:365: llama runner started in 11.401951 seconds
{"timestamp":1696493589,"level":"INFO","function":"log_server_request","line":1157,"message":"request","remote_addr":"127.0.0.1","remote_port":33898,"status":200,"method":"POST","path":"/tokenize","params":{}}
{"timestamp":1696493589,"level":"INFO","function":"log_server_request","line":1157,"message":"request","remote_addr":"127.0.0.1","remote_port":33898,"status":200,"method":"POST","path":"/tokenize","params":{}}

llama_print_timings:        load time =  2340.12 ms
llama_print_timings:      sample time =     8.56 ms /    16 runs   (    0.54 ms per token,  1868.50 tokens per second)
llama_print_timings: prompt eval time =  2336.24 ms /   200 tokens (   11.68 ms per token,    85.61 tokens per second)
llama_print_timings:        eval time =  1561.37 ms /    15 runs   (  104.09 ms per token,     9.61 tokens per second)
llama_print_timings:       total time =  3913.05 ms
{"timestamp":1696493593,"level":"INFO","function":"log_server_request","line":1157,"message":"request","remote_addr":"127.0.0.1","remote_port":33898,"status":200,"method":"POST","path":"/completion","params":{}}
{"timestamp":1696493593,"level":"INFO","function":"log_server_request","line":1157,"message":"request","remote_addr":"127.0.0.1","remote_port":33908,"status":200,"method":"POST","path":"/tokenize","params":{}}
[GIN] 2023/10/05 - 08:13:13 | 200 | 15.473051185s |    10.124.36.86 | POST     "/api/generate"

I think we can highlight the 2023/10/05 08:12:57 routes.go:86: changing loaded model line at the beginning. Something in the request from Continue likely tells Ollama that the currently loaded model is not the requested one, causing it to reload.

I tested this with codellama:7b and codellama:34b as well.

sestinj commented 1 year ago

@FrouxBY It makes sense that loading a different model would cause this. In your config.py, do you have any of the model roles set to something other than default? (If you're not sure, then this is unlikely; it would look like summarize=... or edit=...)

The exact request we make to Ollama is this:

        self._client_session.post(
            f"{self.server_url}/api/generate",
            json={
                "template": prompt,
                "model": self.model,
                "system": self.system_message,
                "options": {"temperature": options.temperature},
            },
            proxy=self.proxy,
        )

so if it's a problem with that, then the following curl command would have the same problem (the only differences are the template and system params):

curl -X POST http://localhost:11434/api/generate -d '{
  "model": "codellama:7b",
  "template":"Why is the sky blue?",
  "system": "Give a concise answer"
}'

If this is a matter of making simultaneous requests, then the following should completely solve the problem: https://continue.dev/docs/reference/Models/queuedllm

Can you try both this curl request and the QueuedLLM? If the curl request works and QueuedLLM does not, then I'll get in touch with the Ollama guys to see if there's something else going on here.
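
For reference, the idea behind QueuedLLM is simply to serialize requests so that Ollama never sees two at once. Conceptually it is something like the sketch below; this is only an illustration of the approach, not the actual QueuedLLM implementation.

import asyncio

class SerializedLLM:
    """Wraps an async complete() so only one request is in flight at a time."""

    def __init__(self, llm):
        self.llm = llm
        self._lock = asyncio.Lock()

    async def complete(self, prompt, **options):
        # Later callers wait here until the earlier request finishes,
        # so the backend never has to handle simultaneous generations.
        async with self._lock:
            return await self.llm.complete(prompt, **options)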

FrouxBY commented 12 months ago

Hi @sestinj, thanks for your input,

I have analyzed the requests from Continue to my Ollama instance and discovered an alternation between prompts with temperature set to options.temperature and prompts with temperature = null.

Typing "write a C hello world", Ollama gets the JSON: {"template": "[INST] write a C hello world [/INST]", "model": "codellama:34b", "system": null, "options": {"temperature": 0.5}}

Then Continue sends this prompt, using the previous input and Ollama's answer:

{"template": "[INST] \" Sure! Here's an example of a \"Hello, World!\" program in the C programming language:\n```\n#include <stdio.h>\n\nint main() {\n printf(\"Hello, World!\\n\");\n return 0;\n}\n```\nThis program uses theprintffunction to print the string \"Hello, World!\" to the screen, followed by a newline character (\n). Thereturn 0;statement at the end of themainfunction indicates that the program has completed successfully.\"\n\nPlease write a short title summarizing the message quoted above. Use no more than 10 words: [/INST]", "model": "codellama:34b", "system": null, "options": {"temperature": null}}

On this request to "summarize" the output in 10 words, the temperature is null, which causes Ollama to reload the model with this new temperature parameter.

Then the next request has temperature = options.temperature again, causing another reload, and so on.
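
For anyone who wants to reproduce this outside of Continue, alternating two requests that differ only in their temperature option showed the same "changing loaded model" line in my Ollama log. A rough Python sketch (it assumes Ollama at localhost:11434 with codellama:34b pulled):

import json
import urllib.request

def generate(payload):
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        resp.read()  # drain the streamed response

# First request: temperature set, like Continue's chat prompt.
generate({
    "model": "codellama:34b",
    "template": "[INST] write a C hello world [/INST]",
    "options": {"temperature": 0.5},
})

# Second request: temperature null, like Continue's summary prompt.
# On this Ollama version, that difference alone triggers "changing loaded model".
generate({
    "model": "codellama:34b",
    "template": "[INST] summarize the message above in 10 words [/INST]",
    "options": {"temperature": None},
})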

sestinj commented 12 months ago

@FrouxBY Thanks for figuring this out! This is quite interesting. We can definitely avoid this by using the last-used temperature by default, but until then there is another possible solution:

In config.py, you can set config=ContinueConfig(..., disable_summaries=True), which will stop these extra requests.

https://continue.dev/docs/reference/config#:~:text=token%20is%20provided.-,disable_summaries,-(boolean)%20%3D
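
For anyone else hitting this, a config.py along these lines should work. The import paths and the Ollama keyword names here are from memory and may differ between Continue versions, so treat it as a sketch and check the docs linked above.

from continuedev.core.config import ContinueConfig
from continuedev.core.models import Models
from continuedev.libs.llm.ollama import Ollama

config = ContinueConfig(
    models=Models(
        # server_url points at the remote Ollama instance (hostname is a placeholder).
        default=Ollama(model="codellama:34b", server_url="http://my-ollama-server:11434"),
    ),
    # Skip the follow-up summary requests that arrive with temperature=null
    # and force Ollama to reload the model between prompts.
    disable_summaries=True,
)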

AustinSaintAubin commented 11 months ago

@sestinj & @FrouxBY, thank you for your efforts on this. It seems your conclusion is correct (as far as I have tested). Thank you for the quick fix to disable summaries in config.py as well.

sestinj commented 10 months ago

@AustinSaintAubin @FrouxBY Ollama recently solved this problem on their side, so I'm going to close this issue: https://github.com/jmorganca/ollama/releases#:~:text=Faster%20model%20switching%3A%20models%20will%20now%20stay%20loaded%20between%20requests%20when%20using%20different%20parameters%20(e.g.%20temperature)%20or%20system%20prompts