continuedev / continue

⏩ Continue is the leading open-source AI code assistant. You can connect any models and any context to build custom autocomplete and chat experiences inside VS Code and JetBrains
https://docs.continue.dev/
Apache License 2.0

TimeoutError() with ollama #449

Closed dyb5784 closed 1 year ago

dyb5784 commented 1 year ago

Describe the bug
TimeoutError() using local llama2-uncensored

To Reproduce
Steps to reproduce the behavior:

  1. MS VS Code Continue extension configured to use Ollama.
  2. Started the Ollama server from a terminal.
  3. config.py configured to use the model llama2-uncensored (see the sketch after this list).
  4. Right-clicked "Ask Continue" on a warning line; the Ollama logs show activity, churning.
  5. A couple of minutes later: TimeoutError()
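
For reference, a minimal config.py along the lines of step 3 might look like the sketch below. This is an assumption, not the reporter's actual file: the import paths are inferred from the continuedev\src\continuedev\... module paths in the traceback further down, and the ContinueConfig wrapper is assumed; only the Models/Ollama usage and the llama2-uncensored model name come from this thread.

```python
# Hypothetical config.py sketch (not the exact file from this report).
# Import paths are inferred from the "continuedev\src\continuedev\..." paths
# in the traceback below and may differ in your installed version.
from continuedev.src.continuedev.core.config import ContinueConfig
from continuedev.src.continuedev.core.models import Models
from continuedev.src.continuedev.libs.llm.ollama import Ollama

config = ContinueConfig(
    models=Models(
        # Point Continue at a locally running Ollama server serving llama2-uncensored
        default=Ollama(model="llama2-uncensored"),
    ),
)
```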

Environment

Logs
TimeoutError()

Traceback (most recent call last):

File "continuedev\src\continuedev\core\autopilot.py", line 368, in _run_singular_step observation = await step(self.continue_sdk)

File "continuedev\src\continuedev\core\main.py", line 359, in call return await self.run(sdk)

File "continuedev\src\continuedev\plugins\steps\chat.py", line 98, in run async for chunk in generator:

File "continuedev\src\continuedev\libs\llm\ollama.py", line 105, in _stream_chat async with self._client_session.post(

File "aiohttp\client.py", line 1141, in aenter

File "aiohttp\client.py", line 560, in _request

File "aiohttp\client_reqrep.py", line 894, in start

File "aiohttp\helpers.py", line 721, in exit

asyncio.exceptions.TimeoutError

llama.cpp: loading model from C:\Users\danie\.ollama\models\blobs\sha256-ed1043d21e9811e0ba9e9d72f2c3b451cb63ffcc26032b8958cc486ddca005a4
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 5504
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_head_kv  = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.08 MB
llama_model_load_internal: mem required  = 3615.73 MB (+ 1024.00 MB per state)
llama_new_context_with_model: kv self size  = 1024.00 MB
llama_new_context_with_model: compute buffer total size =  153.35 MB
2023/09/03 19:21:49 llama.go:298: prompt: num_past=0 cached=0 eval=111
[GIN] 2023/09/03 - 19:22:49 | 200 |         1m13s |       127.0.0.1 | POST     "/api/generate"

llama_print_timings:        load time =  7989.75 ms
llama_print_timings:      sample time =     1.17 ms /     1 runs   (    1.17 ms per token,  
 853.24 tokens per second)
llama_print_timings: prompt eval time = 59614.27 ms /   111 tokens (  537.07 ms per token,  
   1.86 tokens per second)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per token,  
    inf tokens per second)
llama_print_timings:       total time = 59616.67 ms
llama.cpp: loading model from C:\Users\danie\.ollama\models\blobs\sha256-b5749cc827d33b7cb4c8869cede7b296a0a28d9e5d1982705c2ba4c603258159
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_head_kv  = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.08 MB
llama_model_load_internal: mem required  = 3615.73 MB (+ 1024.00 MB per state)
llama_new_context_with_model: kv self size  = 1024.00 MB
llama_new_context_with_model: compute buffer total size =  153.35 MB
2023/09/03 19:22:50 llama.go:298: prompt: num_past=0 cached=0 eval=301
[GIN] 2023/09/03 - 19:26:36 | 200 |         4m57s |       127.0.0.1 | POST     "/api/generate"

llama_print_timings:        load time =  1385.03 ms
llama_print_timings:      sample time =   122.86 ms /   108 runs   (    1.14 ms per token,  
 879.08 tokens per second)
llama_print_timings: prompt eval time = 153909.35 ms /   301 tokens (  511.33 ms per token, 
    1.96 tokens per second)
llama_print_timings:        eval time = 72650.51 ms /   108 runs   (  672.69 ms per token,  
   1.49 tokens per second)
llama_print_timings:       total time = 226709.76 ms
llama.cpp: loading model from C:\Users\danie\.ollama\models\blobs\sha256-ed1043d21e9811e0ba9e9d72f2c3b451cb63ffcc26032b8958cc486ddca005a4
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 5504
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_head_kv  = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.08 MB
llama_model_load_internal: mem required  = 3615.73 MB (+ 1024.00 MB per state)
llama_new_context_with_model: kv self size  = 1024.00 MB
llama_new_context_with_model: compute buffer total size =  153.35 MB
2023/09/03 19:26:39 llama.go:298: prompt: num_past=0 cached=0 eval=25
[GIN] 2023/09/03 - 19:27:02 | 200 |         4m13s |       127.0.0.1 | POST     "/api/generate"

llama_print_timings:        load time =  1505.09 ms
llama_print_timings:      sample time =    22.89 ms /    21 runs   (    1.09 ms per token,  
 917.63 tokens per second)
llama_print_timings: prompt eval time = 12335.81 ms /    25 tokens (  493.43 ms per token,  
   2.03 tokens per second)
llama_print_timings:        eval time = 10771.90 ms /    20 runs   (  538.60 ms per token,  
   1.86 tokens per second)
llama_print_timings:       total time = 323139.51 ms

To get the Continue server logs:

  1. cmd+shift+p (macOS) / ctrl+shift+p (Windows)
  2. Search for and then select "Continue: View Continue Server Logs"
  3. Scroll to the bottom of continue.log and copy the last 100 lines or so

To get the VS Code console logs:

  1. cmd+shift+p (macOS) / ctrl+shift+p (Windows)
  2. Search for and then select "Developer: Toggle Developer Tools"
  3. Select Console
  4. Read the console logs

If the problem is related to LLM prompting:

  1. Hover the problematic response in the Continue UI
  2. Click the "magnifying glass" icon
  3. Copy the contents of the continue_logs.txt file that opens

Screenshots
If applicable, add screenshots to help explain your problem.

Additional context
Add any other context about the problem here.

dyb5784 commented 1 year ago

IYH Fixed it by adding Models(default=Ollama(..., timeout=3600)) to the model instantiation, as suggested in a previous timeout issue.

sestinj commented 1 year ago

I'd only added the timeout option to the GGML class, so the timeout=3600 here would be ignored... maybe it was a coincidence that Ollama just didn't time out the second time?

Regardless, I will add the timeout option to Ollama as well in the next update.

dyb5784 commented 1 year ago

IYH Thank you for your quick and informative reply. Indeed it must have been a coincidence, because I was not able to avoid the timeouts when I tried again later once or twice (also because it timed out well before an hour). Headscratcher resolved :D Thanks for planning to add it to Ollama, appreciated!

sestinj commented 1 year ago

Of course! The update is now ready. Example usage: Ollama(timeout=3600, model="codellama", ...) (this gives a one-hour timeout: 60 seconds * 60 minutes = 3600). You'll be able to do this on any of the LLM classes now as well, if you end up switching.
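
In config.py terms, the new option might be used as in the sketch below. The ContinueConfig/Models wrapper and import paths are the same assumptions as in the earlier sketch; only the Ollama(timeout=3600, model="codellama", ...) usage comes from the comment above.

```python
# Hypothetical config.py snippet showing the new timeout option; the structure
# around the Ollama(...) call is assumed, not taken verbatim from this thread.
from continuedev.src.continuedev.core.config import ContinueConfig
from continuedev.src.continuedev.core.models import Models
from continuedev.src.continuedev.libs.llm.ollama import Ollama

config = ContinueConfig(
    models=Models(
        default=Ollama(
            model="codellama",
            timeout=3600,  # 60 seconds * 60 minutes = one hour before the request times out
        ),
    ),
)
```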

dyb5784 commented 1 year ago

Works, thank you very much!