coleam00 / bolt.new-any-llm

Prompt, run, edit, and deploy full-stack web applications using any LLM you want!
https://bolt.new
MIT License

Ollama custom Modelfile is listed in the models but reloads it with larger token value #313

Open dinopio opened 4 days ago

dinopio commented 4 days ago

Describe the bug

I have 2x 3090 (48 GB VRAM total) and a Modelfile with:

```
FROM qwen2.5-coder:32b
PARAMETER num_ctx 13108
```

These settings load a 46 GB model as shown below, but when I select qwen2.5-coder-extra:32b in the UI, it ignores this value and loads a different setup, also shown below.
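To sanity-check that the parameter was actually baked into the custom model, you can print its stored Modelfile (a quick check using the standard `ollama show --modelfile` flag):

```
# Print the Modelfile saved with the custom model;
# it should include the num_ctx line from above.
ollama show --modelfile qwen2.5-coder-extra:32b
# Expected to contain:
#   PARAMETER num_ctx 13108
```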

Steps to reproduce

```
ollama create -f Modelfile.txt qwen2.5-coder-extra:32b
transferring model data
using existing layer sha256:ac3d1ba8aa77755dab3806d9024e9c385ea0d5b412d6bdf9157f8a4a7e9fc0d9
using existing layer sha256:66b9ea09bd5b7099cbb4fc820f31b575c0366fa439b08245566692c6784e281e
using existing layer sha256:e94a8ecb9327ded799604a2e478659bc759230fe316c50d686358f932f52776c
using existing layer sha256:832dd9e00a68dd83b3c3fb9f5588dad7dcf337a0db50f7d9483f310cd292e92e
creating new layer sha256:df12c224da7e82a21f15043db905793bb9716baffab641d77bfbd9aa3523639c
creating new layer sha256:84057986a245837e60c68c548ba53604da84ed9546baec83104a8e893fae4e02
writing manifest
success
```

```
ollama run qwen2.5-coder-extra:32b

ollama ps
NAME                     ID            SIZE   PROCESSOR  UNTIL
qwen2.5-coder-extra:32b  7db2b6ecc2e9  46 GB  100% GPU   Forever
```

Selecting it from the model list in the UI gives this result instead, which uses RAM and CPU rather than the 46 GB GPU-only version:

```
ollama ps
NAME                     ID            SIZE   PROCESSOR        UNTIL
qwen2.5-coder-extra:32b  7db2b6ecc2e9  84 GB  44%/56% CPU/GPU  Forever
```

Is there something I have missed in the Ollama setup?

Expected behavior

Load the expected 46 GB model, which was built with PARAMETER num_ctx 13108.

Screen Recording / Screenshot

No response

Platform

kekePower commented 4 days ago

AFAIK, you do not have to do `ollama run model` to use it with Ottodev. As long as it has been downloaded and is listed in `ollama list`, it should work just fine.

When you run it, you load it into memory, and then it gets loaded one more time when you use it in Ottodev. Hence the double memory issue.
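If that is the cause, you can make sure the CLI-loaded copy is gone before touching the UI. A minimal sketch (`ollama stop` ships with recent Ollama releases; sending `keep_alive: 0` is the documented API way to unload a model immediately):

```
# Unload the copy that `ollama run` left resident (recent Ollama releases)
ollama stop qwen2.5-coder-extra:32b

# Or tell the server to unload it right away via the API
curl http://localhost:11434/api/generate \
  -d '{"model": "qwen2.5-coder-extra:32b", "keep_alive": 0}'

# Confirm nothing is resident before loading the model from the UI
ollama ps
```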

Hope this made sense.

dinopio commented 3 days ago

> AFAIK, you do not have to do `ollama run model` to use it with Ottodev. As long as it has been downloaded and is listed in `ollama list`, it should work just fine.
>
> When you run it, you load it into memory, and then it gets loaded one more time when you use it in Ottodev. Hence the double memory issue.
>
> Hope this made sense.

That isn't what's happening. I showed the model loaded in Ollama to demonstrate the actual GPU usage. When the UI loads it from a clean state, it's double the size.

JaySurplus commented 2 days ago

Update: I believe I've identified the root of the problem: OLLAMA_NUM_PARALLEL needs to be adjusted.
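If that's right, it would explain the inflated size: as I understand the Ollama docs, the server reserves KV cache for each parallel slot, so the effective context is num_ctx × OLLAMA_NUM_PARALLEL (a rough sketch of the arithmetic, not an exact memory accounting):

```
# Effective context the server allocates for a loaded model:
#   num_ctx × OLLAMA_NUM_PARALLEL
# With the custom model above and two parallel slots:
#   13108 × 2 = 26216 tokens of KV cache instead of 13108
```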

I am experiencing the same issue.

I have 2x 3090 and 2x A30. When I run bolt.new and select the qwen2.5-coder:14b model (no ctx modified), `ollama ps` gives a result like the one below:

(Screenshot: `ollama ps` output, 2024-11-20)

JaySurplus commented 2 days ago

I think you have set your Ollama env OLLAMA_NUM_PARALLEL=2. In your case, you need to set it to 1.
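How you apply that depends on how the server is launched (a sketch; adjust the service name or launch command to your install):

```
# If you start the server manually:
OLLAMA_NUM_PARALLEL=1 ollama serve

# If Ollama runs under systemd (the default Linux install):
sudo systemctl edit ollama
#   then add under [Service]:
#   Environment="OLLAMA_NUM_PARALLEL=1"
sudo systemctl restart ollama
```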

> > AFAIK, you do not have to do `ollama run model` to use it with Ottodev. As long as it has been downloaded and is listed in `ollama list`, it should work just fine. When you run it, you load it into memory, and then it gets loaded one more time when you use it in Ottodev. Hence the double memory issue. Hope this made sense.
>
> That isn't what's happening. I showed the model loaded in Ollama to demonstrate the actual GPU usage. When the UI loads it from a clean state, it's double the size.