dusty-nv / jetson-containers

Machine Learning Containers for NVIDIA Jetson and JetPack-L4T
MIT License

Not sure if ollama:r36.2.0 is using GPU #491

Open · UserName-wang opened this issue 2 months ago

UserName-wang commented 2 months ago

Dear @dusty-nv, I pulled dustynv/ollama:r36.2.0 on a Jetson Orin 32GB dev kit and ran the command `jetson-containers run --name ollama $(autotag ollama)`. The output is:

```
[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.

[GIN-debug] POST   /api/pull                 --> github.com/ollama/ollama/server.(*Server).PullModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/generate             --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers)
[GIN-debug] POST   /api/chat                 --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers)
[GIN-debug] POST   /api/embeddings           --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
[GIN-debug] POST   /api/create               --> github.com/ollama/ollama/server.(*Server).CreateModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/push                 --> github.com/ollama/ollama/server.(*Server).PushModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/copy                 --> github.com/ollama/ollama/server.(*Server).CopyModelHandler-fm (5 handlers)
[GIN-debug] DELETE /api/delete               --> github.com/ollama/ollama/server.(*Server).DeleteModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/show                 --> github.com/ollama/ollama/server.(*Server).ShowModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
[GIN-debug] POST   /v1/chat/completions      --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
[GIN-debug] GET    /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] GET    /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] GET    /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD   /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] HEAD   /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
time=2024-04-27T00:32:16.148Z level=INFO source=routes.go:1064 msg="Listening on [::]:11434 (version 0.0.0)"
time=2024-04-27T00:32:16.149Z level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama359642117/runners
time=2024-04-27T00:32:26.579Z level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cuda_v12]"
time=2024-04-27T00:32:26.579Z level=INFO source=gpu.go:96 msg="Detecting GPUs"
time=2024-04-27T00:32:26.657Z level=INFO source=gpu.go:101 msg="detected GPUs" library=/tmp/ollama359642117/runners/cuda_v12/libcudart.so.12 count=1
time=2024-04-27T00:32:26.658Z level=INFO source=cpu_common.go:18 msg="CPU does not have vector extensions"
```

Inside the container I tried several models (llama3:latest, llava:34b) and checked GPU usage with `nvidia-smi`. The output is always:

```
Sat Apr 27 08:27:01 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 540.2.0                Driver Version: N/A          CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Orin (nvgpu)                   N/A | N/A              N/A |                  N/A |
| N/A  N/A    N/A            N/A /    N/A |        Not Supported |      N/A         N/A |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
```

It seems the GPU is not being used by ollama?

Token output from llama3:latest is quite fast, but llava:34b is quite slow, and llava:34b's CPU usage is much higher than llama3's.
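For reference, one way to drive a generation through the ollama API while watching a GPU monitor in another terminal. This is a minimal sketch based on the routes visible in the GIN log above; the port 11434 comes from the "Listening on [::]:11434" line, and the model name is one from this thread:

```bash
# Pull a model through the API (the /api/pull route from the log above),
# then run a one-off generation. Watch GPU load in a second terminal
# while the generate request is running.
curl http://localhost:11434/api/pull -d '{"name": "llama3"}'
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```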

TadayukiOkada commented 2 months ago

Have you tried jtop (https://github.com/rbonghi/jetson_stats)? I see GPU activity in jtop when I run ollama, though I'm on JetPack 5.1.3. A sketch of installing and running it is below. Note that llama3:latest is the 8B model at q4 quantization, so it's expected that llava:34b is slower. llama3:70b would be much slower still (and you might not be able to run a 70B model on a 32GB Orin at all).
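As mentioned above, a minimal sketch for getting jtop running, assuming pip3 is available on the Jetson host (jetson-stats is installed on the host, not inside the container):

```bash
# Install jetson-stats on the host; it provides the jtop monitor.
sudo pip3 install -U jetson-stats

# A reboot (or restarting the jtop service / re-logging in) may be
# needed after the first install before jtop can read the stats.
jtop
```

While a model is generating, the GPU bar in jtop should show activity if ollama is actually using CUDA.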

dusty-nv commented 2 months ago

@UserName-wang yes, nvidia-smi isn't fully supported on Jetson. As @TadayukiOkada suggested, use jtop or tegrastats instead, and for an optimized VLM see https://www.jetson-ai-lab.com/tutorial_nano-vlm.html
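For reference, a minimal tegrastats check; tegrastats ships with JetPack/L4T, and the GR3D_FREQ field in its output is the GPU load:

```bash
# Print utilization once per second; GR3D_FREQ is the GPU utilization.
# It should spike above 0% while ollama is generating on the GPU.
sudo tegrastats --interval 1000
```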

UserName-wang commented 2 months ago

Thank you for your help, @dusty-nv @TadayukiOkada! jtop confirms the GPU is being used.