NexaAI / nexa-sdk

Nexa SDK is a comprehensive toolkit supporting GGML and ONNX models. It provides text generation, image generation, vision-language model (VLM), audio-language model, automatic speech recognition (ASR), and text-to-speech (TTS) capabilities.
https://docs.nexa.ai/

[QUESTION] Weird consumption of CPU/GPU (or lack of) #147

Open jcpraud opened 1 month ago

jcpraud commented 1 month ago

Question or Issue

Hi all,

I just installed and began to test NexaAI on my Win11 laptop... I first tested Qwen2.5:7b LLM, and...

What kind of magic is this? Or more probably, what did I miss?

As a security specialist, and thus paranoid, I even ran my tests without any network connection, to rule out any cheating with online LLMs ;)

Cheers, JC

OS

Windows 11

Python Version

3.12.7

Nexa SDK Version

0.0.8.6

GPU (if using one)

NVIDIA RTX 3050 Ti

zhycheng614 commented 1 month ago

Thank you so much for bringing this question up. First of all, I would like to make sure that you are using the CUDA version of Nexa SDK for Windows: $env:CMAKE_ARGS="-DGGML_CUDA=ON -DSD_CUBLAS=ON"; pip install nexaai --prefer-binary --index-url https://nexaai.github.io/nexa-sdk/whl/cu124 --extra-index-url https://pypi.org/simple --no-cache-dir.
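For clarity, here is the same install as a standalone PowerShell snippet, followed by a generic sanity check. The `pip show` and `nvidia-smi` lines are only my suggestion for verifying the install and GPU activity, not an official Nexa verification step:

```powershell
# CUDA-enabled (cu124) build of Nexa SDK on Windows -- same command as above.
$env:CMAKE_ARGS = "-DGGML_CUDA=ON -DSD_CUBLAS=ON"
pip install nexaai --prefer-binary `
  --index-url https://nexaai.github.io/nexa-sdk/whl/cu124 `
  --extra-index-url https://pypi.org/simple `
  --no-cache-dir

# Generic sanity checks (not Nexa-specific):
pip show nexaai   # confirm the installed nexaai version
nvidia-smi -l 1   # watch GPU utilization and VRAM while a model is running
```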

If you can confirm the installation, then we can move forward with your question. In any case, as a developer of this SDK, I can guarantee that all inference (which you can verify in our open-source repo) runs completely on-device, with no online LLM involved :).

jcpraud commented 1 month ago

Yes, I used this command line for the install (as explained on the project page: https://github.com/NexaAI/nexa-sdk):

set CMAKE_ARGS="-DGGML_CUDA=ON -DSD_CUBLAS=ON" & pip install nexaai --prefer-binary --index-url https://nexaai.github.io/nexa-sdk/whl/cu124 --extra-index-url https://pypi.org/simple --no-cache-dir

I tested gemma2-9b, and there are activity spikes on the GPU: 100% usage every 3-4 seconds (sometimes up to 10). 3.8 GB of the GPU's 4 GB VRAM is used, and CPU usage is at 17%. Overall GPU consumption remains lower than running the same model on Ollama, which continuously consumes 50% CPU and 30-40% GPU, yet only 2.8 GB of VRAM. Token output seems 1.5 to 2x quicker on Ollama than on Nexa, so no magic after all :)

I'm planning to test further at work next week, on a VM with more CPU and RAM but no GPU. Ollama is far slower in that environment, of course; I'll compare its loss of performance against Nexa's, using the same models and prompts.
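To put rough numbers on that comparison, something like the sketch below should work (the model tag and prompt are placeholders, and the `--verbose` statistics output is an Ollama feature; I am not showing an exact Nexa equivalent here):

```powershell
# Rough wall-clock timing for one prompt on Ollama (placeholder model tag and prompt).
Measure-Command {
  ollama run gemma2:9b "Summarize the GGML quantization formats in one paragraph."
}

# Ollama can also report prompt/eval tokens per second directly:
ollama run gemma2:9b --verbose "Summarize the GGML quantization formats in one paragraph."
```

The same prompt would then be run through Nexa on the same VM to compare wall-clock time and tokens per second.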