jcpraud opened 1 month ago
Thank you so much for bringing this question up. First of all, I would like to make sure that you are using the CUDA version of the Nexa SDK for Windows:
$env:CMAKE_ARGS="-DGGML_CUDA=ON -DSD_CUBLAS=ON"; pip install nexaai --prefer-binary --index-url https://nexaai.github.io/nexa-sdk/whl/cu124 --extra-index-url https://pypi.org/simple --no-cache-dir
If you confirm the installation, then we can move forward on your question. In any case, as a developer of this SDK, I can assure you that all inference (which can be verified in our open-source repo) runs completely on-device, with no online LLM involved :).
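To double-check which build ended up installed, a minimal sketch like the one below can help. It assumes only the package name `nexaai` from the pip command above and that `nvidia-smi` (which ships with the NVIDIA driver) is on PATH; it is not part of the Nexa SDK itself.

```python
# Sketch: confirm the installed nexaai package and that an NVIDIA GPU
# is visible to the driver. Standard library only.
import subprocess
from importlib import metadata

print("nexaai version:", metadata.version("nexaai"))

# nvidia-smi -L lists detected GPUs; a missing binary or nonzero exit
# code suggests the CUDA build will not find a usable GPU.
result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
print(result.stdout.strip() or result.stderr.strip())
```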
Yes, I used this command line for the install (as explained on the project page: https://github.com/NexaAI/nexa-sdk):
set CMAKE_ARGS="-DGGML_CUDA=ON -DSD_CUBLAS=ON" & pip install nexaai --prefer-binary --index-url https://nexaai.github.io/nexa-sdk/whl/cu124 --extra-index-url https://pypi.org/simple --no-cache-dir
I tested gemma2-9b; there are activity spikes on the GPU: 100% usage every 3-4 (and up to 10) seconds, with 3.8 GB of the GPU's 4 GB VRAM in use and CPU usage at 17%. Overall GPU consumption remains lower than when running the same model on Ollama, which continuously consumes 50% CPU and 30-40% GPU but only 2.8 GB of VRAM. Token output seems 1.5-2x faster on Ollama than on Nexa, so no magic after all :)
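For anyone reproducing these figures, a minimal polling sketch along these lines samples utilization and VRAM once per second. It assumes only that `nvidia-smi` is on PATH; the query flags are standard `nvidia-smi` options.

```python
# Sketch: poll GPU utilization and VRAM via nvidia-smi to capture the
# usage spikes described above. Stop with Ctrl+C.
import subprocess
import time

QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]

while True:
    out = subprocess.run(QUERY, capture_output=True, text=True).stdout.strip()
    util, used, total = [v.strip() for v in out.split(",")]
    print(f"GPU {util:>3}%  VRAM {used}/{total} MiB")
    time.sleep(1)
```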
I'm planning to test further at work next week, on a VM with more CPU and RAM but no GPU. Ollama is far slower in that environment, of course; I'll compare it against Nexa's loss of performance, using the same models and prompts.
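For that comparison, a runtime-agnostic tokens-per-second helper could look like the sketch below. Here `generate_stream` is a hypothetical placeholder for whichever streaming API each runtime exposes; it is not a real Nexa or Ollama function.

```python
# Sketch: measure token throughput for any streaming text generator,
# so the same prompt can be timed against different runtimes.
import time
from typing import Callable, Iterable

def tokens_per_second(generate_stream: Callable[[str], Iterable[str]],
                      prompt: str) -> float:
    """Consume the stream, counting tokens, and return tokens/second."""
    start = time.perf_counter()
    n_tokens = sum(1 for _ in generate_stream(prompt))
    return n_tokens / (time.perf_counter() - start)

# Usage (hypothetical): tokens_per_second(my_nexa_stream, "Hello!")
```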
Question or Issue
Hi all,
I just installed and began testing NexaAI on my Win11 laptop... I first tested the Qwen2.5:7b LLM, and...
What kind of magic is this? Or more probably, what did I miss?
As a security specialist, and thus paranoid, I even ran my tests without any network connection to prevent cheating with online LLMs ;)
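For the same guarantee without unplugging, an in-process check is possible: block outbound sockets before loading the model, so any online call fails loudly. This is an illustrative, process-local sketch only; networking done in C extensions would bypass it.

```python
# Sketch: forbid pure-Python network access for this process by
# replacing socket.socket before the SDK or model is loaded.
import socket

def _blocked(*args, **kwargs):
    raise RuntimeError("network access blocked for this offline test")

socket.socket = _blocked  # monkeypatch; apply before importing the SDK
```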
Cheers, JC
OS
Windows 11
Python Version
3.12.7
Nexa SDK Version
0.0.8.6
GPU (if using one)
NVIDIA RTX 3050 Ti