Closed MrDelusionAI closed 7 months ago
Hi @MrDelusionAI, I haven't used ollama, but my understanding is that it uses llama.cpp, and I have the container for that here. If you look at the dockerfiles for many of the containers in the repo, there is a pattern of getting these 3rd-party projects to build on ARM64+CUDA with the correct configuration and settings, sometimes requiring patches, etc.
If this is the first LLM you are running on Jetson, I would try oobabooga first: https://www.jetson-ai-lab.com/tutorial_text-generation.html
That can also expose an OpenAI-compatible server endpoint, and llama.cpp has one too. So you could write an application using the openai client Python library, or use llama.cpp's Python API (included in the container), which is good to use.
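As a rough sketch of what talking to one of those OpenAI-compatible endpoints looks like: the snippet below builds a standard `/v1/chat/completions` request body. The base URL, port, and model name are placeholders, not values from this thread — they depend entirely on how you launch the server.

```python
import json

# Hypothetical local endpoint -- adjust host/port to match how your
# server (text-generation-webui or llama.cpp's server) was started.
BASE_URL = "http://localhost:8000/v1"

def build_chat_request(prompt, model="local-model", max_tokens=128):
    """Build the JSON body for an OpenAI-compatible /chat/completions call."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

if __name__ == "__main__":
    payload = build_chat_request("What is CUDA?")
    print(json.dumps(payload, indent=2))
    # To actually send it (requires the server to be running locally):
    #   import requests
    #   r = requests.post(f"{BASE_URL}/chat/completions", json=payload, timeout=60)
    #   print(r.json()["choices"][0]["message"]["content"])
```

The point of the OpenAI-compatible endpoint is exactly this: the same client code works whether the backend is oobabooga, llama.cpp, or a hosted service — only the base URL changes.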
The fastest LLM inference currently available on Jetson is with MLC, which I have a container for. It is also supported in my local_llm library, which provides a HuggingFace Transformers-style API and an agent framework.
Great, thanks Dusty. Yeah, I have run your containers successfully — thanks for all the work you have done and for making it available.
I will have a look and see if I can understand the patterns to see how to build these projects to work on Jetson.
Thanks again for the fast reply and all the information.
Evening all
I am getting my head into running LLMs etc. on the Jetson rather than on my desktop PC with a GPU. I have spent a few hours trying to understand how I can make calls to the GPU rather than the CPU. If I wanted to get projects like Ollama/Ollama Web and Fooocus to use the GPU, what would be the easiest way to do that? Or is it more complicated?
Thanks all