Closed MrDelusionAI closed 7 months ago
Hi @MrDelusionAI, I haven't used ollama, but my understanding is that it uses llama.cpp, and I have the container for that here. If you look at the dockerfiles for many of the containers in the repo, there is a pattern of getting these 3rd-party projects to build on ARM64+CUDA with the correct configuration and settings, sometimes requiring patches, etc.
If this is the first LLM you are running on Jetson, I would try oobabooga first: https://www.jetson-ai-lab.com/tutorial_text-generation.html
That can also expose an OpenAI-compatible server endpoint, and llama.cpp has one too. So you could write an application using the openai client Python library, or use llama.cpp's Python API (included in the container), which is good to use.
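As a rough sketch of what talking to one of those OpenAI-compatible endpoints looks like: the snippet below builds a standard `/v1/chat/completions` request body. The base URL, port, and model name are placeholders, not values from this thread — they depend entirely on how you launch the server.

```python
import json

# Hypothetical local endpoint -- adjust host/port to match how your
# server (text-generation-webui or llama.cpp's server) was started.
BASE_URL = "http://localhost:8000/v1"

def build_chat_request(prompt, model="local-model", max_tokens=128):
    """Build the JSON body for an OpenAI-compatible /chat/completions call."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

if __name__ == "__main__":
    payload = build_chat_request("What is CUDA?")
    print(json.dumps(payload, indent=2))
    # To actually send it (requires the server to be running locally):
    #   import requests
    #   r = requests.post(f"{BASE_URL}/chat/completions", json=payload, timeout=60)
    #   print(r.json()["choices"][0]["message"]["content"])
```

The point of the OpenAI-compatible endpoint is exactly this: the same client code works whether the backend is oobabooga, llama.cpp, or a hosted service — only the base URL changes.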
The fastest LLM inference currently available on Jetson is with MLC, which I have a container for. It is also supported in my local_llm library, which provides a HuggingFace Transformers-style API and an agent framework.
Great, thanks Dusty. Yeah, I have run your containers successfully — thanks for all the work you have done and for making it available.
I will have a look and see if I can understand the patterns to see how to build these projects to work on Jetson.
Thanks again for the fast reply and all the information.
Evening all
I am getting my head into running LLMs etc. on the Jetson rather than on my desktop PC with a GPU. I have spent a few hours trying to understand how I can make calls to the GPU rather than the CPU. If I wanted to get projects like Ollama/Ollama Web and Fooocus to use the GPU, what would be the easiest way to do that? Or is it more complicated?
Thanks all