dusty-nv / jetson-containers

Machine Learning Containers for NVIDIA Jetson and JetPack-L4T
MIT License

add ollama build container to packages #465

Closed · remy415 closed this 6 months ago

remy415 commented 6 months ago

Adds the ollama docker build to the jetson-containers packages. The test scripts hang on my system, but the containers build successfully.

remy415 commented 6 months ago

@dusty-nv I could use some insight into the testing/build process if you have some time. I've created a PR to add an ollama package to jetson-containers. The container seems to build correctly, but it keeps hanging in the test phase regardless of what I change in the test script or in the config.py file.

I removed the test.sh scripts from the Dockerfile and from config.py, but the build still tries to load them and hangs until I Ctrl-C out. I even let it sit for a couple of hours once with no further output on the command line.

Here's the end of the log after the build:

Successfully built 9fd96e933ff4
Successfully tagged 10.8.8.8:5001/ollama-r35.4.1-ollama:latest
-- Testing container 10.8.8.8:5001/ollama-r35.4.1-ollama (ollama/test.sh)

docker run -t --rm --runtime=nvidia --network=host \
--volume /home/tegra/ok3d/ollama-container/dev/jetson-containers/packages/llm/ollama:/test \
--volume /home/tegra/ok3d/ollama-container/dev/jetson-containers/data:/data \
--workdir /test \
10.8.8.8:5001/ollama-r35.4.1-ollama \
/bin/bash -c '/bin/bash test.sh' \
2>&1 | tee /home/tegra/ok3d/ollama-container/dev/jetson-containers/logs/20240404_165837/test/10.8.8.8_5001_ollama-r35.4.1-ollama_test.sh.txt; exit ${PIPESTATUS[0]}

root@ok3d-1:/test#

^CTraceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/tegra/ok3d/ollama-container/dev/jetson-containers/jetson_containers/build.py", line 102, in <module>
    build_container(args.name, args.packages, args.base, args.build_flags, args.simulate, args.skip_tests, args.test_only, args.push, args.no_github_api)
  File "/home/tegra/ok3d/ollama-container/dev/jetson-containers/jetson_containers/container.py", line 150, in build_container
    test_container(container_name, pkg, simulate)
  File "/home/tegra/ok3d/ollama-container/dev/jetson-containers/jetson_containers/container.py", line 322, in test_container
    status = subprocess.run(cmd.replace(_NEWLINE_, ' '), executable='/bin/bash', shell=True, check=True)
  File "/usr/lib/python3.8/subprocess.py", line 495, in run
    stdout, stderr = process.communicate(input, timeout=timeout)
  File "/usr/lib/python3.8/subprocess.py", line 1020, in communicate
    self.wait()
  File "/usr/lib/python3.8/subprocess.py", line 1083, in wait
    return self._wait(timeout=timeout)
  File "/usr/lib/python3.8/subprocess.py", line 1806, in _wait
    (pid, sts) = self._try_wait(0)
  File "/usr/lib/python3.8/subprocess.py", line 1764, in _try_wait
    (pid, sts) = os.waitpid(self.pid, wait_flags)
KeyboardInterrupt
dusty-nv commented 6 months ago

Thanks @remy415, we appreciate it! 🙏😄

I will try this and figure out what is going on with the tests! The tests aside, are you actually able to run a model through ollama in the container?

remy415 commented 6 months ago

In the bottom screenshot, the green GPU area showed nearly 100% usage throughout the response. It answered my question very quickly; I expected it to take longer since I'm used to running Mistral 7B, but it worked fantastically.

I did need to set OLLAMA_HOST=<node ip> on the "client" side so it knew where to connect.
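For reference, the same thing works with the ollama CLI outside of docker; a minimal sketch (the node IP is from my setup, substitute your own):

# Point the ollama client at the server before running a model
export OLLAMA_HOST=10.8.8.101
ollama run tinyllama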

tegra@ok3d-1:~/ok3d/ollama-container/dev/jetson-containers$ docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama-rc4 10.8.8.8:5001/ollama-r35.4.1-ollama:latest
71964ec5430e741ac7778fab3fdf6d2035d3d12a8054f87d8c37283d782249da
tegra@ok3d-1:~/ok3d/ollama-container/dev/jetson-containers$ docker logs ollama-rc4
time=2024-04-04T20:51:03.561Z level=INFO source=images.go:793 msg="total blobs: 5"
time=2024-04-04T20:51:03.562Z level=INFO source=images.go:800 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.

[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
 - using env:   export GIN_MODE=release
 - using code:  gin.SetMode(gin.ReleaseMode)

[GIN-debug] POST   /api/pull                 --> github.com/ollama/ollama/server.PullModelHandler (5 handlers)
[GIN-debug] POST   /api/generate             --> github.com/ollama/ollama/server.GenerateHandler (5 handlers)
[GIN-debug] POST   /api/chat                 --> github.com/ollama/ollama/server.ChatHandler (5 handlers)
[GIN-debug] POST   /api/embeddings           --> github.com/ollama/ollama/server.EmbeddingsHandler (5 handlers)
[GIN-debug] POST   /api/create               --> github.com/ollama/ollama/server.CreateModelHandler (5 handlers)
[GIN-debug] POST   /api/push                 --> github.com/ollama/ollama/server.PushModelHandler (5 handlers)
[GIN-debug] POST   /api/copy                 --> github.com/ollama/ollama/server.CopyModelHandler (5 handlers)
[GIN-debug] DELETE /api/delete               --> github.com/ollama/ollama/server.DeleteModelHandler (5 handlers)
[GIN-debug] POST   /api/show                 --> github.com/ollama/ollama/server.ShowModelHandler (5 handlers)
[GIN-debug] POST   /api/blobs/:digest        --> github.com/ollama/ollama/server.CreateBlobHandler (5 handlers)
[GIN-debug] HEAD   /api/blobs/:digest        --> github.com/ollama/ollama/server.HeadBlobHandler (5 handlers)
[GIN-debug] POST   /v1/chat/completions      --> github.com/ollama/ollama/server.ChatHandler (6 handlers)
[GIN-debug] GET    /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] GET    /api/tags                 --> github.com/ollama/ollama/server.ListModelsHandler (5 handlers)
[GIN-debug] GET    /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD   /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] HEAD   /api/tags                 --> github.com/ollama/ollama/server.ListModelsHandler (5 handlers)
[GIN-debug] HEAD   /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
time=2024-04-04T20:51:03.563Z level=INFO source=routes.go:1121 msg="Listening on [::]:11434 (version 0.0.0)"
time=2024-04-04T20:51:03.564Z level=INFO source=payload.go:28 msg="extracting embedded files" dir=/tmp/ollama1042252847/runners
time=2024-04-04T20:51:11.918Z level=INFO source=payload.go:41 msg="Dynamic LLM libraries [cpu cuda_v11]"
time=2024-04-04T20:51:11.919Z level=INFO source=gpu.go:121 msg="Detecting GPU type"
time=2024-04-04T20:51:11.919Z level=INFO source=gpu.go:268 msg="Searching for GPU management library libcudart.so*"
time=2024-04-04T20:51:11.924Z level=INFO source=gpu.go:314 msg="Discovered GPU libraries: [/tmp/ollama1042252847/runners/cuda_v11/libcudart.so.11.0 /usr/local/cuda/lib64/libcudart.so.11.4.298 /usr/local/cuda/targets/aarch64-linux/lib/libcudart.so.11.4.298 /usr/local/cuda-11/targets/aarch64-linux/lib/libcudart.so.11.4.298 /usr/local/cuda-11.4/targets/aarch64-linux/lib/libcudart.so.11.4.298]"
time=2024-04-04T20:51:11.940Z level=INFO source=gpu.go:126 msg="Nvidia GPU detected via cudart"
time=2024-04-04T20:51:11.940Z level=INFO source=cpu_common.go:18 msg="CPU does not have vector extensions"
time=2024-04-04T20:51:12.024Z level=INFO source=gpu.go:202 msg="[cudart] CUDART CUDA Compute Capability detected: 8.7"
tegra@ok3d-1:~/ok3d/ollama-container/dev/jetson-containers$ docker run -itu0 --rm -e OLLAMA_HOST=10.8.8.101 10.8.8.8:5001/ollama-r35.4.1-ollama:latest '/bin/ollama run tinyllama'
pulling manifest
pulling 2af3b81862c6... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 637 MB
pulling af0ddbdaaa26... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████▏   70 B
pulling c8472cd9daed... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████▏   31 B
pulling fa956ab37b8c... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████▏   98 B
pulling 6331358be52a... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████▏  483 B
verifying sha256 digest
writing manifest
removing any unused layers
success
>>> What is CUDA?
CUDA (Computational Understanding for Dynamic Assessment) is an open-source software that provides comprehensive and automated tools for analyzing complex real-world
problems in various domains, such as engineering, finance, and healthcare. It allows users to model, simulate, and analyze complex systems in a fast, efficient, and
error-free manner, with minimal human intervention. CUDA is designed to help solve critical problems in fields such as energy, materials science, and biomedical
research, among others.

>>> /bye

[screenshot: GPU utilization during the response]

remy415 commented 6 months ago

Note that with the -v ollama:/root/.ollama flag, docker will create a local named volume and mount it in the container at /root/.ollama; ollama will automatically download model files and store them there. If you want the models stored in a directory outside the container instead, change it to -v <path to local folder>:/root/.ollama.
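For example, to keep the models in a host directory instead of a named docker volume (the host path below is just a placeholder):

# Bind-mount a host directory as the ollama model cache
docker run -d --gpus=all \
  -v /home/tegra/ollama-models:/root/.ollama \
  -p 11434:11434 \
  10.8.8.8:5001/ollama-r35.4.1-ollama:latest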

remy415 commented 6 months ago

14 seconds to execute the docker run, type in "What is CUDA?", get a response, and type in "/bye". Tinyllama is fast on the Orin Nano, though I think it's a bit confused about the meaning of CUDA. Can't complain though; the model is under 1 GB.

tegra@ok3d-1:~/ok3d/ollama-container/dev/jetson-containers$ date
Thu 04 Apr 2024 09:34:25 PM UTC
tegra@ok3d-1:~/ok3d/ollama-container/dev/jetson-containers$ docker run -itu0 --rm -e OLLAMA_HOST=10.8.8.101 10.8.8.8:5001/ollama-r35.4.1-ollama:latest '/bin/ollama run tinyllama'
>>> What is CUDA?
CUDA, or Comprehensive Undergraduate Data Analytics, is an undergraduate analytics program offered by the University of Pennsylvania (Penn). The program emphasizes the
use of statistical computing and data analysis techniques to solve complex business problems. It provides a foundation in mathematics and statistics, as well as
practical applications in various industries such as finance, marketing, and healthcare. The program is designed for students who want to pursue careers in analytics
and is offered on campus and online.

>>> /bye
tegra@ok3d-1:~/ok3d/ollama-container/dev/jetson-containers$ date
Thu 04 Apr 2024 09:34:39 PM UTC
dusty-nv commented 6 months ago

Ok great, it seems like it is working well and using the GPU. Awesome work!

When I merge/test this, I will add jetson-containers/data/models/ollama and symlink it to /root/.ollama inside the container, so the models will automatically be cached there (that jetson-containers/data location is automatically mounted into the container under /data if you use run.sh)
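Roughly, the effect should be equivalent to something like this (a sketch only, not the actual run.sh implementation; the host path is a placeholder):

# jetson-containers/data gets bind-mounted at /data; with /root/.ollama symlinked
# to /data/models/ollama inside the container, downloaded models persist on the host
docker run -it --rm --runtime=nvidia --network=host \
  --volume /path/to/jetson-containers/data:/data \
  10.8.8.8:5001/ollama-r35.4.1-ollama:latest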

remy415 commented 6 months ago

@dusty-nv great, thank you!

I forgot to mention that I couldn't find much documentation on how the various benchmarks in the packages folder are implemented. The benchmark I have set up runs the server in the current terminal and then attempts to curl the API; it may be better to assume the backend is already running and just pass the server IP to curl.
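For reference, the flow of the current test is roughly the following (a simplified sketch; the actual test.sh in the PR differs in detail, and the sleep and model name are placeholders):

#!/usr/bin/env bash
# Start the server in this terminal, give it time to come up, then curl the API
ollama serve &
sleep 10
curl -s http://localhost:11434/api/generate \
  -d '{"model": "tinyllama", "prompt": "What is CUDA?"}'
# Alternative: skip 'ollama serve' and point curl at an already-running backend,
# e.g. http://<server ip>:11434/api/generate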

dusty-nv commented 6 months ago

@remy415 ok! I made some minor tweaks to the ollama container in https://github.com/dusty-nv/jetson-containers/commit/413d5aff9d3120b29fdef697575ed762979e6a89, got it working, pushed images for JP5/JP6 to DockerHub, and merged it into master. Thank you for everything and for all the upstream work to enable this on Jetson!

Given your contributions, please feel free to make a new topic on the Jetson Projects forum announcing this; I know the community has been asking for it. Otherwise I will post it next week.

Some notes (quoted in the reply below):

remy415 commented 6 months ago

Awesome, I'm really excited for this, thank you!

I've already discussed some of these options with the Ollama devs, and I think the solution is to provide custom build flags for the containers. I'll share some of their feedback inline below:

> LLAMA_CUDA_F16 is disabled for the original Nano, but it should be possible to just detect whether 53 is in CMAKE_CUDA_ARCHITECTURES (and if not, enable FP16 for Xavier/Orin, which are the primary platforms for edge LLM nowadays)

The ollama developers wanted to prioritize compatibility over smaller performance gains in their general binary distribution. I disabled LLAMA_CUDA_F16 because it wasn't compiling when CMAKE_CUDA_ARCHITECTURES was < 60. They were hesitant to include a "Jetson only" path in their general build script, which I totally get since it's a smaller market and Jetson otherwise mostly shares the same code as standard Linux + CUDA builds.

> Correct me if I'm wrong, but I believe LLAMA_CUDA_FORCE_MMQ=on is for batching, which Jetson users aren't really using, and they could instead benefit from CUDA_USE_TENSOR_CORES

You are absolutely correct, and the reason for this is again compatibility with older systems. They said they didn't see a substantial performance gain in tests with this option enabled, which may be true for beefier cards, but my theory is that it may make a difference on Jetson devices.

I will play around with the build flags and see if I can override those options with the current ollama build(s). If not, I will open a PR with them to add support for changing the options on the fly. I think the key might be the custom CPU flags; if that works, I will post an update.
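For illustration, the kind of override I have in mind looks roughly like this (a hypothetical sketch; the flag names come from the discussion above, and whether ollama's build scripts honor them unmodified is exactly what I need to test):

# Target Xavier (sm_72) and Orin (sm_87) only, skipping the original Nano (sm_53)
export CMAKE_CUDA_ARCHITECTURES="72;87"

# Per the note above: enable FP16 only when sm_53 is not among the targets
EXTRA_CMAKE_DEFS=""
if [[ ";${CMAKE_CUDA_ARCHITECTURES};" != *";53;"* ]]; then
    EXTRA_CMAKE_DEFS="-DLLAMA_CUDA_F16=on"
fi

# Prefer tensor cores over forced MMQ for single-stream Jetson workloads
EXTRA_CMAKE_DEFS="${EXTRA_CMAKE_DEFS} -DLLAMA_CUDA_FORCE_MMQ=off -DCUDA_USE_TENSOR_CORES=on"

(EXTRA_CMAKE_DEFS here is just an illustrative variable name, not something ollama's build reads today.)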