dusty-nv / jetson-containers

Machine Learning Containers for NVIDIA Jetson and JetPack-L4T

Unable to run minigpt4 on Jetson Orin Nano 8GB #342

Open jedld opened 9 months ago

jedld commented 9 months ago

Tried running the minigpt4 webui demo (https://www.jetson-ai-lab.com/tutorial_minigpt4.html), and my device keeps locking up after running the run script. Figuring I might not have enough memory for it, I opted for a smaller model and followed the memory-tuning instructions here:

https://www.jetson-ai-lab.com/tips_ram-optimization.html

So I set up swap, ran headless, and disabled some services.
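For anyone following along, this is roughly what I ran (adapted from that page; the swapfile size and path are my own choices, so check the tutorial for the current commands):

# boot to console instead of the desktop GUI (frees several hundred MB):
sudo systemctl set-default multi-user.target

# disable zram and mount a disk-backed swapfile instead:
sudo systemctl disable nvzramconfig
sudo fallocate -l 22G /mnt/22GB.swap
sudo chmod 600 /mnt/22GB.swap
sudo mkswap /mnt/22GB.swap
sudo swapon /mnt/22GB.swap
# append to /etc/fstab to persist across reboots:
#   /mnt/22GB.swap  none  swap  sw  0  0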

joseph@joseph-orin:~$ free -h
              total        used        free      shared  buff/cache   available
Mem:          7.3Gi       544Mi       6.1Gi        18Mi       702Mi       6.6Gi
Swap:          22Gi          0B        22Gi
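While a run is in progress, memory can also be watched from a second terminal with standard tools (nothing here is specific to jetson-containers):

# follow RAM and swap usage once a second:
watch -n 1 free -h
# or use the L4T-provided tegrastats for RAM + GPU stats in one line:
sudo tegrastats --interval 1000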

Ran the benchmark again, headless, using the command below:

joseph@joseph-orin:~/workspace/jetson-containers$ ./run.sh --workdir=/opt/minigpt4.cpp/minigpt4/ $(./autotag minigpt4) /bin/bash -c \
> 'python3 benchmark.py --max-new-tokens=32 --runs=1 \
>   $(huggingface-downloader --type=dataset maknee/minigpt4-7b-ggml/minigpt4-7B-q3_k.bin) \
>   $(huggingface-downloader --type=dataset maknee/ggml-vicuna-v0-quantized/ggml-vicuna-7B-v0-q3_k.bin)'

Still got killed.

Namespace(packages=['minigpt4'], prefer=['local', 'registry', 'build'], disable=[''], user='dustynv', output='/tmp/autotag', quiet=False, verbose=False)
-- L4T_VERSION=35.4.1  JETPACK_VERSION=5.1.2  CUDA_VERSION=11.4.315
-- Finding compatible container image for ['minigpt4']
dustynv/minigpt4:r35.3.1
+ docker run --runtime nvidia -it --rm --network host --volume /tmp/argus_socket:/tmp/argus_socket --volume /etc/enctune.conf:/etc/enctune.conf --volume /etc/nv_tegra_release:/etc/nv_tegra_release --volume /tmp/nv_jetson_model:/tmp/nv_jetson_model --volume /home/joseph/workspace/jetson-containers/data:/data --device /dev/snd --device /dev/bus/usb --workdir=/opt/minigpt4.cpp/minigpt4/ dustynv/minigpt4:r35.3.1 /bin/bash -c 'python3 benchmark.py --max-new-tokens=32 --runs=1 \
  $(huggingface-downloader --type=dataset maknee/minigpt4-7b-ggml/minigpt4-7B-q3_k.bin) \
  $(huggingface-downloader --type=dataset maknee/ggml-vicuna-v0-quantized/ggml-vicuna-7B-v0-q3_k.bin)'
Downloading maknee/minigpt4-7b-ggml/minigpt4-7B-q3_k.bin to /data/models/huggingface

repo_id  maknee/minigpt4-7b-ggml
filename minigpt4-7B-q3_k.bin

Downloaded maknee/minigpt4-7b-ggml/minigpt4-7B-q3_k.bin to: /data/models/huggingface/datasets--maknee--minigpt4-7b-ggml/snapshots/79340163b5a9a37908610cc71a7e1dd3c7f77889/minigpt4-7B-q3_k.bin

Downloading maknee/ggml-vicuna-v0-quantized/ggml-vicuna-7B-v0-q3_k.bin to /data/models/huggingface

repo_id  maknee/ggml-vicuna-v0-quantized
filename ggml-vicuna-7B-v0-q3_k.bin

Downloaded maknee/ggml-vicuna-v0-quantized/ggml-vicuna-7B-v0-q3_k.bin to: /data/models/huggingface/datasets--maknee--ggml-vicuna-v0-quantized/snapshots/1d8789f34eb803bf52daf895c7ecfd2559cf5ccc/ggml-vicuna-7B-v0-q3_k.bin
Namespace(image='/data/images/hoover.jpg', llm_model_path='/data/models/huggingface/datasets--maknee--ggml-vicuna-v0-quantized/snapshots/1d8789f34eb803bf52daf895c7ecfd2559cf5ccc/ggml-vicuna-7B-v0-q3_k.bin', max_new_tokens=32, model_path='/data/models/huggingface/datasets--maknee--minigpt4-7b-ggml/snapshots/79340163b5a9a37908610cc71a7e1dd3c7f77889/minigpt4-7B-q3_k.bin', prompt=['What does the sign in the image say?', 'How far is the exit?', 'What kind of environment is it in?', "Does it look like it's going to rain?"], runs=1, save='', warmup=1)
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Orin, compute capability 8.7
llama.cpp: loading model from /data/models/huggingface/datasets--maknee--ggml-vicuna-v0-quantized/snapshots/1d8789f34eb803bf52daf895c7ecfd2559cf5ccc/ggml-vicuna-7B-v0-q3_k.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32001
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_head_kv  = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 12 (mostly Q3_K - Medium)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.08 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  =  468.40 MB (+ 1024.00 MB per state)
llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 384 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 32 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 35/35 layers to GPU
llama_model_load_internal: total VRAM used: 4632 MB
llama_new_context_with_model: kv self size  = 1024.00 MB
INFO: LLM model init took 8089 ms to complete
INFO: Model name: visual_encoder
INFO: Model name: ln_vision
INFO: Model name: query_tokens
INFO: Model name: Qformer
INFO: Model name: llama_proj
INFO: Load file took 500 ms to complete
INFO: Model type: Vicuna7B
INFO: Model size: 458.5078125 MB
INFO: Loading minigpt4 model took 3 ms to complete
INFO: Load model from file took 12688 ms to complete
-- opening /data/images/hoover.jpg
./run.sh: line 72:  4223 Killed                  $SUDO docker run --runtime nvidia -it --rm --network host --volume /tmp/argus_socket:/tmp/argus_socket --volume /etc/enctune.conf:/etc/enctune.conf --volume /etc/nv_tegra_release:/etc/nv_tegra_release --volume /tmp/nv_jetson_model:/tmp/nv_jetson_model --volume $ROOT/data:/data --device /dev/snd --device /dev/bus/usb $DATA_VOLUME $DISPLAY_DEVICE $V4L2_DEVICES "$@"

Running dmesg confirms it:

[  756.721433] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/colord.service,task=colord,pid=968,uid=119
[  756.721477] Out of memory: Killed process 968 (colord) total-vm:246928kB, anon-rss:36kB, file-rss:0kB, shmem-rss:0kB, UID:119 pgtables:112kB oom_score_adj:0
[  756.723986] systemd-journald[243]: /dev/kmsg buffer overrun, some messages lost.
[  756.772964] Out of memory: Killed process 2864 (ubuntu-report) total-vm:702196kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:144kB oom_score_adj:0
[  756.803160] Out of memory: Killed process 3010 (goa-daemon) total-vm:552132kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:244kB oom_score_adj:0
[  756.835238] Out of memory: Killed process 543 (haveged) total-vm:7948kB, anon-rss:0kB, file-rss:4kB, shmem-rss:0kB, UID:0 pgtables:48kB oom_score_adj:0
[  756.871944] Out of memory: Killed process 552 (systemd-resolve) total-vm:24652kB, anon-rss:60kB, file-rss:0kB, shmem-rss:0kB, UID:105 pgtables:88kB oom_score_adj:0
[  756.900103] Out of memory: Killed process 1339 (whoopsie) total-vm:254640kB, anon-rss:0kB, file-rss:4kB, shmem-rss:0kB, UID:109 pgtables:128kB oom_score_adj:0
[  756.923393] Out of memory: Killed process 2033 (cups-proxyd) total-vm:242344kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:96kB oom_score_adj:0
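Note the global_oom / CONSTRAINT_NONE in that log: the whole board is out of memory, so the kernel is killing unrelated system services, not just the container. To catch it in real time on the next attempt (standard tools again):

# follow the kernel log for oom-kill events as they happen:
sudo dmesg --follow | grep -iE 'oom|killed'
# and watch the container's memory usage from the host:
docker stats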

The webui shows similar behavior: after adding swap I got to the point where the Gradio UI comes up, but attempting to upload an image triggers the same OOM issue.

Would appreciate any tips or help.

CurlyWurly-1 commented 8 months ago

I'm getting the same problem using the latest JetPack and the recommended actions. Hope this works on the Orin Nano again soon.

jedld commented 8 months ago

I got it somewhat working by using the lowest quantization possible and running CPU-only (removing the CUDA options from the minigpt4 docker build). But it would be nice if it worked out of the box.
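For reference, the workaround was along these lines. Treat it as a sketch: the CMake option name is from memory and may not match the current minigpt4.cpp CMakeLists, and the q2_k filenames below need to be verified against the Hugging Face repos.

# in the minigpt4 package build, compile minigpt4.cpp without CUDA
# (option name may differ -- check minigpt4.cpp's CMakeLists.txt):
cmake -B build -DMINIGPT4_CUBLAS=OFF
cmake --build build --config Release

# then point the benchmark at the smallest quantization available:
./run.sh --workdir=/opt/minigpt4.cpp/minigpt4/ $(./autotag minigpt4) /bin/bash -c \
  'python3 benchmark.py --max-new-tokens=32 --runs=1 \
    $(huggingface-downloader --type=dataset maknee/minigpt4-7b-ggml/minigpt4-7B-q2_k.bin) \
    $(huggingface-downloader --type=dataset maknee/ggml-vicuna-v0-quantized/ggml-vicuna-7B-v0-q2_k.bin)'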

UserName-wang commented 7 months ago

I'm using an AGX Xavier (which has 32GB) and it runs successfully, but the response is too slow. I also tried it on an AGX Orin, and it's the same. Is that normal?

dusty-nv commented 7 months ago

@UserName-wang MiniGPT-4 is not the most optimized; Llava is the one we have optimized more. You can see the benchmarks at https://www.jetson-ai-lab.com/benchmarks.html

For the most optimized VLM pipeline with Llava-1.5, see https://www.jetson-ai-lab.com/tutorial_llava.html#4-optimized-multimodal-pipeline-with-local_llm
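The local_llm invocation there is roughly of this shape (the model name and flags may have changed since, so follow the tutorial page for the exact command):

# sketch of the optimized Llava pipeline from the tutorial:
./run.sh $(./autotag local_llm) \
  python3 -m local_llm --api=mlc \
    --model liuhaotian/llava-v1.5-7b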