dusty-nv / NanoLLM

Optimized local inference for LLMs with HuggingFace-like APIs for quantization, vision/language models, multimodal agents, speech, vector DB, and RAG.
https://dusty-nv.github.io/NanoLLM/
MIT License

NanoVLM live streaming demo works with VILA-2.7b, but not VILA1.5-3b #5

Closed rsun-bdti closed 6 months ago

rsun-bdti commented 6 months ago

I can run the NanoVLM live-streaming demo with model VILA-2.7b, but not with VILA1.5-3b. I wonder whether I am missing something or whether there is a bug somewhere.

Platform: Jetson AGX Orin 64 GB
Environment: L4T_VERSION=36.2.0, JETPACK_VERSION=6.0, CUDA_VERSION=12.2
Code base: https://github.com/dusty-nv/NanoLLM, version 24.4.2
Docker image: dustynv/nano_llm, tag 24.4-r36.2.0

Steps to repeat with VILA-2.7b:

  1. Use the command jetson-containers run $(autotag nano_llm) to launch the docker container.
  2. Within the container, use the command python3 -m nano_llm.agents.video_query --api=mlc --model Efficient-Large-Model/VILA-2.7b --max-context-len 768 --max-new-tokens 32 --video-input /dev/video0 --video-output webrtc://localhost:8554/output to start the demo.
  3. The demo runs as expected.

Steps to repeat with VILA1.5-3b:

  1. Use the command jetson-containers run $(autotag nano_llm) to launch the docker container.
  2. Within the container, use the command python3 -m nano_llm.agents.video_query --api=mlc --model Efficient-Large-Model/VILA1.5-3b --max-context-len 256 --max-new-tokens 32 --video-input /dev/video0 --video-output webrtc://localhost:8554/output to start the demo.
  3. The code crashes with the following error message:

    17:14:53 | INFO | loading /data/models/huggingface/models--Efficient-Large-Model--VILA1.5-3b/snapshots/699b413ed13620957e955bd7fb938852afa258fc with MLC
    17:14:54 | INFO | running MLC quantization: python3 -m mlc_llm.build --model /data/models/mlc/dist/models/VILA1.5-3b --quantization q4f16_ft --target cuda --use-cuda-graph --use-flash-attn-mqa --sep-embed --max-seq-len 256 --artifact-path /data/models/mlc/dist/VILA1.5-3b-ctx256
    Using path "/data/models/mlc/dist/models/VILA1.5-3b" for model "VILA1.5-3b"
    Target configured: cuda -keys=cuda,gpu -arch=sm_87 -max_num_threads=1024 -max_shared_memory_per_block=49152 -max_threads_per_block=1024 -registers_per_block=65536 -thread_warp_size=32
    Traceback (most recent call last):
      File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
        exec(code, run_globals)
      File "/usr/local/lib/python3.10/dist-packages/mlc_llm/build.py", line 47, in <module>
        main()
      File "/usr/local/lib/python3.10/dist-packages/mlc_llm/build.py", line 43, in main
        core.build_model_from_args(parsed_args)
      File "/usr/local/lib/python3.10/dist-packages/mlc_llm/core.py", line 859, in build_model_from_args
        mod, param_manager, params, model_config = model_generators[args.model_category].get_model(
      File "/usr/local/lib/python3.10/dist-packages/mlc_llm/relax_model/llama.py", line 1453, in get_model
        raise Exception(
    Exception: The model config should contain information about maximum sequence length.
    Process Process-1:
    Traceback (most recent call last):
      File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
        exec(code, run_globals)
      File "/opt/NanoLLM/nano_llm/agents/video_query.py", line 358, in <module>
        agent = VideoQuery(**vars(args)).run()
      File "/opt/NanoLLM/nano_llm/agents/video_query.py", line 44, in __init__
        self.llm = ProcessProxy('ChatQuery', model=model, drop_inputs=True, vision_scaling=vision_scaling, **kwargs) #ProcessProxy((lambda **kwargs: ChatQuery(model, drop_inputs=True, **kwargs)), **kwargs)
      File "/opt/NanoLLM/nano_llm/plugins/process_proxy.py", line 38, in __init__
        raise RuntimeError(f"subprocess has an invalid initialization status ({init_msg['status']})")
    RuntimeError: subprocess has an invalid initialization status (<class 'subprocess.CalledProcessError'>)
    Traceback (most recent call last):
      File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
        self.run()
      File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
        self._target(*self._args, **self._kwargs)
      File "/opt/NanoLLM/nano_llm/plugins/process_proxy.py", line 132, in run_process
        raise error
      File "/opt/NanoLLM/nano_llm/plugins/process_proxy.py", line 126, in run_process
        self.plugin = ChatQuery(**kwargs)
      File "/opt/NanoLLM/nano_llm/plugins/chat_query.py", line 70, in __init__
        self.model = NanoLLM.from_pretrained(model, **kwargs)
      File "/opt/NanoLLM/nano_llm/nano_llm.py", line 73, in from_pretrained
        model = MLCModel(model_path, **kwargs)
      File "/opt/NanoLLM/nano_llm/models/mlc.py", line 59, in __init__
        quant = MLCModel.quantize(self.model_path, self.config, method=quantization, max_context_len=max_context_len, **kwargs)
      File "/opt/NanoLLM/nano_llm/models/mlc.py", line 278, in quantize
        subprocess.run(cmd, executable='/bin/bash', shell=True, check=True)
      File "/usr/lib/python3.10/subprocess.py", line 526, in run
        raise CalledProcessError(retcode, process.args,
    subprocess.CalledProcessError: Command 'python3 -m mlc_llm.build --model /data/models/mlc/dist/models/VILA1.5-3b --quantization q4f16_ft --target cuda --use-cuda-graph --use-flash-attn-mqa --sep-embed --max-seq-len 256 --artifact-path /data/models/mlc/dist/VILA1.5-3b-ctx256' returned non-zero exit status 1.

dusty-nv commented 6 months ago

Hi @rsun-bdti, can you try pulling dustynv/nano_llm:24.5-r36.2.0 instead? There were updates for VILA-1.5 support in the 24.5 release of NanoLLM: https://dusty-nv.github.io/NanoLLM/releases.html
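
For reference, a minimal way to do that (a sketch assuming the usual jetson-containers workflow; autotag may keep resolving to the older 24.4 image if it is already cached locally, so pulling and naming the tag explicitly avoids that):

    # pull the updated image and start it explicitly, bypassing autotag
    docker pull dustynv/nano_llm:24.5-r36.2.0
    jetson-containers run dustynv/nano_llm:24.5-r36.2.0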

rsun-bdti commented 6 months ago

Hi Dustin, thanks for your timely response. I pulled docker image 24.5-r36.2.0. However, the same error persists. I guess I need to regenerate and/or re-quantize the TensorRT model. How do I force the script to do that?

dusty-nv commented 6 months ago

You can try deleting the folders jetson-containers/data/models/clip and jetson-containers/data/models/mlc/dist/vila1.5* so they get regenerated.
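
For example, roughly (assuming the default jetson-containers data mount on the host; note the VILA1.5 artifact directories in the log above are capitalized, so match the glob to what is actually on disk, and sudo may be needed if the files were created as root inside the container):

    # on the host, from the jetson-containers checkout
    rm -rf data/models/clip
    rm -rf data/models/mlc/dist/VILA1.5*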

You can also try running with --vision-api=hf to disable TensorRT for CLIP.
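
For example, re-running your earlier command with that flag added (untested sketch; the other arguments are unchanged from your report):

    python3 -m nano_llm.agents.video_query --api=mlc --vision-api=hf \
        --model Efficient-Large-Model/VILA1.5-3b \
        --max-context-len 256 --max-new-tokens 32 \
        --video-input /dev/video0 --video-output webrtc://localhost:8554/output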

Also make sure you are actually running the right container.
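
A quick way to check that, for example:

    # on the host: confirm which image the running container was started from
    docker ps --format '{{.Image}}  {{.Names}}'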

rsun-bdti commented 6 months ago

Hey Dustin, thanks for your timely help! Now the live-streaming demo with VILA1.5-3b is running.

One more question: I saw in your video of the live-streaming demo that you changed the prompt on the fly while the demo was running. How did you do that? I have not figured out how to switch to other prompts in prompt_history.

Thanks again! That’s some great stuff. -Robby

dusty-nv commented 6 months ago

OK, great that you got it working @rsun-bdti ! If you navigate to the web UI on port 8050 (not the WebRTC debug viewer on port 8554), there should be a drop-down under the video stream where you can either enter your own prompts or select from the pre-populated ones.
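
Roughly, from a browser on the same network (replace <jetson-ip> with your Orin's address; the page is normally served over HTTPS with a self-signed certificate, so you may need to click through a browser warning):

    # agent web UI with the prompt drop-down (not the :8554 debug stream)
    https://<jetson-ip>:8050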

rsun-bdti commented 6 months ago

Great! Thanks a lot for your timely help. I will close this issue.