dusty-nv / jetson-containers

Machine Learning Containers for NVIDIA Jetson and JetPack-L4T
MIT License

Unable to run Llamaspeak on Jetson Orin NX 16GB #563

Open JQZhai opened 1 week ago

JQZhai commented 1 week ago

Hi, I'm trying to run llamaspeak following the instructions at https://www.jetson-ai-lab.com/tutorial_llamaspeak.html

Specs: Jetson Orin NX (16GB) Developer Kit, JetPack 6.0 [L4T 36.3.0]

The Riva server is up and running, and the ASR and TTS examples work just fine.

When I run the following command in /path/to/jetson-containers:

```
root@ubuntu:/# python3 -m nano_llm.agents.web_chat --api=mlc --model /data/models/Meta-Llama-3-8B-Instruct/ --asr=riva --tts=piper
```

the response is:

```
/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:124: FutureWarning: Using TRANSFORMERS_CACHE is deprecated and will be removed in v5 of Transformers. Use HF_HOME instead.
  warnings.warn(
06:43:49 | INFO | loading /data/models/Meta-Llama-3-8B-Instruct/ with MLC
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
06:43:51 | INFO | running MLC quantization:

python3 -m mlc_llm.build --model /data/models/mlc/dist/models/ --quantization q4f16_ft --target cuda --use-cuda-graph --use-flash-attn-mqa --sep-embed --max-seq-len 8192 --artifact-path /data/models/mlc/dist/-ctx8192

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/mlc_llm/build.py", line 47, in <module>
    main()
  File "/usr/local/lib/python3.10/dist-packages/mlc_llm/build.py", line 41, in main
    parsed_args = core._parse_args(parsed_args)  # pylint: disable=protected-access
  File "/usr/local/lib/python3.10/dist-packages/mlc_llm/core.py", line 444, in _parse_args
    parsed = _setup_model_path(parsed)
  File "/usr/local/lib/python3.10/dist-packages/mlc_llm/core.py", line 494, in _setup_model_path
    validate_config(args.model_path)
  File "/usr/local/lib/python3.10/dist-packages/mlc_llm/core.py", line 529, in validate_config
    assert os.path.exists(
AssertionError: Expecting HuggingFace config, but file not found: /data/models/mlc/dist/models/config.json.
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/NanoLLM/nano_llm/agents/web_chat.py", line 327, in <module>
    agent = WebChat(**vars(args))
  File "/opt/NanoLLM/nano_llm/agents/web_chat.py", line 32, in __init__
    super().__init__(**kwargs)
  File "/opt/NanoLLM/nano_llm/agents/voice_chat.py", line 30, in __init__
    self.llm = ChatQuery(**kwargs)  # ProcessProxy('ChatQuery', **kwargs)
  File "/opt/NanoLLM/nano_llm/plugins/chat_query.py", line 78, in __init__
    self.model = NanoLLM.from_pretrained(model, **kwargs)
  File "/opt/NanoLLM/nano_llm/nano_llm.py", line 73, in from_pretrained
    model = MLCModel(model_path, **kwargs)
  File "/opt/NanoLLM/nano_llm/models/mlc.py", line 60, in __init__
    quant = MLCModel.quantize(self.model_path, self.config, method=quantization, max_context_len=max_context_len, **kwargs)
  File "/opt/NanoLLM/nano_llm/models/mlc.py", line 271, in quantize
    subprocess.run(cmd, executable='/bin/bash', shell=True, check=True)
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'python3 -m mlc_llm.build --model /data/models/mlc/dist/models/ --quantization q4f16_ft --target cuda --use-cuda-graph --use-flash-attn-mqa --sep-embed --max-seq-len 8192 --artifact-path /data/models/mlc/dist/-ctx8192' returned non-zero exit status 1.
```

Any help or recommendations are appreciated.

dusty-nv commented 6 days ago

Hi @JQZhai, normally there would be terminal output from the MLC subprocess while it is quantizing the model, so perhaps your board is running out of memory? If you haven't already, try disabling ZRAM and mounting swap (https://github.com/dusty-nv/jetson-containers/blob/master/docs/setup.md#mounting-swap), roughly as sketched below.
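The steps look something like this (a sketch only; the swap file size and the /ssd path are examples, so follow the linked doc for your particular storage layout):

```
# disable ZRAM (the default compressed swap in RAM)
sudo systemctl disable nvzramconfig

# create and enable a swap file on disk (example: 16GB on an attached SSD)
sudo fallocate -l 16G /ssd/16GB.swap
sudo mkswap /ssd/16GB.swap
sudo swapon /ssd/16GB.swap

# optionally add it to /etc/fstab so it persists across reboots:
# /ssd/16GB.swap  none  swap  sw 0  0
```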

Another recommendation would be to do the quantization without all the other stuff running, by just using the console-based nano_llm.chat program with llama-3-8b first (for example, the command below). It will run the quantization, and then when you start llamaspeak the quantized model will already be cached on disk.
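For instance, something along these lines (reusing the same model path from your command above; adjust it to wherever the model actually lives):

```
python3 -m nano_llm.chat --api=mlc --model /data/models/Meta-Llama-3-8B-Instruct/
```

Once that completes, the quantized artifacts end up under /data/models/mlc/dist/ (as seen in your log), so later launches of llamaspeak can skip the build step.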

Also, you can try --asr=whisper instead, since it uses less memory (remember to shut down the Riva server if you do that, so you actually get the savings). See the example below.
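That would just be your original command with the ASR flag swapped, e.g. with the same model and TTS settings as before:

```
python3 -m nano_llm.agents.web_chat --api=mlc --model /data/models/Meta-Llama-3-8B-Instruct/ --asr=whisper --tts=piper
```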

JQZhai commented 5 days ago

After restarting the Orin, it ran successfully. Thanks for your work.