Open JQZhai opened 1 week ago
Hi @JQZhai , normally there would be terminal output from the MLC subprocess when it is quantizing the model - perhaps your board is running out of memory? If you haven't already, have you tried disabling ZRAM and mounting SWAP yet? (https://github.com/dusty-nv/jetson-containers/blob/master/docs/setup.md#mounting-swap)
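For reference, the swap setup described in those linked docs looks roughly like this (the 16GB size and `/mnt` path are examples from the docs — adjust for your board and storage):

```shell
# Disable ZRAM (compressed swap held in RAM) so it stops competing for memory
sudo systemctl disable nvzramconfig

# Allocate a swap file on disk/NVMe and enable it
sudo fallocate -l 16G /mnt/16GB.swap
sudo mkswap /mnt/16GB.swap
sudo swapon /mnt/16GB.swap

# (optional) persist the swap file across reboots
# echo "/mnt/16GB.swap  none  swap  sw 0  0" | sudo tee -a /etc/fstab
```

A reboot after disabling ZRAM ensures the old zram devices are gone before the swap file takes over.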
Another recommendation would be to do the quantization without all the other stuff running, just using the console-based `nano_llm.chat` program with llama-3-8b. It will run the quantization, and then when you start llamaspeak it will already be cached on disk.
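A minimal sketch of that pre-quantization run, assuming the same model path as in the failing command above; once it completes, the quantized artifacts are cached under `/data/models/mlc` and llamaspeak can load them directly:

```shell
# Run the console chat client once so MLC quantizes and caches the model,
# with no web UI, Riva, or TTS plugins consuming memory at the same time
python3 -m nano_llm.chat --api=mlc --model /data/models/Meta-Llama-3-8B-Instruct/
```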
Also, you can try `--asr=whisper` instead — it uses less memory (remember to shut down the Riva server if you do that, to actually get the savings).
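For example, this is the original `web_chat` command with Whisper swapped in for Riva ASR (all other flags kept as-is — a sketch, assuming the flags combine the same way):

```shell
python3 -m nano_llm.agents.web_chat --api=mlc --model /data/models/Meta-Llama-3-8B-Instruct/ --asr=whisper --tts=piper
```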
After restarting the Orin, it ran successfully. Thanks for your work.
Hi, I'm trying to run llamaspeak following the instructions at https://www.jetson-ai-lab.com/tutorial_llamaspeak.html
Specs: Jetson Orin NX (16GB) Developer Kit, JetPack 6.0 [L4T 36.3.0]
The Riva server is up and running, and the ASR and TTS examples work just fine.
When I run the following command in /path/to/jetson-containers:

```
root@ubuntu:/# python3 -m nano_llm.agents.web_chat --api=mlc --model /data/models/Meta-Llama-3-8B-Instruct/ --asr=riva --tts=piper
```
The response is:

```
/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:124: FutureWarning: Using TRANSFORMERS_CACHE is deprecated and will be removed in v5 of Transformers. Use HF_HOME instead.
  warnings.warn(
06:43:49 | INFO | loading /data/models/Meta-Llama-3-8B-Instruct/ with MLC
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
06:43:51 | INFO | running MLC quantization: python3 -m mlc_llm.build --model /data/models/mlc/dist/models/ --quantization q4f16_ft --target cuda --use-cuda-graph --use-flash-attn-mqa --sep-embed --max-seq-len 8192 --artifact-path /data/models/mlc/dist/-ctx8192
```
```
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/mlc_llm/build.py", line 47, in <module>
    main()
  File "/usr/local/lib/python3.10/dist-packages/mlc_llm/build.py", line 41, in main
    parsed_args = core._parse_args(parsed_args)  # pylint: disable=protected-access
  File "/usr/local/lib/python3.10/dist-packages/mlc_llm/core.py", line 444, in _parse_args
    parsed = _setup_model_path(parsed)
  File "/usr/local/lib/python3.10/dist-packages/mlc_llm/core.py", line 494, in _setup_model_path
    validate_config(args.model_path)
  File "/usr/local/lib/python3.10/dist-packages/mlc_llm/core.py", line 529, in validate_config
    assert os.path.exists(
AssertionError: Expecting HuggingFace config, but file not found: /data/models/mlc/dist/models/config.json.
```
```
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/NanoLLM/nano_llm/agents/web_chat.py", line 327, in <module>
    agent = WebChat(**vars(args))
  File "/opt/NanoLLM/nano_llm/agents/web_chat.py", line 32, in __init__
    super().__init__(**kwargs)
  File "/opt/NanoLLM/nano_llm/agents/voice_chat.py", line 30, in __init__
    self.llm = ChatQuery(**kwargs)  #ProcessProxy('ChatQuery', **kwargs)
  File "/opt/NanoLLM/nano_llm/plugins/chat_query.py", line 78, in __init__
    self.model = NanoLLM.from_pretrained(model, **kwargs)
  File "/opt/NanoLLM/nano_llm/nano_llm.py", line 73, in from_pretrained
    model = MLCModel(model_path, **kwargs)
  File "/opt/NanoLLM/nano_llm/models/mlc.py", line 60, in __init__
    quant = MLCModel.quantize(self.model_path, self.config, method=quantization, max_context_len=max_context_len, **kwargs)
  File "/opt/NanoLLM/nano_llm/models/mlc.py", line 271, in quantize
    subprocess.run(cmd, executable='/bin/bash', shell=True, check=True)
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'python3 -m mlc_llm.build --model /data/models/mlc/dist/models/ --quantization q4f16_ft --target cuda --use-cuda-graph --use-flash-attn-mqa --sep-embed --max-seq-len 8192 --artifact-path /data/models/mlc/dist/-ctx8192 ' returned non-zero exit status 1.
```
Any help or recommendations are appreciated.