@mylinfh I see you are on JetPack 5, presumably... I recall Llama-3 needing an updated version of MLC; however, newer versions stopped building on JetPack 5. So unfortunately it requires upgrading to JetPack 6 to use it with MLC, or you can run it through another LLM backend like llama.cpp/ollama.
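(For reference, a minimal sketch of the ollama route, assuming ollama is already installed on the device; the `llama3` tag pointing at the 8B instruct model is an assumption, check `ollama list` for what is actually available:)

```
# Pull and chat with Llama-3 via ollama (model tag is an assumption)
ollama run llama3 "Can you tell me a joke about llamas?"
```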
Okay, thank you, I'll try again. llama.cpp/ollama can be used, but the inference time seems to be longer.
@mylinfh if you try running this through nano_llm with `--api=mlc`, it may still work on JetPack 5.
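For example, a minimal sketch of that invocation (assuming the nano_llm container from jetson-containers; the model ID and quantization flag here are placeholders that may differ on your setup):

```
python3 -m nano_llm.chat \
    --api=mlc \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --quantization q4f16_ft
```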
I discovered that llama-3 requires newer MLC when used standalone (my `mlc:0.1.1` version of the container, which doesn't build for JP5), but through NanoLLM it works with the older `mlc:0.1.0` version, because I use the `--sep-embed` flag in NanoLLM when building the LLM model (which runs the embedding layer separately, and this error seems to be inside the embedding layer).
Mmm, yes, thanks for your reply. I can run llama3 using NanoLLM. I also tried deploying inference with Llama2-7B on both NanoLLM and MLC; both were fast, but MLC seems to be faster. So I want to see whether llama3 still performs the same on MLC.
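For a rough apples-to-apples comparison, one option is to feed NanoLLM the same prompt and generation length used in the MLC benchmark below (a sketch only; the `--prompt` and `--max-new-tokens` flag names are assumptions based on the NanoLLM CLI):

```
# Run the same prompt through NanoLLM's MLC backend and compare the reported tokens/sec
python3 -m nano_llm.chat \
    --api=mlc \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --prompt "Can you tell me a joke about llamas?" \
    --max-new-tokens 128
```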
🐛 Bug
I use the jetson-containers build of MLC with the Meta-Llama-3-8B-Instruct model. After I run:

```
python3 -m mlc_llm.build \
    --model Meta-Llama-3-8B-Instruct-hf \
    --quantization q4f16_ft \
    --target cuda \
    --use-cuda-graph \
    --use-flash-attn-mqa \
    --sep-embed \
    --max-seq-len 8192 \
    --artifact-path /data/models/mlc/dist \
    --use-safetensors
```

```
python3 /opt/mlc-llm/benchmark.py \
    --model /data/models/mlc/dist/Meta-Llama-3-8B-Instruct-hf-ctx8192/Meta-Llama-3-8B-Instruct-hf-q4f16_ft/params \
    --prompt "Can you tell me a joke about llamas?" \
    --max-new-tokens 128
```