dusty-nv / jetson-containers

Machine Learning Containers for NVIDIA Jetson and JetPack-L4T
MIT License

Llama-3-8B inference error: InternalError: Check failed: (offset + needed_size <= this->buffer.size) is false: storage allocation failure, attempted to allocate 513024 at offset 0 in region that is 163840 bytes #558

Open · opened 5 months ago by mylinfh

mylinfh commented 5 months ago

🐛 Bug

I use the MLC container from jetson-containers with the Meta-Llama-3-8B-Instruct model. After I run

```
python3 -m mlc_llm.build \
    --model Meta-Llama-3-8B-Instruct-hf \
    --quantization q4f16_ft \
    --target cuda \
    --use-cuda-graph \
    --use-flash-attn-mqa \
    --sep-embed \
    --max-seq-len 8192 \
    --artifact-path /data/models/mlc/dist \
    --use-safetensors
```

the quantization completes without reporting any errors, but I get an error when I run inference:

```
python3 /opt/mlc-llm/benchmark.py \
    --model /data/models/mlc/dist/Meta-Llama-3-8B-Instruct-hf-ctx8192/Meta-Llama-3-8B-Instruct-hf-q4f16_ft/params \
    --prompt "Can you tell me a joke about llamas?" \
    --max-new-tokens 128
```


The following error occurred:
```
Namespace(chat=False, max_new_tokens=128, max_num_prompts=None, model='/data/models/mlc/dist/Meta-Llama-3-8B-Instruct-hf-q4f16_ft/params', model_lib_path=None, prompt=['Can you tell me a joke about llamas?'], save='', streaming=False)
-- loading /data/models/mlc/dist/Meta-Llama-3-8B-Instruct-hf-q4f16_ft/params

PROMPT:  Can you tell me a joke about llamas?

Traceback (most recent call last):
  File "/opt/mlc-llm/benchmark.py", line 135, in <module>
    print(cm.benchmark_generate(prompt=prompt, generate_length=args.max_new_tokens).strip())
  File "/usr/local/lib/python3.8/dist-packages/mlc_chat/chat_module.py", line 910, in benchmark_generate
    self._prefill(prompt)
  File "/usr/local/lib/python3.8/dist-packages/mlc_chat/chat_module.py", line 997, in _prefill
    self._prefill_func(
  File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.__call__
  File "tvm/_ffi/_cython/./packed_func.pxi", line 277, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
  File "/usr/local/lib/python3.8/dist-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
tvm.error.InternalError: Traceback (most recent call last):
  [bt] (8) /usr/local/lib/python3.8/dist-packages/tvm/libtvm.so(tvm::runtime::relax_vm::VirtualMachineImpl::InvokeBytecode(long, std::vector<tvm::runtime::TVMRetValue, std::allocator<tvm::runtime::TVMRetValue> > const&)+0x230) [0xffff6c51f6c8]
  [bt] (7) /usr/local/lib/python3.8/dist-packages/tvm/libtvm.so(tvm::runtime::relax_vm::VirtualMachineImpl::RunLoop()+0x210) [0xffff6c51dd58]
  [bt] (6) /usr/local/lib/python3.8/dist-packages/tvm/libtvm.so(tvm::runtime::relax_vm::VirtualMachineImpl::RunInstrCall(tvm::runtime::relax_vm::VMFrame*, tvm::runtime::relax_vm::Instruction)+0x5e4) [0xffff6c51e5bc]
  [bt] (5) /usr/local/lib/python3.8/dist-packages/tvm/libtvm.so(tvm::runtime::relax_vm::VirtualMachineImpl::InvokeClosurePacked(tvm::runtime::ObjectRef const&, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)+0x7c) [0xffff6c51c9fc]
  [bt] (4) /usr/local/lib/python3.8/dist-packages/tvm/libtvm.so(tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<tvm::runtime::NDArray (tvm::runtime::memory::Storage, long, tvm::runtime::ShapeTuple, DLDataType)>::AssignTypedLambda<tvm::runtime::Registry::set_body_method<tvm::runtime::memory::Storage, tvm::runtime::memory::StorageObj, tvm::runtime::NDArray, long, tvm::runtime::ShapeTuple, DLDataType, void>(tvm::runtime::NDArray (tvm::runtime::memory::StorageObj::*)(long, tvm::runtime::ShapeTuple, DLDataType))::{lambda(tvm::runtime::memory::Storage, long, tvm::runtime::ShapeTuple, DLDataType)#1}>(tvm::runtime::Registry::set_body_method<tvm::runtime::memory::Storage, tvm::runtime::memory::StorageObj, tvm::runtime::NDArray, long, tvm::runtime::ShapeTuple, DLDataType, void>(tvm::runtime::NDArray (tvm::runtime::memory::StorageObj::*)(long, tvm::runtime::ShapeTuple, DLDataType))::{lambda(tvm::runtime::memory::Storage, long, tvm::runtime::ShapeTuple, DLDataType)#1}, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tvm::runtime::TVMRetValue)+0x10) [0xffff6c4ea638]
  [bt] (3) /usr/local/lib/python3.8/dist-packages/tvm/libtvm.so(tvm::runtime::TypedPackedFunc<tvm::runtime::NDArray (tvm::runtime::memory::Storage, long, tvm::runtime::ShapeTuple, DLDataType)>::AssignTypedLambda<tvm::runtime::Registry::set_body_method<tvm::runtime::memory::Storage, tvm::runtime::memory::StorageObj, tvm::runtime::NDArray, long, tvm::runtime::ShapeTuple, DLDataType, void>(tvm::runtime::NDArray (tvm::runtime::memory::StorageObj::*)(long, tvm::runtime::ShapeTuple, DLDataType))::{lambda(tvm::runtime::memory::Storage, long, tvm::runtime::ShapeTuple, DLDataType)#1}>(tvm::runtime::Registry::set_body_method<tvm::runtime::memory::Storage, tvm::runtime::memory::StorageObj, tvm::runtime::NDArray, long, tvm::runtime::ShapeTuple, DLDataType, void>(tvm::runtime::NDArray (tvm::runtime::memory::StorageObj::*)(long, tvm::runtime::ShapeTuple, DLDataType))::{lambda(tvm::runtime::memory::Storage, long, tvm::runtime::ShapeTuple, DLDataType)#1}, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}::operator()(tvm::runtime::TVMArgs const, tvm::runtime::TVMRetValue) const+0x27c) [0xffff6c4ea374]
  [bt] (2) /usr/local/lib/python3.8/dist-packages/tvm/libtvm.so(tvm::runtime::memory::StorageObj::AllocNDArray(long, tvm::runtime::ShapeTuple, DLDataType)+0x3a8) [0xffff6c4998c8]
  [bt] (1) /usr/local/lib/python3.8/dist-packages/tvm/libtvm.so(tvm::runtime::detail::LogFatal::Entry::Finalize()+0x78) [0xffff6a0edf58]
  [bt] (0) /usr/local/lib/python3.8/dist-packages/tvm/libtvm.so(tvm::runtime::Backtrace[abi:cxx11]()+0x30) [0xffff6c4966f0]
  File "/opt/mlc-llm/3rdparty/tvm/src/runtime/memory/memory_manager.cc", line 108
InternalError: Check failed: (offset + needed_size <= this->buffer.size) is false: storage allocation failure, attempted to allocate 513024 at offset 0 in region that is 163840bytes
```

I tried different --max-seq-len values, but it returns the same error. It's worth mentioning that when I quantize Meta-Llama-2-7b and then run inference, there are no errors; however, when using Meta-Llama-3-8B or Meta-Llama-3-8B-Instruct, the same error as above occurs.
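
One thing I noticed, though this is just my speculation: the failed allocation of 513024 bytes is exactly 128256 × 4, and 128256 is Llama-3's vocabulary size (Llama-2's is 32000), so a vocab-sized buffer that only overflows for Llama-3 would fit the symptoms:

```
# speculative arithmetic only; 128256 / 32000 are the Llama-3 / Llama-2 vocab sizes
echo $((128256 * 4))   # 513024 -> the size of the failed allocation
echo $((32000 * 4))    # 128000 -> would fit inside the 163840-byte region
```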

What should I do?
Thanks.

## Environment

- Platform: Jetson Orin

dusty-nv commented 5 months ago

@mylinfh I see you are on JetPack 5, presumably... I recall Llama-3 needing an updated version of MLC; however, newer versions stopped building on JetPack 5. So unfortunately, using it with MLC requires upgrading to JetPack 6, or you can run it through another LLM backend like llama.cpp/ollama.
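
For reference, the ollama route looks something like this (a sketch, assuming the ollama container builds for your JetPack version; autotag is the jetson-containers helper that resolves a compatible image, and llama3 is ollama's registry name for Llama-3-8B):

```
# launch the ollama container with an image matched to your L4T version
jetson-containers run $(autotag ollama)

# inside the container: pull Llama-3-8B from the ollama registry and chat
ollama run llama3
```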

mylinfh commented 5 months ago

Okay, thank you, I'll try again. llama.cpp/ollama can be used, but the inference time seems to be longer.

dusty-nv commented 5 months ago

@mylinfh if you try running this through nano_llm with --api=mlc, it may still work on JetPack 5

I discovered that Llama-3 requires a newer MLC when used standalone (my mlc:0.1.1 container version, which doesn't build for JP5), but through NanoLLM it works with the older mlc:0.1.0 version, because I use the --sep-embed flag in NanoLLM when building the LLM model (which runs the embedding layer separately, and this error appears to be inside the embedding layer)
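
Something along these lines (a sketch based on the NanoLLM docs; the exact model name and flags here are assumptions, so adjust for your setup):

```
# launch the NanoLLM container and chat through the MLC backend;
# the model is downloaded and quantized on first run
jetson-containers run $(autotag nano_llm) \
  python3 -m nano_llm.chat --api=mlc \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --prompt "Can you tell me a joke about llamas?"
```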

mylinfh commented 5 months ago

Mmm, yes, thanks for your reply. I can run Llama-3 using NanoLLM. I also tried running Llama-2-7B inference through both NanoLLM and MLC; both were fast, but MLC seems to be faster. So I wanted to see whether Llama-3 performs the same on MLC.