dusty-nv / jetson-containers

Machine Learning Containers for NVIDIA Jetson and JetPack-L4T

agx xavier dustynv/local_llm:r35.3.1 error #365

Open UserName-wang opened 7 months ago

UserName-wang commented 7 months ago

Hardware information: AGX Xavier, L4T 35.4.0, JetPack 5.1.2; I cloned the R35.4.1 branch of jetson-containers.

I tried the commands below in the dustynv/local_llm:r35.3.1 container and got this error:

```
root@agx-xavier:/data/models/mlc/dist/models# python3 -m local_llm --api=mlc --model=Llama-2-7b-chat-hf
/usr/local/lib/python3.8/dist-packages/transformers/utils/hub.py:123: FutureWarning: Using TRANSFORMERS_CACHE is deprecated and will be removed in v5 of Transformers. Use HF_HOME instead.
  warnings.warn(
12:35:57 | INFO | loading Llama-2-7b-chat-hf with MLC
12:35:57 | INFO | running MLC quantization:

python3 -m mlc_llm.build --model /data/models/mlc/dist/models/Llama-2-7b-chat-hf --quantization q4f16_ft --target cuda --use-cuda-graph --use-flash-attn-mqa --sep-embed --max-seq-len 4096 --artifact-path /data/models/mlc/dist

Using path "/data/models/mlc/dist/models/Llama-2-7b-chat-hf" for model "Llama-2-7b-chat-hf"
Target configured: cuda -keys=cuda,gpu -arch=sm_72 -max_num_threads=1024 -max_shared_memory_per_block=49152 -max_threads_per_block=1024 -registers_per_block=65536 -thread_warp_size=32
Automatically using target for weight quantization: cuda -keys=cuda,gpu -arch=sm_72 -max_num_threads=1024 -max_shared_memory_per_block=49152 -max_threads_per_block=1024 -registers_per_block=65536 -thread_warp_size=32
Start computing and quantizing weights... This may take a while.
Finish computing and quantizing weights.
Total param size: 3.1569595336914062 GB
Start storing to cache /data/models/mlc/dist/Llama-2-7b-chat-hf-q4f16_ft/params
[0327/0327] saving param_326
All finished, 99 total shards committed, record saved to /data/models/mlc/dist/Llama-2-7b-chat-hf-q4f16_ft/params/ndarray-cache.json
Finish exporting chat config to /data/models/mlc/dist/Llama-2-7b-chat-hf-q4f16_ft/params/mlc-chat-config.json
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.8/dist-packages/mlc_llm/build.py", line 47, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/mlc_llm/build.py", line 43, in main
    core.build_model_from_args(parsed_args)
  File "/usr/local/lib/python3.8/dist-packages/mlc_llm/core.py", line 921, in build_model_from_args
    mod = mod_transform_before_build(mod, param_manager, args, model_config)
  File "/usr/local/lib/python3.8/dist-packages/mlc_llm/core.py", line 648, in mod_transform_before_build
    mod = tvm.transform.Sequential(
  File "/usr/local/lib/python3.8/dist-packages/tvm/ir/transform.py", line 238, in __call__
    return _ffi_transform_api.RunPass(self, mod)
  File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.__call__
  File "tvm/_ffi/_cython/./packed_func.pxi", line 263, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./packed_func.pxi", line 252, in tvm._ffi._cy3.core.FuncCall3
  File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
  File "/usr/local/lib/python3.8/dist-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
  File "tvm/_ffi/_cython/./packed_func.pxi", line 56, in tvm._ffi._cy3.core.tvm_callback
  File "/usr/local/lib/python3.8/dist-packages/tvm/contrib/cutlass/build.py", line 1008, in profile_relax_function
    conv2d_profiler = CutlassConv2DProfiler(sm, _get_cutlass_path(), tmp_dir)
  File "/usr/local/lib/python3.8/dist-packages/tvm/contrib/cutlass/gen_conv2d.py", line 186, in __init__
    self.gemm_profiler = CutlassGemmProfiler(sm, cutlass_path, binary_path)
  File "/usr/local/lib/python3.8/dist-packages/tvm/contrib/cutlass/gen_gemm.py", line 197, in __init__
    assert sm in GENERATOR_FUNC_TABLE and sm in DEFAULT_KERNELS, f"sm{sm} not supported yet."
AssertionError: sm72 not supported yet.
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/local_llm/local_llm/__main__.py", line 22, in <module>
    model = LocalLM.from_pretrained(
  File "/opt/local_llm/local_llm/local_llm.py", line 72, in from_pretrained
    model = MLCModel(model_path, **kwargs)
  File "/opt/local_llm/local_llm/models/mlc.py", line 50, in __init__
    quant = MLCModel.quantize(model_path, quant, **kwargs)
  File "/opt/local_llm/local_llm/models/mlc.py", line 163, in quantize
    subprocess.run(cmd, executable='/bin/bash', shell=True, check=True)
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'python3 -m mlc_llm.build --model /data/models/mlc/dist/models/Llama-2-7b-chat-hf --quantization q4f16_ft --target cuda --use-cuda-graph --use-flash-attn-mqa --sep-embed --max-seq-len 4096 --artifact-path /data/models/mlc/dist' returned non-zero exit status 1.
root@agx-xavier:/data/models/mlc/dist/models# nvidia-smi
bash: nvidia-smi: command not found
root@agx-xavier:/data/models/mlc/dist/models#
```

dusty-nv commented 7 months ago

AssertionError: sm72 not supported yet.

@UserName-wang I believe MLC only supports SM80 and Orin due to the kernel optimizations used
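A quick way to confirm what a given Jetson reports (a minimal sketch, assuming a CUDA-enabled PyTorch is available inside the container):

```python
# Minimal compute-capability check, assuming a CUDA-enabled PyTorch is installed.
# AGX Xavier reports (7, 2) -> sm_72, while AGX Orin reports (8, 7) -> sm_87,
# which is why the q4f16_ft CUTLASS path works on Orin but not on Xavier.
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"sm_{major}{minor}")
```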

UserName-wang commented 6 months ago

AssertionError: sm72 not supported yet.

@UserName-wang I believe MLC only supports SM80 and Orin due to the kernel optimizations used

@dusty-nv, thank you for your reply. Do you have any suggestions for users who have to run LLM applications on the AGX Xavier?

dusty-nv commented 6 months ago

@UserName-wang on Xavier I would use the llama.cpp container instead; it gets the 2nd-best performance and supports quantization.
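For example, a minimal sketch of running a quantized GGUF model through the llama-cpp-python bindings (the model path and parameter values below are illustrative assumptions, not taken from the container docs):

```python
# Minimal llama-cpp-python sketch; the model path and settings are assumptions.
from llama_cpp import Llama

llm = Llama(
    model_path="/data/models/llama-2-7b-chat.Q4_K_M.gguf",  # hypothetical GGUF file
    n_gpu_layers=-1,  # offload all layers to the Xavier's integrated GPU
    n_ctx=2048,
)

out = llm("Q: What is the capital of France? A:", max_tokens=32)
print(out["choices"][0]["text"])
```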