Closed yaysummeriscoming closed 2 months ago
This is a known issue in 0.10.0; please try the latest main branch or https://pypi.org/project/tensorrt-llm/0.11.0.dev2024061100/.
That did the trick, thank you!
I'm afraid it wasn't fixed for weight-only quantization:

```shell
git clone https://huggingface.co/Qwen/Qwen2-1.5B-Instruct ./tmp/Qwen2/1.5B
CUDA_VISIBLE_DEVICES=$GPU_ID python3 convert_checkpoint.py \
    --qwen_type qwen2 \
    --model_dir ./tmp/Qwen2/1.5B \
    --dtype float16 \
    --output_dir ./tmp/Qwen2/1.5B/converted \
    --use_weight_only \
    --weight_only_precision int8
```
This gives:

```
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/convert.py", line 1128, in from_hugging_face
    qwen.load(weights)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 438, in load
    raise RuntimeError(
RuntimeError: Required but not provided tensors:{'transformer.vocab_embedding.per_token_scale'}
```
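For what it's worth, the error above comes from a simple bookkeeping check: the model declares the tensor names it needs, and `load()` diffs them against what the converted checkpoint actually provides. A rough sketch of that check (the tensor sets are illustrative, not the real internals of `modeling_utils.load`):

```python
# Illustrative sketch of the missing-tensor check behind the RuntimeError:
# the model's expected tensor names minus the checkpoint's provided names.
expected = {
    "transformer.vocab_embedding.weight",
    "transformer.vocab_embedding.per_token_scale",
}
provided = {
    "transformer.vocab_embedding.weight",  # int8 weight-only conversion emitted no scale tensor
}
missing = expected - provided
print(f"Required but not provided tensors:{missing}")
```

So the conversion step is simply not writing `transformer.vocab_embedding.per_token_scale` into the checkpoint, and the loader notices.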
@nv-guomingz following up on this?
Hi @yaysummeriscoming, thanks for your patience.
Based on my local testing, the latest code base has fixed this issue. We'll have a weekly update soon; please try it.
That did the trick, thank you!
When I use W8A8 with the 1.5B model on the release branch v0.11.0, an error occurs:

```
  quantize(args.dtype,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/convert.py", line 1379, in quantize
    safetensors.torch.save_file(
File "/usr/local/lib/python3.10/dist-packages/safetensors/torch.py", line 284, in save_file
    serialize_file(_flatten(tensors), filename, metadata=metadata)
File "/usr/local/lib/python3.10/dist-packages/safetensors/torch.py", line 480, in _flatten
    raise RuntimeError(
RuntimeError:
Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: [{'transformer.vocab_embedding.weight', 'lm_head.weight'}].
A potential way to correctly save your model is to use save_model.
More information at https://huggingface.co/docs/safetensors/torch_shared_tensors
```

The model weights cannot be saved. Has anyone encountered this situation?
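For anyone hitting this: Qwen2-1.5B ties `lm_head.weight` to the vocab embedding (they alias one storage), and safetensors' `save_file` refuses to serialize aliased tensors. A small sketch of the situation and a common workaround, assuming `torch` is available; the tensor names just mirror the error message:

```python
import torch

# Stand-in for the converted weight dict; Qwen2-1.5B ties the LM head
# to the embedding table, so both names alias the same storage.
weights = {"transformer.vocab_embedding.weight": torch.zeros(8, 4)}
weights["lm_head.weight"] = weights["transformer.vocab_embedding.weight"]

# This aliasing is exactly what safetensors.torch.save_file rejects:
aliased = (weights["lm_head.weight"].data_ptr()
           == weights["transformer.vocab_embedding.weight"].data_ptr())
print("shared storage before clone:", aliased)

# Common workaround before saving: break the tie with an explicit copy.
weights["lm_head.weight"] = weights["lm_head.weight"].clone()
aliased_after = (weights["lm_head.weight"].data_ptr()
                 == weights["transformer.vocab_embedding.weight"].data_ptr())
print("shared storage after clone:", aliased_after)
```

The error message's own suggestion, `safetensors.torch.save_model`, also handles tied tensors, but it saves a model object rather than a plain weight dict, so cloning is the more direct fix inside a conversion script.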
System Info
Running TensorRT-LLM 0.10 in a container built from nvidia/cuda:12.4.0-devel-ubuntu22.04
Who can help?
@byshiue
Expected behavior
Conversion works
Actual behavior
Qwen2 7B conversion works for me, but 1.5B conversion is broken.
Additional notes
Side question: there's no pre-built TensorRT-LLM development container, right? We need to build it ourselves?