Open VladislavDuma opened 1 month ago
Hi @VladislavDuma, Qwen2 is not officially supported yet; we are still working on the verification. BTW, is "Output without quantization (FP16)" the result of TRT-LLM or HF? We need to make sure the sampling configs of FP16 and FP8 are the same.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.
System Info
Library:
Reproduction
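Work in the TensorRT-LLM/examples/qwen folder with the Qwen2-7B-Instruct model.
In config.json, add these fields:
Then use trtllm-build with the quantization checkpoint.
# Import path depends on the TensorRT-LLM version; older releases expose these under tensorrt_llm.hlapi.
from tensorrt_llm import LLM, SamplingParams

# path_to_engine and path_to_model are defined by the reporter (values not shown).
model_id = f'{path_to_engine}/Qwen2-7B-Instruct_1gpu_fp8_engine/'
tokenizer_id = f'{path_to_model}/Qwen2-7B-Instruct'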
llm = LLM(model=model_id, tokenizer=tokenizer_id)
prompts = [
    f'### INPUT:\nThe GeForce RTX 4090 is an enthusiast-class graphics card by NVIDIA, launched on September 20th, 2022.'
    f' Built on the 5 nm process, and based on the AD102 graphics processor, in its AD102-300-A1 variant, the card '
    f'supports DirectX 12 Ultimate. This ensures that all modern games will run on GeForce RTX 4090. Additionally, the '
    f'DirectX 12 Ultimate capability guarantees support for hardware-raytracing, variable-rate shading and more, in '
    f'upcoming video games. The AD102 graphics processor is a large chip with a die area of 609 mm² and 76,300 million '
    f'transistors. Unlike the fully unlocked TITAN Ada, which uses the same GPU but has all 18432 shaders enabled, '
    f'NVIDIA has disabled some shading units on the GeForce RTX 4090 to reach the product\'s target shader count. '
    f'It features 16384 shading units, 512 texture mapping units, and 176 ROPs. Also included are 512 tensor cores which'
    f' help improve the speed of machine learning applications. The card also has 128 raytracing acceleration cores. '
    f'NVIDIA has paired 24 GB GDDR6X memory with the GeForce RTX 4090, which are connected using a 384-bit memory '
    f'interface. The GPU is operating at a frequency of 2235 MHz, which can be boosted up to 2520 MHz, memory is '
    f'running at 1313 MHz (21 Gbps effective).\n\n### INSTRUCTIONS:\nWhat is a RTX 4090?\n\n### OUTPUT:\n'
]
sampling_params = SamplingParams(
    temperature=0,
    max_new_tokens=128,
    top_k=20,
    top_p=0.5,
    repetition_penalty=1.1,
)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f'{prompt}')
    print(f'{generated_text}')
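On the question of whether the FP16 and FP8 runs share the same sampling config, one possible way to check is to define the sampling arguments once and pass the same values to both runs, then list the prompts whose outputs differ. The sketch below is only an illustration: `compare_runs`, `GenerateFn`, and `SAMPLING_KWARGS` are hypothetical names, and the two callables stand in for whatever actually drives each engine.

```python
from typing import Callable, Dict, List, Tuple

# Sampling settings copied from the reproduction script above; shared by both runs.
SAMPLING_KWARGS: Dict[str, float] = {
    "temperature": 0,
    "max_new_tokens": 128,
    "top_k": 20,
    "top_p": 0.5,
    "repetition_penalty": 1.1,
}

# Hypothetical signature: takes prompts and sampling kwargs, returns one text per prompt.
GenerateFn = Callable[[List[str], Dict[str, float]], List[str]]


def compare_runs(
    run_a: GenerateFn, run_b: GenerateFn, prompts: List[str]
) -> List[Tuple[str, str, str]]:
    """Return (prompt, text_a, text_b) for every prompt whose outputs differ."""
    # Each run gets its own copy of the same settings, so neither can mutate the other's.
    texts_a = run_a(prompts, dict(SAMPLING_KWARGS))
    texts_b = run_b(prompts, dict(SAMPLING_KWARGS))
    return [(p, a, b) for p, a, b in zip(prompts, texts_a, texts_b) if a != b]
```

With the script above, each callable could wrap `llm.generate(prompts, SamplingParams(**kwargs))` for its engine and collect `output.outputs[0].text` from every result.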