This PR contains reference implementations for Llama 3 ChatQA at both the 8B and 70B size.
These models use the basic interface which assumes you're passing full documents or have pre-processed any retrieval steps for maximum compatibility. There is another interface style where you can use a built-in chunking model before applying context, but I'm assuming the straightforward implementation is better for testing and integrating into existing systems.
Querying the LLM uses the following data structure:
{
"messages": [
{"role": "user", "content": "what is the percentage change of the net income from Q4 FY23 to Q4 FY24?"}
],
"context": "NVIDIA (NASDAQ: NVDA) today reported revenue for the fourth quarter ended January 28, 2024, of $22.1 billion, up 22% from the previous quarter and up 265% from a year ago.\nFor the quarter, GAAP earnings per diluted share was $4.93, up 33% from the previous quarter and up 765% from a year ago. Non-GAAP earnings per diluted share was $5.16, up 28% from the previous quarter and up 486% from a year ago.\nQ4 Fiscal 2024 Summary\nGAAP\n| $ in millions, except earnings per share | Q4 FY24 | Q3 FY24 | Q4 FY23 | Q/Q | Y/Y |\n| Revenue | $22,103 | $18,120 | $6,051 | Up 22% | Up 265% |\n| Gross margin | 76.0% | 74.0% | 63.3% | Up 2.0 pts | Up 12.7 pts |\n| Operating expenses | $3,176 | $2,983 | $2,576 | Up 6% | Up 23% |\n| Operating income | $13,615 | $10,417 | $1,257 | Up 31% | Up 983% |\n| Net income | $12,285 | $9,243 | $1,414 | Up 33% | Up 769% |\n| Diluted earnings per share | $4.93 | $3.71 | $0.57 | Up 33% | Up 765% |"
}
Expected response:
<|begin_of_text|>NVIDIA reported a net income of $12,285 million for Q4 fiscal 2024, while the net income for Q4 fiscal 2023 was $1,414 million. The percentage change is calculated using the formula ((12285 - 1414) / 1414 * 100), which results in a 769% increase.<|end_of_text|>
Performance notes:
These models are running on A100 GPUs. You can change the hardware to H100 for better performance if desired. This is not an optimized implementation with VLLM or TensorRT-LLM, so higher TPS and throughput is likely possible for a production implementation. But inference speed is quite usable as-is.
This PR contains reference implementations for Llama 3 ChatQA at both the 8B and 70B size.
These models use the basic interface which assumes you're passing full documents or have pre-processed any retrieval steps for maximum compatibility. There is another interface style where you can use a built-in chunking model before applying context, but I'm assuming the straightforward implementation is better for testing and integrating into existing systems.
Querying the LLM uses the following data structure:
Expected response:
Performance notes:
These models are running on A100 GPUs. You can change the hardware to H100 for better performance if desired. This is not an optimized implementation with VLLM or TensorRT-LLM, so higher TPS and throughput is likely possible for a production implementation. But inference speed is quite usable as-is.