NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Llama3-8b FP8 PTQ OOM #9981

Closed JeevanBhoot closed 5 days ago

JeevanBhoot commented 1 month ago

Describe the bug

Running FP8 PTQ of Llama3-8b on 1x 4090 (24 GB) goes OOM -- is this expected? vLLM FP8 quantization works on the same GPU. What are the minimum requirements to run this quantization?

I have even tried setting batch size to 1 and it still goes OOM.
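
For comparison, the vLLM FP8 path referred to above would look roughly like the sketch below. This is a hedged illustration only: it assumes vLLM's on-the-fly quantization="fp8" mode, and max_model_len is a placeholder rather than a setting taken from this report.

from vllm import LLM, SamplingParams

# Assumed vLLM setup for on-the-fly FP8 weight quantization (not taken from the report)
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    quantization="fp8",   # quantize weights to FP8 at load time
    max_model_len=4096,   # keep the KV cache small enough for a 24 GB card
)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)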

Steps/Code to reproduce bug

python scripts/checkpoint_converters/convert_llama_hf_to_nemo.py \
    --input_name_or_path meta-llama/Meta-Llama-3-8B-Instruct \
    --output_path ./llama3_8b_instruct.nemo \
    --precision bf16

python examples/nlp/language_modeling/megatron_gpt_ptq.py \
    model.restore_from_path=llama3_8b_instruct.nemo \
    quantization.algorithm=fp8 \
    export.decoder_type=llama \
    export.save_path=llama3_8b_instruct_fp8 \
    export.inference_tensor_parallel=1 \
    trainer.num_nodes=1 \
    trainer.devices=1

Environment overview (please complete the following information)

janekl commented 1 month ago

Thanks for reporting this. This workload -- examples/nlp/language_modeling/megatron_gpt_ptq.py -- was tested on a 1x H100 80 GB GPU, but I agree that this looks excessive; we'll take a look. For completeness, could you perhaps share the full log for your issue?

Please also note that currently these memory requirements apply only to the calibration step (i.e. the script linked above). For the TensorRT-LLM engine, memory consumption is as expected: roughly 9 GB for the FP8 Llama3-8B model (you can create engines using tests/export/nemo_export.py, for example).
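
As a back-of-the-envelope check on that figure (an estimate, not a measurement from this thread), the FP8 weights alone account for most of the ~9 GB, with the KV cache and activations making up the rest:

# Rough estimate of FP8 Llama3-8B weight memory (approximate parameter count assumed)
params = 8.03e9          # Llama3-8B has roughly 8 billion parameters
fp8_bytes = params * 1   # FP8 stores one byte per weight
print(f"weights alone: ~{fp8_bytes / 2**30:.1f} GiB")  # ~7.5 GiB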

janekl commented 3 weeks ago

To successfully calibrate and export the Llama3-8b model on, for example, an L4 24 GB GPU, you can use:

python megatron_gpt_ptq.py \
    model.restore_from_path=Llama-8b \
    +model.dist_ckpt_load_on_device=False \
    +model.megatron_amp_O2=true \
    +model.precision=bf16 \
    trainer.precision=bf16 \
    quantization.algorithm=fp8 \
    export.dtype=bf16 \
    inference.batch_size=16

Explanations:

  1. The knobs below correctly set up the model for evaluation in bf16. Disabling dist_ckpt_load_on_device avoids memory spikes on model loading:
    +model.dist_ckpt_load_on_device=False \
    +model.megatron_amp_O2=true \
    +model.precision=bf16 \
    trainer.precision=bf16 \
  2. The export.dtype=bf16 parameter should match the model precision -- either 16 or bf16 -- to avoid a data cast on the export step, which may also cause OOM (a cast briefly holds two copies of the weights in memory).
  3. I had to use a slightly lower inference.batch_size=16 (a quick way to check the remaining headroom is sketched after this list).
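
A generic PyTorch check can confirm how much headroom these settings leave on a 24 GB card; this is a minimal sketch, not something megatron_gpt_ptq.py reports by itself:

import torch

# Peak device memory seen by the current process since startup (or the last reset);
# running this at the end of calibration shows how close the run came to the 24 GB limit.
peak_gib = torch.cuda.max_memory_allocated() / 2**30
print(f"peak GPU memory allocated: {peak_gib:.1f} GiB")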

This should do it. Let me know if you have any other issues/questions.