NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Conflict between bf16-mixed Precision Setting and MegatronHalfPrecisionPlugin in MegatronGPT Training #9429

Open · moutasemalakkad opened this issue 4 weeks ago

moutasemalakkad commented 4 weeks ago


Describe the bug

When attempting to continue training the MegatronGPT model, I encountered a conflict between precision=bf16-mixed and the MegatronHalfPrecisionPlugin. This results in a ValueError indicating that both precision=bf16-mixed and the MegatronHalfPrecisionPlugin were received and only one should be chosen.
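
This appears to be the generic Lightning check firing rather than anything NeMo-specific. A minimal sketch, assuming the Lightning 2.x API and using the stock MixedPrecision plugin as a stand-in for MegatronHalfPrecisionPlugin:

from pytorch_lightning import Trainer
from pytorch_lightning.plugins.precision import MixedPrecision

# Passing BOTH an explicit precision flag and a precision plugin trips
# _check_config_and_set_final_flags, exactly as in the traceback below.
plugin = MixedPrecision(precision="bf16-mixed", device="cuda")
trainer = Trainer(precision="bf16-mixed", plugins=[plugin])
# ValueError: Received both `precision=bf16-mixed` and `plugins=...`. Choose one.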

Steps/Code to reproduce bug

  1. Set up the environment as described below.
  2. Use the following configuration and code snippet to initiate training.

Configuration:

DATA='{train:[1.0,training_data_indexed/train_text_document], validation:[training_data_indexed/val_text_document], test:[training_data_indexed/test_text_document]}'

!torchrun --nproc_per_node=1 /opt/NeMo/examples/nlp/language_modeling/megatron_gpt_continue_training.py \
    model.data.data_prefix="$DATA" \
    name=megatron_gpt_ \
    exp_manager.name=megatron_gpt_1 \
    restore_from_path='/workspace/new_nemo_out/new_megatron_gpt_model.nemo' \
    trainer.devices=1 \
    trainer.num_nodes=1 \
    trainer.precision=16-mixed \
    trainer.val_check_interval=300 \
    trainer.max_steps=1200 \
    model.megatron_amp_O2=False \
    model.tensor_model_parallel_size=1 \
    model.pipeline_model_parallel_size=1 \
    model.micro_batch_size=1 \
    model.global_batch_size=1 \
    ++model.use_flash_attention=False \
    ++model.seq_len_interpolation_factor=null

Error Message:

[NeMo W 2024-06-09 00:24:29 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/_graveyard/precision.py:49: The `MixedPrecisionPlugin` is deprecated. Use `pytorch_lightning.plugins.precision.MixedPrecision` instead.

[NeMo W 2024-06-09 00:24:29 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/lightning_fabric/connector.py:563: `precision=bf16` is supported for historical reasons but its usage is discouraged. Please set your precision to bf16-mixed instead!

Error executing job with overrides: []
Traceback (most recent call last):
  File "/opt/NeMo/examples/nlp/language_modeling/megatron_gpt_continue_training.py", line 167, in main
    trainer = Trainer(plugins=plugins, strategy=strategy, **cfg.trainer, callbacks=callbacks)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/utilities/argparse.py", line 70, in insert_env_defaults
    return fn(self, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 401, in __init__
    self._accelerator_connector = _AcceleratorConnector(
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 134, in __init__
    self._check_config_and_set_final_flags(
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 271, in _check_config_and_set_final_flags
    raise ValueError(
ValueError: Received both `precision=bf16-mixed` and `plugins=<nemo.collections.nlp.parts.nlp_overrides.MegatronHalfPrecisionPlugin object at 0x7fed4a8568c0>`. Choose one.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Expected behavior

The training should proceed without conflicts between precision settings and plugins.


analogtechnica commented 3 weeks ago

@moutasemalakkad

Hi, I've just faced the same issue.

Here is my solution: insert cfg.trainer.precision = None just above this line (sketched below): https://github.com/NVIDIA/NeMo/blob/ebba8b14263ca513c4453fcde0472785c19f46c1/examples/nlp/language_modeling/megatron_gpt_continue_training.py#L167
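
A sketch of that change; the Trainer call is the existing line from the traceback above, and only the assignment is new:

cfg.trainer.precision = None  # clear the flag so only MegatronHalfPrecisionPlugin configures precision
trainer = Trainer(plugins=plugins, strategy=strategy, **cfg.trainer, callbacks=callbacks)

With the flag cleared, **cfg.trainer no longer passes a precision value alongside the plugin, so Lightning's duplicate-configuration check is not triggered.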

This solution was inspired by this PR: https://github.com/NVIDIA/NeMo/pull/8908/

It should solve the conflict.

moutasemalakkad commented 3 weeks ago

Thanks! That did not work for me either; the workaround was to set plugins to an empty list:

trainer = Trainer(plugins=[], strategy=strategy, **cfg.trainer, callbacks=callbacks)
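
Presumably this works because, with an empty plugins list, Lightning builds its precision handling from trainer.precision alone, so the duplicate-configuration check never fires. Note that it also means MegatronHalfPrecisionPlugin is skipped entirely, which may or may not be acceptable depending on the rest of the config.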