NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

[BUG] Circular Dependency in transformer_engine and core.utils #1224

Open paulaserna16 opened 6 days ago

paulaserna16 commented 6 days ago

I'm trying to run the Dreambooth tutorial, but I'm encountering some issues in the modules.

First, the megatron-lm version that gets installed when launching a container with NeMo Framework 24.07 doesn't include the extensions module that wraps Transformer Engine.
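For reference, here's a quick sanity check (just a sketch; run it in the container's Python) to confirm whether the installed megatron-core actually ships that module:

```python
# Check whether megatron.core.extensions.transformer_engine exists in the
# installed megatron-core. find_spec raises ModuleNotFoundError when a
# parent package (e.g. megatron.core.extensions) is missing entirely.
import importlib.util

try:
    spec = importlib.util.find_spec("megatron.core.extensions.transformer_engine")
except ModuleNotFoundError:
    spec = None

print("extensions module present:", spec is not None)
```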

Then I tried to fix it manually and copied the extensions folder into the megatron/core path. However, if I try to execute the dreambooth.py example:

```bash
! python /opt/NeMo/examples/multimodal/text_to_image/dreambooth/dreambooth.py \
    model.unet_config.from_pretrained=/ckpts/unet.bin \
    model.unet_config.from_NeMo=False \
    model.first_stage_config.from_pretrained=/ckpts/vae.bin \
    model.data.instance_dir=/datasets/instance_dir \
    model.data.instance_prompt='a photo of a sks dog'
```

I get the following error:

```
ImportError                               Traceback (most recent call last)
Cell In[13], line 4
      2 from megatron.core.distributed import DistributedDataParallel as McoreDDP
      3 from megatron.core.distributed import DistributedDataParallelConfig
----> 4 from megatron.core.extensions.transformer_engine import (
      5     TEColumnParallelLinear,
      6     TEDotProductAttention,
      7     TELayerNormColumnParallelLinear,
      8     TENorm,
      9     TERowParallelLinear,
     10 )
     11 from megatron.core.fusions.fused_bias_dropout import get_bias_dropout_add
     12 from megatron.core.models.gpt import GPTModel as MCoreGPTModel

File /opt/megatron-lm/megatron/core/extensions/transformer_engine.py:34
     32 from megatron.core.transformer.transformer_config import TransformerConfig
     33 from megatron.core.transformer.utils import make_sharded_tensors_for_checkpoint
---> 34 from megatron.core.utils import get_te_version, is_te_min_version
     37 def _get_extra_te_kwargs(config: TransformerConfig):
     38     extra_transformer_engine_kwargs = {"params_dtype": config.params_dtype}

ImportError: cannot import name 'get_te_version' from 'megatron.core.utils' (/opt/megatron-lm/megatron/core/utils.py)
```

After checking the files, I see that at the failing line, transformer_engine.py is importing the function from utils (`from megatron.core.utils import get_te_version, is_te_min_version`). And in fact, I checked the megatron.core utils.py file, and it is calling transformer engine from within that very function:

```python
# Quoted from megatron/core/utils.py; the names used below (version,
# PkgVersion, _te_version) are defined near the top of that file:
#   from importlib.metadata import version
#   from packaging.version import Version as PkgVersion
#   _te_version = None

def get_te_version():
    """Get TE version from __version__; if not available use pip's. Use caching."""

    def get_te_version_str():
        import transformer_engine as te

        if hasattr(te, '__version__'):
            return str(te.__version__)
        else:
            return version("transformer-engine")

    global _te_version
    if _te_version is None:
        _te_version = PkgVersion(get_te_version_str())
    return _te_version
```
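For what it's worth, the ImportError itself says the installed utils.py has no get_te_version at all. A quick diagnostic sketch, assuming it runs in the same container/kernel, to see which utils.py Python actually loads and whether it defines the function:

```python
# Show which copy of megatron/core/utils.py is being imported and
# whether it defines get_te_version; if it doesn't, the installed
# megatron-core is likely an older version than the one on GitHub.
import megatron.core.utils as mcu

print("loaded from:", mcu.__file__)
print("has get_te_version:", hasattr(mcu, "get_te_version"))
```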

I would appreciate your help.

Thanks!

shanmugamr1992 commented 12 hours ago

Hi, sorry about that. Could you pull the Megatron core branch (maybe core_r0.9.0 or the latest one) and mount it to /opt/megatron-lm, so that you replace the entire megatron-core rather than manually adding the extensions folder alone?
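A sketch of what that could look like; the clone location and the nemo:24.07 image tag here are assumptions, not something stated in the thread:

```bash
# Sketch only: the clone path and image tag below are assumptions.
# The key part is the -v mount that replaces /opt/megatron-lm wholesale.
git clone --branch core_r0.9.0 https://github.com/NVIDIA/Megatron-LM.git

docker run --gpus all -it --rm \
    -v "$PWD/Megatron-LM:/opt/megatron-lm" \
    nvcr.io/nvidia/nemo:24.07 bash
```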