Open Lzhang-hub opened 4 months ago
This looks like an import error, probably from Flash Attention. Our import logic has an unfortunate side effect of suppressing error messages (see https://github.com/NVIDIA/TransformerEngine/pull/862#pullrequestreview-2072546018), so can you try replacing import transformer_engine with import transformer_engine.pytorch?
I'm having the same error. Replacing the import with import transformer_engine.pytorch changes the output, as shown in the traceback below. Can you give me any hint on how to solve this?
Traceback (most recent call last):
  File "/NeMo-Aligner/examples/nlp/gpt/train_gpt_sft.py", line 19, in <module>
    from nemo.collections.nlp.data.language_modeling.megatron.gpt_sft_chat_dataset import get_prompt_template_example
  File "/NeMo-Aligner/venv/lib/python3.10/site-packages/nemo/collections/nlp/__init__.py", line 15, in <module>
    from nemo.collections.nlp import data, losses, models, modules
  File "/NeMo-Aligner/venv/lib/python3.10/site-packages/nemo/collections/nlp/models/__init__.py", line 28, in <module>
    from nemo.collections.nlp.models.language_modeling import MegatronGPTPromptLearningModel
  File "/NeMo-Aligner/venv/lib/python3.10/site-packages/nemo/collections/nlp/models/language_modeling/__init__.py", line 16, in <module>
    from nemo.collections.nlp.models.language_modeling.megatron_gpt_prompt_learning_model import (
  File "/NeMo-Aligner/venv/lib/python3.10/site-packages/nemo/collections/nlp/models/language_modeling/megatron_gpt_prompt_learning_model.py", line 31, in <module>
    from nemo.collections.nlp.models.language_modeling.megatron_gpt_model import MegatronGPTModel
  File "/NeMo-Aligner/venv/lib/python3.10/site-packages/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py", line 41, in <module>
    from nemo.collections.nlp.models.language_modeling.megatron.falcon.falcon_spec import get_falcon_layer_spec
  File "/NeMo-Aligner/venv/lib/python3.10/site-packages/nemo/collections/nlp/models/language_modeling/megatron/falcon/falcon_spec.py", line 19, in <module>
    from megatron.core.transformer.attention import SelfAttention, SelfAttentionSubmodules
  File "/NeMo-Aligner/venv/lib/python3.10/site-packages/megatron/core/transformer/attention.py", line 12, in <module>
    from megatron.core.transformer.custom_layers.transformer_engine import SplitAlongDim
  File "/NeMo-Aligner/venv/lib/python3.10/site-packages/megatron/core/transformer/custom_layers/transformer_engine.py", line 7, in <module>
    import transformer_engine.pytorch as te
  File "/NeMo-Aligner/venv/lib/python3.10/site-packages/transformer_engine/pytorch/__init__.py", line 34, in <module>
    _load_library()
  File "/NeMo-Aligner/venv/lib/python3.10/site-packages/transformer_engine/pytorch/__init__.py", line 25, in _load_library
    so_path = next(so_dir.glob(f"transformer_engine_torch.*.{extension}"))
StopIteration
same error
Can you check if TE has built the required shared libraries? In particular, /NeMo-Aligner/venv/lib/python3.10/site-packages/transformer_engine should contain libtransformer_engine.so and something that looks like transformer_engine_torch.cpython-310-x86_64-linux-gnu.so.
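One way to run this check without importing the package (which would hit the same failure) is sketched below. It is only an illustration: the helper name check_te_libs is made up for this example, and the two glob patterns are assumptions taken from the file names mentioned above.

```python
import importlib.util
from pathlib import Path


def check_te_libs(pkg_dir):
    """Report which of the expected shared libraries exist in pkg_dir.

    The patterns are assumptions based on the file names quoted in this
    thread; other platforms or TE versions may name the files differently.
    """
    pkg_dir = Path(pkg_dir)
    return {
        "core": sorted(p.name for p in pkg_dir.glob("libtransformer_engine.so")),
        "torch": sorted(p.name for p in pkg_dir.glob("transformer_engine_torch.*.so")),
    }


if __name__ == "__main__":
    # find_spec locates a top-level package without executing its __init__,
    # so this works even when importing transformer_engine itself fails.
    spec = importlib.util.find_spec("transformer_engine")
    if spec is None or not spec.submodule_search_locations:
        print("transformer_engine is not installed in this environment")
    else:
        for location in spec.submodule_search_locations:
            print(location, check_te_libs(location))
```

An empty "torch" list alongside a non-empty "core" list would correspond to the case where the PyTorch extension was not built.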
If your TE install has libtransformer_engine.so but not transformer_engine_torch.*.so, then that means TE did not detect PyTorch during the build process. You can try forcing TE to build with PyTorch support by setting the NVTE_FRAMEWORK environment variable:
NVTE_FRAMEWORK=pytorch pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable
See the TE install instructions.
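For context on why a missing library surfaces as a bare StopIteration: the traceback shows next() being called on a glob, and next() raises StopIteration when the iterator is empty and no default is given. The sketch below reproduces that behavior in an empty temporary directory. The helper find_torch_extension is hypothetical and is not part of TE; it only shows how passing a default to next() avoids the exception.

```python
import tempfile
from pathlib import Path


def find_torch_extension(so_dir, extension="so"):
    """Return the first matching extension library, or None if none exists.

    Hypothetical helper for illustration; it mirrors the glob pattern seen
    in the traceback but supplies a default so next() does not raise.
    """
    pattern = f"transformer_engine_torch.*.{extension}"
    return next(Path(so_dir).glob(pattern), None)


with tempfile.TemporaryDirectory() as tmp:
    empty_dir = Path(tmp)
    try:
        # Same call shape as in the traceback: no match means StopIteration.
        next(empty_dir.glob("transformer_engine_torch.*.so"))
        raised = False
    except StopIteration:
        raised = True
    print(raised, find_torch_extension(empty_dir))
```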
I reinstalled flash-attn with pip install flash-attn==2.6.1 in the NGC PyTorch Docker image 24.06. When I run the training job, I get the following error: