aws-samples / awsome-distributed-training

Collection of best practices, reference architectures, model training examples and utilities to train large models on AWS.

FSDP EKS Example failing with: module 'torch.library' has no attribute 'register_fake' #491

Open · nghtm opened this issue 2 weeks ago

nghtm commented 2 weeks ago

Following the instructions in the HyperPod EKS workshop, running the FSDP EKS example on 2 p5 nodes fails with the error below, which points to train.py:

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/fsdp/train.py", line 281, in <module>
    main(args)
  File "/fsdp/train.py", line 168, in main
    model = AutoModelForCausalLM.from_config(model_config)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 439, in from_config
    model_class = _get_model_class(config, cls._model_mapping)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 388, in _get_model_class
    supported_models = model_mapping[type(config)]
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 763, in __getitem__
    return self._load_attr_from_module(model_type, model_name)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 777, in _load_attr_from_module
    return getattribute_from_module(self._modules[module_name], attr)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 693, in getattribute_from_module
    if hasattr(module, attr):
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/import_utils.py", line 1766, in __getattr__
    module = self._get_module(self._class_to_module[name])
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/import_utils.py", line 1780, in _get_module
    raise RuntimeError(
RuntimeError: Failed to import transformers.models.llama.modeling_llama because of the following error (look up to see its traceback):
module 'torch.library' has no attribute 'register_fake'
[2024-11-12 02:26:41,444] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1544) of binary: /usr/bin/python3
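
For context, torch.library.register_fake was only added in PyTorch 2.4, so this error usually means the torch build in the container is older than what the installed transformers/torchvision packages expect. A minimal check, not part of the original report and assuming shell access to the training container, to confirm what the image actually ships:

import torch

# Print the installed torch version and whether the attribute the
# traceback complains about exists in this build.
print("torch version:", torch.__version__)
print("torch.library.register_fake available:",
      hasattr(torch.library, "register_fake"))

If this prints False, the image's torch predates 2.4 and the failure comes from the dependency mix rather than from train.py itself.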
nghtm commented 5 days ago

I suspect this is due to an issue in the underlying Docker container used in the FSDP example; it needs further investigation.
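
As a rough way to narrow it down, and assuming the root cause is a version mismatch between torch and the transformers/torchvision packages baked into the image, the sketch below (not from the issue) lists the installed versions and re-runs the failing import in isolation:

import importlib.metadata as md

# List the versions baked into the image so they can be compared against
# the FSDP example's Dockerfile / requirements.
for pkg in ("torch", "torchvision", "transformers"):
    try:
        print(f"{pkg}=={md.version(pkg)}")
    except md.PackageNotFoundError:
        print(f"{pkg} is not installed")

# Importing the module named in the traceback directly should raise the
# same register_fake error if the container's package versions are the culprit.
from transformers.models.llama import modeling_llama  # noqa: F401
print("modeling_llama imported cleanly")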

cc @sean-smith