aws-samples / awsome-distributed-training

Collection of best practices, reference architectures, model training examples and utilities to train large models on AWS.

FSDP EKS Example failing with: module 'torch.library' has no attribute 'register_fake' #491

Open · nghtm opened this issue 2 weeks ago

nghtm commented 2 weeks ago

Following the instructions in the HyperPod EKS workshop, running the FSDP EKS example on 2 p5 nodes fails with the error below, which points to train.py:

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/fsdp/train.py", line 281, in <module>
    main(args)
  File "/fsdp/train.py", line 168, in main
    model = AutoModelForCausalLM.from_config(model_config)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 439, in from_config
    model_class = _get_model_class(config, cls._model_mapping)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 388, in _get_model_class
    supported_models = model_mapping[type(config)]
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 763, in __getitem__
    return self._load_attr_from_module(model_type, model_name)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 777, in _load_attr_from_module
    return getattribute_from_module(self._modules[module_name], attr)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 693, in getattribute_from_module
    if hasattr(module, attr):
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/import_utils.py", line 1766, in __getattr__
    module = self._get_module(self._class_to_module[name])
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/import_utils.py", line 1780, in _get_module
    raise RuntimeError(
RuntimeError: Failed to import transformers.models.llama.modeling_llama because of the following error (look up to see its traceback):
module 'torch.library' has no attribute 'register_fake'
[2024-11-12 02:26:41,444] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1544) of binary: /usr/bin/python3
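
For context, torch.library.register_fake was only added in PyTorch 2.4, so this error usually means the torch build in the container is older than what the installed transformers/torchvision packages expect. A minimal check, not part of the original report and assuming shell access to the training container, to confirm what the image actually ships:

import torch

# Print the installed torch version and whether the attribute the
# traceback complains about exists in this build.
print("torch version:", torch.__version__)
print("torch.library.register_fake available:",
      hasattr(torch.library, "register_fake"))

If this prints False, the image's torch predates 2.4 and the failure comes from the dependency mix rather than from train.py itself.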
nghtm commented 5 days ago

I suspect this is due to an issue in the underlying Docker container used in the FSDP example; it needs further investigation.
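
As a rough way to narrow it down, and assuming the root cause is a version mismatch between torch and the transformers/torchvision packages baked into the image, the sketch below (not from the issue) lists the installed versions and re-runs the failing import in isolation:

import importlib.metadata as md

# List the versions baked into the image so they can be compared against
# the FSDP example's Dockerfile / requirements.
for pkg in ("torch", "torchvision", "transformers"):
    try:
        print(f"{pkg}=={md.version(pkg)}")
    except md.PackageNotFoundError:
        print(f"{pkg} is not installed")

# Importing the module named in the traceback directly should raise the
# same register_fake error if the container's package versions are the culprit.
from transformers.models.llama import modeling_llama  # noqa: F401
print("modeling_llama imported cleanly")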

cc @sean-smith