I followed the instructions in the HyperPod EKS workshop, but running the FSDP EKS example on 2 p5 nodes fails with the following error, which points to train.py:
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/fsdp/train.py", line 281, in <module>
    main(args)
  File "/fsdp/train.py", line 168, in main
    model = AutoModelForCausalLM.from_config(model_config)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 439, in from_config
    model_class = _get_model_class(config, cls._model_mapping)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 388, in _get_model_class
    supported_models = model_mapping[type(config)]
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 763, in __getitem__
    return self._load_attr_from_module(model_type, model_name)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 777, in _load_attr_from_module
    return getattribute_from_module(self._modules[module_name], attr)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 693, in getattribute_from_module
    if hasattr(module, attr):
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/import_utils.py", line 1766, in __getattr__
    module = self._get_module(self._class_to_module[name])
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/import_utils.py", line 1780, in _get_module
    raise RuntimeError(
RuntimeError: Failed to import transformers.models.llama.modeling_llama because of the following error (look up to see its traceback):
module 'torch.library' has no attribute 'register_fake'
[2024-11-12 02:26:41,444] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1544) of binary: /usr/bin/python3
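The last line of the traceback suggests a version mismatch between torch and transformers in the container, not a bug in train.py itself. To my knowledge, `torch.library.register_fake` only exists from PyTorch 2.4 onward, so an older torch paired with a recent transformers release would fail in this way. This is an assumption, not something confirmed against the workshop image. A minimal sketch for checking this inside the training container follows; the helper names `installed_version`, `has_attribute`, and `report` are hypothetical and not part of the workshop code.

```python
import importlib
import importlib.metadata as metadata
from typing import Dict, Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def has_attribute(module_name: str, attr: str) -> Optional[bool]:
    """Return whether `module_name` exposes `attr`, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return hasattr(module, attr)


def report() -> Dict[str, object]:
    """Collect the facts relevant to the 'register_fake' error (hypothetical helper)."""
    return {
        "torch": installed_version("torch"),
        "transformers": installed_version("transformers"),
        # The attribute named in the error message above.
        "torch.library.register_fake": has_attribute("torch.library", "register_fake"),
    }


if __name__ == "__main__":
    for key, value in report().items():
        print(f"{key}: {value}")
```

If the last entry prints `False`, the installed torch predates the API that the installed transformers expects. In that case, either pinning transformers to a release that matches the image's torch or upgrading torch would be the things to try.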