I am encountering an AttributeError. I tried the following save logic:

# If using DDP
if hasattr(model, 'module'):
    model.module.save_pretrained(current_model_directory, max_shard_size='100GB')
else:
    model.save_pretrained(current_model_directory, max_shard_size='100GB')

but I got a complicated error about the dataloader. What is this error about? It happens when the script tries to save a model checkpoint after some steps:

Update steps: 0%| | 100/150000 [32:29<462:48:31, 11.11s/it]2024-07-18 20:36:51.265 | INFO | __main__:main:529 - Saving model and optimizer to checkpoints/llama_100m-2024-07-18-20-01-46/model_100, update step 100
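For reference, here is a minimal, self-contained sketch of the save logic I am aiming for (unwrap the DDP container if present, write only from rank 0). The tiny LlamaConfig is only a stand-in so the snippet runs on its own; it is not the real 100M config from run_pretrain.py:

```python
# Sketch: DDP-aware checkpoint save. `model` may or may not be wrapped in
# DistributedDataParallel; the tiny config below only keeps the example runnable.
import os
import torch.distributed as dist
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(hidden_size=64, intermediate_size=128,
                     num_hidden_layers=2, num_attention_heads=4, vocab_size=1000)
model = LlamaForCausalLM(config)  # in the real script this may be DDP-wrapped

current_model_directory = "checkpoints/example_model"
global_rank = dist.get_rank() if dist.is_initialized() else 0

if global_rank == 0:
    os.makedirs(current_model_directory, exist_ok=True)
    # Unwrap the DDP container if present; a bare model has no `.module`.
    model_to_save = model.module if hasattr(model, "module") else model
    model_to_save.save_pretrained(current_model_directory, max_shard_size="100GB")

if dist.is_initialized():
    dist.barrier()  # keep other ranks from racing ahead of the checkpoint write
```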
Traceback (most recent call last):
  File "run_pretrain.py", line 664, in <module>
    main(args)
  File "run_pretrain.py", line 531, in main
    model.module.save_pretrained(current_model_directory, max_shard_size='100GB')
  File "/home/koko/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1709, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'LlamaForCausalLM' object has no attribute 'module'
[rank0]: Traceback (most recent call last):
[rank0]:   File "run_pretrain.py", line 664, in <module>
[rank0]:   File "run_pretrain.py", line 531, in main
[rank0]:     model.module.save_pretrained(current_model_directory, max_shard_size='100GB')
[rank0]:   File "/home/koko/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1709, in __getattr__
[rank0]:     raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
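The AttributeError itself says what went wrong: the object reaching line 531 is a plain LlamaForCausalLM, not a DDP wrapper, so it has no `.module` attribute. A more compact guard that behaves the same as the hasattr branch above would be (sketch only, variable names taken from the traceback):

```python
# Fall back to the model itself when there is no DDP wrapper exposing `.module`.
model_to_save = getattr(model, "module", model)
model_to_save.save_pretrained(current_model_directory, max_shard_size="100GB")
```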
wandb: / 0.053 MB of 0.053 MB uploaded
wandb: Run history:
wandb: loss █████████████▇▇▇▄▃▂▂▁▂▂▂▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb: lr ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb: throughput_batches ▁▇▇▇▇▇▁▇▇▇▂▇▇▇▇▇▇▂▇▇█▇▇▇▇▇▇▇▇▇▇▆▇▇▇▇▇▇▆▇
wandb: throughput_examples ▁▇▇▇▇▇▁▇▇▇▂▇▇▇▇▇▇▂▇▇█▇▇▇▇▇▇▇▇▇▇▆▇▇▇▇▇▇▆▇
wandb: throughput_tokens ▁▇▇▇▇▇▁▇▇▇▂▇▇▇▇▇▇▂▇▇█▇▇▇▇▇▇▇▇▇▇▆▇▇▇▇▇▇▆▇
wandb: tokens_seen ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
wandb: update_step ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
wandb:
wandb: Run summary:
wandb: loss 9.375
wandb: lr 0.0
wandb: throughput_batches 0.76993
wandb: throughput_examples 49.27552
wandb: throughput_tokens 9572.05922
wandb: tokens_seen 9905694
wandb: update_step 99
wandb:
wandb: 🚀 View run test at: https://wandb.ai/khaledbouzaiene365/test/runs/xe47q376
wandb: ⭐️ View project at: https://wandb.ai/khaledbouzaiene365/test
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 1 other file(s)
wandb: Find logs at: ./wandb/run-20240718_200147-xe47q376/logs
wandb: WARNING The new W&B backend becomes opt-out in version 0.18.0; try it out with wandb.require("core")! See https://wandb.me/wandb-core for more information.
E0718 20:37:00.711596 140291664311360 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 179820) of binary: /home/koko/miniconda3/envs/myenv/bin/python
Traceback (most recent call last):
  File "/home/koko/miniconda3/envs/myenv/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.3.1', 'console_scripts', 'torchrun')())
  File "/home/koko/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/koko/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/home/koko/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/koko/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/koko/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
run_pretrain.py FAILED
Failures:
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-07-18_20:37:00
host : DESKTOP-M0GCNFO.
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 179820)
error_file:
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
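The ChildFailedError block above is just torchrun reporting that rank 0 exited with code 1; the underlying failure is the AttributeError shown earlier. If the per-rank traceback ever gets swallowed, the linked elastic docs recommend decorating the entrypoint with `@record` so the worker writes its traceback to the error file. A sketch of what that might look like in run_pretrain.py (assuming `main(args)` is the entrypoint; the argument construction is omitted here):

```python
# Sketch: record child tracebacks for torchrun, per
# https://pytorch.org/docs/stable/elastic/errors.html
from torch.distributed.elastic.multiprocessing.errors import record

@record
def main(args):
    ...  # existing training loop from run_pretrain.py would go here

if __name__ == "__main__":
    main(None)  # CLI argument parsing omitted; the real script builds args itself
```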