VITA-Group / Q-GaLore

Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients.
Apache License 2.0

I am encountering an error saving model checkpoints #6

Open Khaledbouza opened 4 months ago

Khaledbouza commented 4 months ago

I am encountering an AttributeError when the script saves a model checkpoint after some steps. I tried:

```python
# If using DDP
if hasattr(model, 'module'):
    model.module.save_pretrained(current_model_directory, max_shard_size='100GB')
else:
    model.save_pretrained(current_model_directory, max_shard_size='100GB')
```

but I still got a complicated error that also mentions the dataloader. What is this error about? It happens when the script tries to save the model checkpoint after some steps:

```
Update steps:   0%| | 100/150000 [32:29<462:48:31, 11.11s/it]
2024-07-18 20:36:51.265 | INFO | __main__:main:529 - Saving model and optimizer to checkpoints/llama_100m-2024-07-18-20-01-46/model_100, update step 100
Traceback (most recent call last):
  File "run_pretrain.py", line 664, in <module>
    main(args)
  File "run_pretrain.py", line 531, in main
    model.module.save_pretrained(current_model_directory, max_shard_size='100GB')
  File "/home/koko/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1709, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'LlamaForCausalLM' object has no attribute 'module'
[rank0]: Traceback (most recent call last):
[rank0]:   File "run_pretrain.py", line 664, in <module>
[rank0]:   File "run_pretrain.py", line 531, in main
[rank0]:     model.module.save_pretrained(current_model_directory, max_shard_size='100GB')
[rank0]:   File "/home/koko/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1709, in __getattr__
[rank0]:     raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
```
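For context, here is a minimal sketch of the unwrap-then-save logic I am trying to get working (it assumes the `model` and `current_model_directory` variables from run_pretrain.py; the `isinstance` check is just one way to detect the DDP wrapper, instead of the `hasattr` check above):

```python
from torch.nn.parallel import DistributedDataParallel

# Unwrap the model only if it is actually wrapped in DistributedDataParallel,
# so save_pretrained() is always called on the underlying LlamaForCausalLM.
model_to_save = model.module if isinstance(model, DistributedDataParallel) else model
model_to_save.save_pretrained(current_model_directory, max_shard_size='100GB')
```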

```
wandb: / 0.053 MB of 0.053 MB uploaded
wandb: Run history: loss, lr, throughput_batches, throughput_examples, throughput_tokens, tokens_seen, update_step
wandb:
wandb: Run summary:
wandb:                  loss 9.375
wandb:                    lr 0.0
wandb:    throughput_batches 0.76993
wandb:   throughput_examples 49.27552
wandb:     throughput_tokens 9572.05922
wandb:           tokens_seen 9905694
wandb:           update_step 99
wandb:
wandb: 🚀 View run test at: https://wandb.ai/khaledbouzaiene365/test/runs/xe47q376
wandb: ⭐️ View project at: https://wandb.ai/khaledbouzaiene365/test
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 1 other file(s)
wandb: Find logs at: ./wandb/run-20240718_200147-xe47q376/logs
wandb: WARNING The new W&B backend becomes opt-out in version 0.18.0; try it out with wandb.require("core")! See https://wandb.me/wandb-core for more information.
E0718 20:37:00.711596 140291664311360 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 179820) of binary: /home/koko/miniconda3/envs/myenv/bin/python
Traceback (most recent call last):
  File "/home/koko/miniconda3/envs/myenv/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.3.1', 'console_scripts', 'torchrun')())
  File "/home/koko/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/koko/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/home/koko/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/koko/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/koko/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

run_pretrain.py FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time       : 2024-07-18_20:37:00
  host       : DESKTOP-M0GCNFO.
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 179820)
  error_file:
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```