VITA-Group / Q-GaLore

Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients.
Apache License 2.0

I am encountering an error saving model checkpoints #6

Open Khaledbouza opened 4 months ago

Khaledbouza commented 4 months ago

I am encountering an AttributeError when the script saves a model checkpoint after some steps. I tried:

```python
# If using DDP
if hasattr(model, 'module'):
    model.module.save_pretrained(current_model_directory, max_shard_size='100GB')
else:
    model.save_pretrained(current_model_directory, max_shard_size='100GB')
```

but I still got a complicated error that also mentions the dataloader. What is this error about? It happens when the script tries to save the model checkpoint after some steps:

```
Update steps:   0%| | 100/150000 [32:29<462:48:31, 11.11s/it]
2024-07-18 20:36:51.265 | INFO | __main__:main:529 - Saving model and optimizer to checkpoints/llama_100m-2024-07-18-20-01-46/model_100, update step 100
Traceback (most recent call last):
  File "run_pretrain.py", line 664, in <module>
    main(args)
  File "run_pretrain.py", line 531, in main
    model.module.save_pretrained(current_model_directory, max_shard_size='100GB')
  File "/home/koko/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1709, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'LlamaForCausalLM' object has no attribute 'module'
[rank0]: Traceback (most recent call last):
[rank0]:   File "run_pretrain.py", line 664, in <module>
[rank0]:   File "run_pretrain.py", line 531, in main
[rank0]:     model.module.save_pretrained(current_model_directory, max_shard_size='100GB')
[rank0]:   File "/home/koko/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1709, in __getattr__
[rank0]:     raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
```
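For context, here is a minimal sketch of the unwrap-then-save logic I am trying to get working (it assumes the `model` and `current_model_directory` variables from run_pretrain.py; the `isinstance` check is just one way to detect the DDP wrapper, instead of the `hasattr` check above):

```python
from torch.nn.parallel import DistributedDataParallel

# Unwrap the model only if it is actually wrapped in DistributedDataParallel,
# so save_pretrained() is always called on the underlying LlamaForCausalLM.
model_to_save = model.module if isinstance(model, DistributedDataParallel) else model
model_to_save.save_pretrained(current_model_directory, max_shard_size='100GB')
```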

```
wandb: / 0.053 MB of 0.053 MB uploaded
wandb: Run history: loss, lr, throughput_batches, throughput_examples, throughput_tokens, tokens_seen, update_step
wandb:
wandb: Run summary:
wandb:                  loss 9.375
wandb:                    lr 0.0
wandb:    throughput_batches 0.76993
wandb:   throughput_examples 49.27552
wandb:     throughput_tokens 9572.05922
wandb:           tokens_seen 9905694
wandb:           update_step 99
wandb:
wandb: 🚀 View run test at: https://wandb.ai/khaledbouzaiene365/test/runs/xe47q376
wandb: ⭐️ View project at: https://wandb.ai/khaledbouzaiene365/test
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 1 other file(s)
wandb: Find logs at: ./wandb/run-20240718_200147-xe47q376/logs
wandb: WARNING The new W&B backend becomes opt-out in version 0.18.0; try it out with wandb.require("core")! See https://wandb.me/wandb-core for more information.
E0718 20:37:00.711596 140291664311360 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 179820) of binary: /home/koko/miniconda3/envs/myenv/bin/python
Traceback (most recent call last):
  File "/home/koko/miniconda3/envs/myenv/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.3.1', 'console_scripts', 'torchrun')())
  File "/home/koko/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/koko/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/home/koko/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/koko/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/koko/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

run_pretrain.py FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time       : 2024-07-18_20:37:00
  host       : DESKTOP-M0GCNFO.
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 179820)
  error_file:
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```