Keyerror in loading final ckpt

yipclam commented 2 months ago

Hi, I am trying to reproduce the scores in your paper with your final checkpoint, but I get the following error.

[2024-09-09 19:41:34,814][accelerate.checkpointing][INFO] - All model weights loaded successfully
[2024-09-09 19:41:34,814][accelerate.checkpointing][INFO] - All optimizer states loaded successfully
[2024-09-09 19:41:34,814][accelerate.checkpointing][INFO] - All scheduler states loaded successfully
[2024-09-09 19:41:34,814][accelerate.checkpointing][INFO] - All dataloader sampler states loaded successfully
[2024-09-09 19:41:34,816][accelerate.checkpointing][INFO] - Could not load random states
Error executing job with overrides: []
Traceback (most recent call last):
  File "/root/autodl-tmp/yzl/dg/digirl/scripts/run.py", line 120, in main
    eval_loop(env = env,
  File "/root/autodl-tmp/yzl/dg/digirl/digirl/algorithms/eval_loop.py", line 61, in eval_loop
    trainer.load(os.path.join(save_path, 'trainer.pt'))
  File "/root/autodl-tmp/yzl/dg/digirl/digirl/algorithms/filteredbc/trainer.py", line 78, in load
    self.accelerator.load_state(path)
  File "/root/autodl-tmp/yzl/.conda/envs/dg/lib/python3.10/site-packages/accelerate/accelerator.py", line 3156, in load_state
    self.step = override_attributes["step"]
KeyError: 'step'

I suspect this is an accelerate version problem, because a similar issue is reported in https://github.com/huggingface/accelerate/issues/3067. However, downgrading to v0.31 did not fix it; instead I get a different error:

Traceback (most recent call last):
  File "/root/autodl-tmp/yzl/dg/digirl/scripts/run.py", line 3, in <module>
    from digirl.environment import BatchedAndroidEnv
  File "/root/autodl-tmp/yzl/dg/digirl/digirl/environment/__init__.py", line 1, in <module>
    from .env_utils import batch_interact_environment
  File "/root/autodl-tmp/yzl/dg/digirl/digirl/environment/env_utils.py", line 4, in <module>
    import accelerate
  File "/root/autodl-tmp/yzl/.conda/envs/dg/lib/python3.10/site-packages/accelerate/__init__.py", line 16, in <module>
    from .accelerator import Accelerator
  File "/root/autodl-tmp/yzl/.conda/envs/dg/lib/python3.10/site-packages/accelerate/accelerator.py", line 35, in <module>
    from .checkpointing import load_accelerator_state, load_custom_state, save_accelerator_state, save_custom_state
  File "/root/autodl-tmp/yzl/.conda/envs/dg/lib/python3.10/site-packages/accelerate/checkpointing.py", line 24, in <module>
    from .utils import (
  File "/root/autodl-tmp/yzl/.conda/envs/dg/lib/python3.10/site-packages/accelerate/utils/__init__.py", line 183, in <module>
    from .fsdp_utils import load_fsdp_model, load_fsdp_optimizer, merge_fsdp_weights, save_fsdp_model, save_fsdp_optimizer
  File "/root/autodl-tmp/yzl/.conda/envs/dg/lib/python3.10/site-packages/accelerate/utils/fsdp_utils.py", line 36, in <module>
    import torch.distributed.checkpoint.format_utils as dist_cp_format_utils
  File "/root/autodl-tmp/yzl/.conda/envs/dg/lib/python3.10/site-packages/torch/distributed/checkpoint/format_utils.py", line 12, in <module>
    from torch.distributed.checkpoint.default_planner import (
ImportError: cannot import name '_EmptyStateDictLoadPlanner' from 'torch.distributed.checkpoint.default_planner' (/root/autodl-tmp/yzl/.conda/envs/dg/lib/python3.10/site-packages/torch/distributed/checkpoint/default_planner.py)
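The last frame of the traceback above is inside torch itself (format_utils.py importing from default_planner), so it may help to record the installed versions of both accelerate and torch when reporting this kind of error. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the thread does not state which torch version was installed.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    # The two packages that appear in the traceback.
    for name, version in installed_versions(["accelerate", "torch"]).items():
        print(f"{name}: {version or 'not installed'}")
```

Running this inside the same conda environment as run.py shows which versions the interpreter actually resolves, which is useful when packages may be installed in more than one location.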

yipclam commented 2 months ago

Details are as follows.

(dg) yzl@autodl-container-06ff47b7fc-1ea0e1e3:~/dg/digirl/scripts$ python run.py --config-path config/main --config-name eval_only
task_set: webshop
task_split: test
eval_sample_mode: sequential
max_steps: 20
huggingface_token: xx
wandb_key: ''
gemini_key: xx
policy_lm: /root/autodl-tmp/yzl/dg/Auto-UI-Base
critic_lm: roberta-base
capacity: 2000
epochs: 5
batch_size: 8
bsize: 4
rollout_size: 16
grad_accum_steps: 32
warmup_iter: 0
actor_epochs: 20
trajectory_critic_epochs: 5
lm_lr: 0.0001
critic_lr: 0.0001
max_grad_norm: 0.01
gamma: 0.5
use_lora: false
agent_name: autoui
do_sample: true
temperature: 1.0
tau: 0.01
max_new_tokens: 128
record: false
use_wandb: false
entity_name: ''
project_name: ''
android_avd_home: /root/autodl-tmp/yzl/.android/avd
emulator_path: /root/autodl-tmp/yzl/.android/emulator/emulator
adb_path: /root/autodl-tmp/yzl/.android/platform-tools/adb
cache_dir: /root/autodl-tmp/yzl/.cache
assets_path: /root/autodl-tmp/yzl/dg/digirl/digirl/environment/android/assets/task_set
save_path: /root/autodl-tmp/yzl/logs/ckpts/webshop-off2on-digirl/
run_name: autoui-general-eval-only
train_algorithm: digirl
task_mode: evaluate
parallel: single
eval_iterations: 6
save_freq: 3

The token has not been saved to the git credentials helper. Pass add_to_git_credential=True in this function directly or --add-to-git-credential if using via huggingface-cli if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to /root/autodl-tmp/yzl/.cache/huggingface/token
Login successful

Agent: autoui
Evauation mode
/root/autodl-tmp/yzl/.conda/envs/dg/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
/root/autodl-tmp/yzl/.local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
  warnings.warn(
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:08<00:00, 4.04s/it]
starting appium server at port 6652
starting appium server at port 6653
starting appium server at port 6654
starting appium server at port 6655
Using DigiRL trainer
Loading from previous checkpoint
[2024-09-09 19:15:18,719][accelerate.accelerator][INFO] - Loading states from /root/autodl-tmp/yzl/logs/ckpts/webshop-off2on-digirl/trainer.pt
[2024-09-09 19:15:20,319][accelerate.checkpointing][INFO] - All model weights loaded successfully
[2024-09-09 19:15:20,319][accelerate.checkpointing][INFO] - All optimizer states loaded successfully
[2024-09-09 19:15:20,319][accelerate.checkpointing][INFO] - All scheduler states loaded successfully
[2024-09-09 19:15:20,320][accelerate.checkpointing][INFO] - All dataloader sampler states loaded successfully
[2024-09-09 19:15:20,321][accelerate.checkpointing][INFO] - Could not load random states
Error executing job with overrides: []
Traceback (most recent call last):
  File "/root/autodl-tmp/yzl/dg/digirl/scripts/run.py", line 120, in main
    eval_loop(env = env,
  File "/root/autodl-tmp/yzl/dg/digirl/digirl/algorithms/eval_loop.py", line 61, in eval_loop
    trainer.load(os.path.join(save_path, 'trainer.pt'))
  File "/root/autodl-tmp/yzl/dg/digirl/digirl/algorithms/digirl/trainer.py", line 305, in load
    self.accelerator.load_state(path)
  File "/root/autodl-tmp/yzl/.conda/envs/dg/lib/python3.10/site-packages/accelerate/accelerator.py", line 3156, in load_state
    self.step = override_attributes["step"]
KeyError: 'step'

BiEchi commented 2 months ago

Did you try the latest version of accelerate?
It might be a problem with the checkpoint. Try downloading it again (without interrupting the download) and then loading it.
Storing things under the root directory is not advised, as it can cause unexpected behavior.
If none of the suggestions above help, try running a minimal working example from the official HuggingFace tutorial. If that minimal example fails as well, please reach out to the accelerate team instead. If it runs correctly, please come back here and I'll see what I can do for you.
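For the suggestion about re-downloading the checkpoint, one way to tell whether a download was truncated or corrupted is to hash the files from two independent downloads and compare the results. The sketch below is an illustration under assumptions: the thread does not mention any published reference checksum, the helper names `sha256_of_file` and `checkpoint_digests` are hypothetical, and the code handles both a single file and a directory because the thread does not say which one trainer.pt is.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checkpoint_digests(root):
    """Map each file at or under root to its SHA-256 digest.

    Keys are the file name (if root is a file) or POSIX-style paths
    relative to root (if root is a directory).
    """
    root = Path(root)
    if root.is_file():
        return {root.name: sha256_of_file(root)}
    return {
        path.relative_to(root).as_posix(): sha256_of_file(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }
```

If `checkpoint_digests(first_download) == checkpoint_digests(second_download)` holds for two separate downloads (both path names are placeholders), a corrupted download becomes a less likely explanation for the load error.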

yipclam commented 2 months ago

I had tried v0.31 and v0.33 before without success. The latest version, v0.34, released last week, works. Thanks for the help!
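Since the outcome depended on the accelerate version, a script could check the version string before attempting to load a checkpoint. The sketch below is a simplified illustration: the helper names `release_tuple` and `at_least` are hypothetical, and 0.34 is only the version reported to work in this thread, not a documented minimum requirement. Pre-release suffixes are ignored here; the `packaging` library handles them more rigorously.

```python
import re


def release_tuple(version):
    """Parse the leading numeric components of a version string.

    For example, '0.34.2' and 'v0.34.2' both become (0, 34, 2).
    Anything after the numeric part (such as '.dev0') is ignored.
    """
    match = re.match(r"\d+(?:\.\d+)*", version.lstrip("v"))
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def at_least(version, minimum):
    """Return True if version is greater than or equal to minimum."""
    return release_tuple(version) >= release_tuple(minimum)
```

For instance, `at_least("0.33.0", "0.34")` is false and `at_least("0.34.0", "0.34")` is true, which matches the versions reported in this thread.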

DigiRL-agent / digirl

Keyerror in loading final ckpt #17