Resume training - Githubissues

ZhengmaoHe commented 1 year ago

Thank you for this awesome work!

I want to resume training, but I have run into a problem. Based on my understanding, I modified the following code in go1_gym_learn/ppo_cse/__init__.py and tried to resume training.

class RunnerArgs(PrefixProto, cli=False):

    ...

    # load and resume
    resume = True
    load_run = -1  # -1 = last run
    checkpoint = -1  # -1 = last saved model

    label = "gait-conditioned-agility/2023-05-18/train"
    dirs = glob.glob(f"../runs/{label}/*")
    logdir = sorted(dirs)[0]

    resume_path = logdir[3:-1]  # updated from load_run and chkpt
    resume_curriculum = True

loader = ML_Logger(root="http://127.0.0.1:8081",
                               prefix=RunnerArgs.resume_path)

After starting training, I see the following output from ml_dash.server:

[2023-05-18 21:25:56 +0800] - (sanic.access)[INFO][127.0.0.1:53682]: GET http://127.0.0.1:8081/files/runs/gait-conditioned-agility/base-policy/train/053912.36310/checkpoints/ac_weights_last.pt  404 651

and

Traceback (most recent call last):
  File "train.py", line 263, in <module>
    train_go1(headless=True)
  File "train.py", line 222, in train_go1
    runner = Runner(env, device=f"cuda:{gpu_id}")
  File "/home/mao/Desktop/FunPro/walk-these-ways/go1_gym_learn/ppo_cse/__init__.py", line 102, in __init__
    weights = loader.load_torch("checkpoints/ac_weights_last.pt")
  File "/home/mao/.conda/envs/legged/lib/python3.8/site-packages/ml_logger/ml_logger.py", line 2098, in load_torch
    return torch.load(fn_or_buff, map_location=map_location, **kwargs)
  File "/home/mao/.conda/envs/legged/lib/python3.8/site-packages/torch/serialization.py", line 608, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/home/mao/.conda/envs/legged/lib/python3.8/site-packages/torch/serialization.py", line 777, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, '<'.

Could you give me some advice to solve this problem?

gmargo11 commented 1 year ago

Hi @MariaBana ,

Unfortunately, I've noticed that unpickling .pt files on a new machine can be unreliable. I switched to .jit files for deployment and for the play script: https://github.com/Improbable-AI/walk-these-ways/blob/master/scripts/play.py.

I haven't figured out why yet, but I think it can result from a different pickle version, a different torch version, other differences in dependencies, or a corrupted .pt file. If the .pt file was corrupted during download, you could try deleting it and re-downloading it from GitHub. Otherwise, the most reliable solution is to train a new model from scratch in your own development setup and then try loading that model, since you'll know it's in the correct format.

-Gabe
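
For anyone checking whether a .pt file is corrupted: the load key '<' in the traceback above means the first byte the unpickler read was '<'. That fits the 404 response in the server log being read in place of a checkpoint, though this is an inference from the logs and not something confirmed in this thread. Below is a minimal sketch for inspecting what a file actually contains before loading it. describe_checkpoint_file is a hypothetical helper, not part of torch or ml_logger.

```python
def describe_checkpoint_file(path):
    """Guess what kind of file `path` is from its first few bytes."""
    with open(path, "rb") as f:
        head = f.read(16)
    if not head:
        return "empty"
    if head.lstrip().startswith(b"<"):
        # Markup such as an HTML error page, not a serialized model.
        return "markup"
    if head.startswith(b"PK"):
        # Zip container, the default torch.save format since torch 1.6.
        return "zip"
    if head[:1] == b"\x80":
        # Pickle protocol 2+ header, as written by the legacy torch.save format.
        return "pickle"
    return "unknown"
```

A result of "markup" or "empty" suggests the file was never a valid checkpoint, so differences in pickle or torch versions would not be the cause in that case.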

ZhengmaoHe commented 12 months ago

Thank you for your suggestion. After various attempts, I finally solved this problem by directly using weights = torch.load(RunnerArgs.resume_path), which might be helpful for others.
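
The exact modification was not posted in this thread. A minimal sketch of the idea, assuming the checkpoint sits on the local disk under the runs directory and that run directory names sort chronologically (both are assumptions); latest_checkpoint and load_weights are hypothetical helper names:

```python
import glob
import os


def latest_checkpoint(runs_root, label, filename="checkpoints/ac_weights_last.pt"):
    """Return the checkpoint path inside the last run directory under runs_root/label."""
    pattern = os.path.join(runs_root, label, "*")
    run_dirs = sorted(d for d in glob.glob(pattern) if os.path.isdir(d))
    if not run_dirs:
        raise FileNotFoundError(f"no run directories match {pattern}")
    # Assumption: names sort chronologically, so the last entry is the newest run.
    path = os.path.join(run_dirs[-1], filename)
    if not os.path.isfile(path):
        raise FileNotFoundError(f"checkpoint not found: {path}")
    return path


def load_weights(path, device="cpu"):
    """Load a checkpoint from a local file, bypassing the ML_Logger server."""
    import torch  # imported here so the path helper works without torch installed

    return torch.load(path, map_location=device)
```

With these helpers, the `weights = loader.load_torch("checkpoints/ac_weights_last.pt")` line from the traceback could be replaced by something like `weights = load_weights(latest_checkpoint("../runs", label))`. A missing file then raises FileNotFoundError instead of an unpickling error.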

fangzhiyuan1995 commented 4 months ago

> Thank you for your suggestion. After various attempts, I finally solved this problem by directly using weights = torch.load(RunnerArgs.resume_path), which might be helpful for others.

Excuse me, could you post the modified code here?

Improbable-AI / walk-these-ways

Resume training #25