facebookresearch / swav

PyTorch implementation of SwAV https://arxiv.org/abs/2006.09882

Problems running eval_linear.py with the pretrained SwAV model #40

Closed ye-yechen closed 3 years ago

ye-yechen commented 4 years ago

Hi, thanks for your excellent work! I ran into a problem when running the code. First, I trained the SwAV model with the command python -m torch.distributed.launch --nproc_per_node=2 main_swav.py ..., and the model parameters were saved in checkpoint.pth.tar. But when I run eval_linear.py on the pretrained SwAV model with the command python -m torch.distributed.launch --nproc_per_node=2 eval_linear.py --pretrained checkpoint.pth.tar, I get some errors. The logs are:

Traceback (most recent call last):
  File "/home/yc/codes/swav/src/utils.py", line 144, in restart_from_checkpoint
    msg = value.load_state_dict(checkpoint[key], strict=False)
TypeError: load_state_dict() got an unexpected keyword argument 'strict'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "eval_linear.py", line 397, in <module>
    main()
  File "eval_linear.py", line 201, in main
    scheduler=scheduler,
  File "/home/yc/codes/swav/src/utils.py", line 147, in restart_from_checkpoint
    msg = value.load_state_dict(checkpoint[key])
  File "/home/yc/anaconda3/envs/tf2/lib/python3.6/site-packages/torch/optim/optimizer.py", line 123, in load_state_dict
    raise ValueError("loaded state dict contains a parameter group "
ValueError: loaded state dict contains a parameter group that doesn't match the size of optimizer's group
Traceback (most recent call last):
  File "/home/yc/anaconda3/envs/tf2/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/yc/anaconda3/envs/tf2/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/yc/anaconda3/envs/tf2/lib/python3.6/site-packages/torch/distributed/launch.py", line 261, in <module>
    main()
  File "/home/yc/anaconda3/envs/tf2/lib/python3.6/site-packages/torch/distributed/launch.py", line 257, in main
    cmd=cmd)

Does this mean that something goes wrong when the optimizer is restored from the checkpoint? Could you help me? Thanks!

mathildecaron31 commented 3 years ago

Hi @ye-yechen, make sure that the directory passed as --dump_path to eval_linear.py does not contain a checkpoint.pth.tar file.
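
One way to check this before launching the evaluation is sketched below. It is only an illustration: has_stale_checkpoint is a hypothetical helper, not part of the swav repository, and the temporary directories stand in for real dump paths. The only thing taken from this thread is the file name checkpoint.pth.tar.

```python
import os
import tempfile

# File name mentioned in this thread; everything else here is hypothetical.
CHECKPOINT_NAME = "checkpoint.pth.tar"


def has_stale_checkpoint(dump_path):
    """Return True if dump_path already contains a checkpoint file."""
    return os.path.isfile(os.path.join(dump_path, CHECKPOINT_NAME))


# Temporary directories stand in for the pretraining and evaluation dump paths.
with tempfile.TemporaryDirectory() as pretrain_dir, \
        tempfile.TemporaryDirectory() as eval_dir:
    # Simulate the checkpoint that pretraining left behind.
    open(os.path.join(pretrain_dir, CHECKPOINT_NAME), "wb").close()

    # Reusing the pretraining directory would trigger a restart attempt.
    reused = has_stale_checkpoint(pretrain_dir)
    # A fresh directory has nothing to restart from.
    fresh = has_stale_checkpoint(eval_dir)
    print(reused, fresh)
```

If the check returns True for the directory you plan to pass as --dump_path, pick another directory for the evaluation run.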

You should use a different dump_path for pretraining and evaluation. In your case, the code is trying to restart the evaluation from checkpoint.pth.tar, which is the checkpoint of pretraining, not evaluation.
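
To make the two stacked tracebacks easier to read, here is a rough, torch-free sketch of the pattern they suggest: the restore code first calls load_state_dict with strict=False, and when the object does not accept that keyword (as with an optimizer) it falls back to the plain call. StandInOptimizer and restore are hypothetical stand-ins written for this explanation, not the code in src/utils.py, and the group sizes are made-up numbers.

```python
class StandInOptimizer:
    """Hypothetical stand-in for an optimizer: no strict= keyword,
    and it rejects a state whose parameter group sizes differ from its own."""

    def __init__(self, group_sizes):
        self.group_sizes = group_sizes

    def load_state_dict(self, state):
        saved_sizes = [len(g["params"]) for g in state["param_groups"]]
        if saved_sizes != self.group_sizes:
            raise ValueError(
                "loaded state dict contains a parameter group "
                "that doesn't match the size of optimizer's group"
            )
        return "optimizer loaded"


def restore(objects, checkpoint):
    """Try strict=False first, then fall back to the plain call."""
    results = {}
    for key, value in objects.items():
        if key not in checkpoint:
            continue
        try:
            results[key] = value.load_state_dict(checkpoint[key], strict=False)
        except TypeError:
            # The object does not accept strict=, so call it without the keyword.
            results[key] = value.load_state_dict(checkpoint[key])
    return results


# Made-up sizes: the pretraining optimizer tracked 10 parameters,
# the evaluation optimizer only tracks 2.
pretrain_checkpoint = {"optimizer": {"param_groups": [{"params": list(range(10))}]}}
eval_optimizer = StandInOptimizer(group_sizes=[2])

try:
    restore({"optimizer": eval_optimizer}, pretrain_checkpoint)
    error = None
except ValueError as exc:
    error = str(exc)
print(error)
```

The TypeError in the first traceback is therefore expected and handled. The failure that actually stops the run is the ValueError: the optimizer state saved during pretraining does not fit the optimizer built for evaluation.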

ye-yechen commented 3 years ago

OK, so I commented out the call to restart_from_checkpoint in eval_linear.py and now the code runs. But if I want to train the model on the CIFAR-10 dataset, which parameters are important? I get low accuracy on CIFAR-10. Thanks.

mathildecaron31 commented 3 years ago

Good.

I have no experience running models on CIFAR-10.

libingDY commented 3 years ago

I don't quite understand how to solve this problem. Can you explain it more clearly? Thank you.