chnsh / DCRNN_PyTorch

Diffusion Convolutional Recurrent Neural Network Implementation in PyTorch
MIT License

Weights at epoch 64 not found #1

Closed: victorsoda closed this issue 4 years ago

victorsoda commented 4 years ago

I ran run_demo_pytorch.py with the following command:

    python run_demo_pytorch.py --config_filename=data/model/pretrained/METR-LA/config.yaml

This is what I got:

    Traceback (most recent call last):
      File "run_demo_pytorch.py", line 33, in <module>
        run_dcrnn(args)
      File "run_demo_pytorch.py", line 18, in run_dcrnn
        supervisor = DCRNNSupervisor(adj_mx=adj_mx, **supervisor_config)
      File "/home/cyd/DCRNN_PyTorch/model/pytorch/dcrnn_supervisor.py", line 50, in __init__
        self.load_model()
      File "/home/cyd/DCRNN_PyTorch/model/pytorch/dcrnn_supervisor.py", line 93, in load_model
        assert os.path.exists('models/epo%d.tar' % self._epoch_num), 'Weights at epoch %d not found' % self._epoch_num
    AssertionError: Weights at epoch 64 not found

Could you please upload 'models/epo64.tar' to the repo? I hope to reproduce the MAE results reported in the README. Thanks!

weiyinchiang commented 4 years ago

@zhenhuascut @victorsoda 'models/epo64.tar' is not provided in this branch...

zhenhuascut commented 4 years ago

Thanks, I have solved it.

yuanlics commented 4 years ago

I still don't see epo64.tar... It would be great if you could upload it. Many thanks!

diving16 commented 4 years ago

I still don't see it either... Could you upload it so that I can reproduce your work? Thanks a lot.

htn274 commented 4 years ago

I ran into the same problem, but at epoch 51. Can you help me solve it? Thanks a lot.

baosws commented 4 years ago

Setting train/epoch to 0 in data/model/dcrnn_test_config.yaml solves the problem for me.

htn274 commented 4 years ago

"Setting train/epoch to 0 in data/model/dcrnn_test_config.yaml solves the problem."

Thanks! I tried it and it worked.

chnsh commented 4 years ago

Hi everyone, I am sorry I have been unable to reply. @htn274 is correct: the epoch setting is intended to be the checkpoint from which to resume. I have since shut down the server (I am a poor student 🥼) and haven't been able to retrieve the weights at epoch 64. If you just set it to 0 and train, you will be able to reproduce the results.

semink commented 4 years ago

I don't believe setting train/epoch to 0 solves this problem. When you train the model with this code, the trained models are saved as models/epo0.tar, models/epo1.tar, and so on. So if you run run_demo with train/epoch = 0, you are running the demo with a model trained for only the first epoch. Could you add the best model (epoXX.tar) to models/ for METR-LA and PEMS-BAY, so that we can test with those models?
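The naming scheme described in this comment (models/epoN.tar, one file per epoch) makes it easy to check which checkpoints actually exist before choosing an epoch number. The following is a minimal sketch using only the standard library; the helper name `available_epochs` is hypothetical and not part of the repository, and the default directory `models` is taken from the path in the traceback.

```python
import os
import re


def available_epochs(model_dir="models"):
    """Return the sorted epoch numbers N for which <model_dir>/epoN.tar exists.

    Hypothetical helper for illustration; it only inspects file names and
    does not load or validate the checkpoints themselves.
    """
    if not os.path.isdir(model_dir):
        return []
    pattern = re.compile(r"^epo(\d+)\.tar$")
    epochs = []
    for name in os.listdir(model_dir):
        match = pattern.match(name)
        if match:
            epochs.append(int(match.group(1)))
    return sorted(epochs)
```

Calling `available_epochs()` from the repository root lists the epochs that could be set in the config; an empty list means nothing has been trained or saved yet.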

chnsh commented 4 years ago

@semink that is incorrect. The model only tries to load existing weights if epoch > 0, so setting epoch=0 will do the job, as @baosws helpfully pointed out.

I am closing this issue for now. The solution is to train the model and, once it has trained, set the correct epoch number in config.yaml; that should work.
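The behavior described in this reply can be sketched as follows. This is an illustration only, not the repository's code: the function name `resolve_checkpoint` is hypothetical, and the only details taken from the thread are the path pattern and message from the assertion in the traceback and the statement that weights are loaded only when epoch > 0.

```python
import os


def resolve_checkpoint(epoch, model_dir="models"):
    """Return the checkpoint path to load, or None when starting from scratch.

    Hypothetical helper: epoch <= 0 means no weights are loaded, and any
    positive epoch requires <model_dir>/epo<epoch>.tar to exist.
    """
    if epoch <= 0:
        return None  # nothing is loaded, so a missing file cannot fail here
    path = os.path.join(model_dir, "epo%d.tar" % epoch)
    # Same check and message as the assertion shown in the traceback.
    assert os.path.exists(path), "Weights at epoch %d not found" % epoch
    return path
```

Under these assumptions, `resolve_checkpoint(0)` returns None, while `resolve_checkpoint(64)` fails with the error from the original report unless models/epo64.tar has been produced by training.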

mdanb commented 4 years ago

@chnsh this didn't work for me. I still get the same error

mdanb commented 4 years ago

@chnsh actually, I ended up making the change in data/model/pretrained/METR-LA/config.yaml, and that did it.