chnsh / DCRNN_PyTorch

Diffusion Convolutional Recurrent Neural Network Implementation in PyTorch
MIT License

Reason for huge improvement in PyTorch version #3

Closed · guiyihan closed this issue 4 years ago

guiyihan commented 4 years ago

Thanks for your great work. Compared to the original TensorFlow results, this PyTorch version has better performance. Could you explain the reason for this boost?

cvignac commented 4 years ago

It seems that the loss function is not exactly the same.

In the data there are two features: traffic speed and traffic volume (I don't know in what order). It seems that in the PyTorch version the loss is computed on these two features, whereas in the TF DCRNN it is computed using only the predictions for the first feature (line 272, https://github.com/liyaguang/DCRNN/blob/master/model/dcrnn_supervisor.py).

Is that correct?
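If so, the difference would amount to something like this (a minimal sketch, not the actual code from either repo; the `masked_mae` helper is a simplified stand-in for the masked MAE both implementations use, and the shapes are assumptions):

```python
import torch

def masked_mae(preds, labels, null_val=0.0):
    # Ignore entries whose label equals the null value (missing data),
    # as the DCRNN codebases do.
    mask = (labels != null_val).float()
    mask = mask / mask.mean()
    return (torch.abs(preds - labels) * mask).mean()

# Hypothetical shapes: (horizon, batch, num_nodes, num_features)
preds = torch.rand(12, 64, 207, 2)
labels = torch.rand(12, 64, 207, 2)

loss_both = masked_mae(preds, labels)                    # loss over both features
loss_speed = masked_mae(preds[..., 0], labels[..., 0])   # TF-style: first feature only
print(loss_both.item(), loss_speed.item())
```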

chnsh commented 4 years ago

@cvignac that is a good catch. However, it seems like that may not be the issue: I think that in the dataset there is only 1 dimension, so indexing on 0 as you pointed out does nothing to the values.

I've not been able to dig deeper to figure out why the boost exists, though.

razvanc92 commented 4 years ago

@chnsh I'm not sure that's the case. The dataset has two dimensions, speed and time. The model should predict both, since they are required by the decoder when predicting the next step, but the evaluation should only be on the first dimension (speed).

chnsh commented 4 years ago

Oh, I see. Thanks for pointing it out; I will try to re-evaluate as soon as possible, though I will be out of office for some time.

chnsh commented 4 years ago

@razvanc92 I think you're right that the dataset has 2 dimensions (when it's constructed), but what is happening is that the final loss is calculated by slicing for the 1st dimension: the final outputs are sized (12, 6912, 207) and the dataset is (12, 6912, 207, 2), so the final loss calculation is on the correct dimensions and the loss value is not wrong. Can you confirm?
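A quick shape check of what I mean (a sketch with the shapes from this thread; the tensor names are assumptions, not the repo's actual variables):

```python
import torch

outputs = torch.rand(12, 6912, 207)    # predictions: (horizon, samples, nodes)
y = torch.rand(12, 6912, 207, 2)       # ground truth with 2 features

y_speed = y[..., 0]                    # keep only the first feature
assert outputs.shape == y_speed.shape  # both (12, 6912, 207): dimensions line up
mae = torch.abs(outputs - y_speed).mean()
print(mae.item())
```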

Noahprog commented 4 years ago

I think this issue is still not sufficiently answered, though.

I'm really curious about why the PyTorch version has better performance. I've been digging through the code for quite some time but haven't found anything suspicious. I will keep you up to date if I find something.

Noahprog commented 4 years ago

I know what the problem is.

In the original TensorFlow implementation, the test evaluation calculates the MAE for every timestep separately. In the paper, only the MAE for the last timestep is reported.

The following configuration is used to show an example. [screenshot of the configuration]

The following log (run with the original TensorFlow code) shows the separate MAEs for all three timesteps. This PyTorch implementation does not do that, but shows only the average of those values. [screenshot of the log output]

Here, the validation MAE (marked with (1)) is equal to that of the exact same run with this PyTorch implementation. Also, the MAE of the last timestep (marked with (2)) is the same as the MAE reported in the published paper.

Concluding: both implementations are equally good, but the PyTorch implementation lacks this separate per-timestep MAE calculation. @chnsh this mistake could have been prevented.

P.S. This is probably the same problem as in issue #7.
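To make the reporting difference concrete, here's a minimal sketch (the tensor names, shapes, and the plain unmasked MAE are assumptions; the actual code in both repos uses a masked MAE):

```python
import torch

preds = torch.rand(12, 64, 207)    # (horizon, batch, nodes)
labels = torch.rand(12, 64, 207)

# One MAE value per horizon step, averaged over batch and nodes.
per_step_mae = torch.abs(preds - labels).mean(dim=(1, 2))
print(per_step_mae)         # TF logs each of these; the paper quotes per-horizon values
print(per_step_mae.mean())  # this PyTorch repo logs only this single average
```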

chnsh commented 4 years ago

@Noahprog thanks for digging! Good find, want to send a PR to fix it?

semink commented 2 years ago

I think the answer is related to #16.