chuanchuan12138 opened this issue 5 years ago
Here's my experimentation with and without tanh in the encoder. Note that I made sure to set my model to eval() and wrap prediction in no_grad, and to use no_grad during validation as well. This differs from this repo, and I believe it should have been implemented that way.
[Figure: predictions without tanh in the encoder]
[Figure: predictions with tanh in the encoder]
In addition, during training the validation loss decreases faster with tanh.
[Figure: validation loss over 10 epochs with tanh in the encoder]
Note: I've trained, validated, and predicted over the whole dataset for testing purposes. My assumption was that I should get near 99% accuracy if the underlying equations were working properly.
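For clarity, this is roughly how I wrap validation/prediction; it's a minimal sketch, and `model`, `data_loader`, and `criterion` are placeholders rather than names from this repo:

```python
import torch

def evaluate(model, data_loader, criterion, device="cpu"):
    """Run validation/prediction without tracking gradients.

    The argument names are placeholders for whatever main.py builds;
    the point is only the eval() + no_grad() wrapping.
    """
    model.eval()                      # switch off dropout / batch-norm updates
    total_loss = 0.0
    with torch.no_grad():             # no graph is built, so no gradients accumulate
        for x, y in data_loader:
            x, y = x.to(device), y.to(device)
            y_pred = model(x)
            total_loss += criterion(y_pred, y).item()
    model.train()                     # restore training mode for the next epoch
    return total_loss / len(data_loader)
```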
Hi worulz, thanks for your careful experiment, it really clears up my confusion. As for your no_grad operation: I think main.py isn't meant to include a separate validation or prediction step, it just trains the model. The predict function, in my opinion, only aims to show the loss for that training epoch, so you can consider it part of the training process. I don't know if that's correct, but I think no_grad belongs in a validation or test process, so it's necessary if you want to evaluate the model, just not in this place, maybe in another function. Thank you again for your clear pics for comparison.
Firstly, thanks for your code, it really helped me a lot to understand the paper. But when I debugged the code, I found that in modules.py seanny used tanh in the decoder attention while omitting it in the encoder attention, yet in the paper formulas 8 and 12 both use tanh to compute part of the attention weights. I don't know why; can anybody offer some help? Thanks in advance!
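For reference, this is roughly what formula (8) of the paper looks like in code. It's only a sketch of the encoder input attention with tanh included; the layer names, shapes, and the `InputAttention` class are my own placeholders and don't have to match modules.py:

```python
import torch
import torch.nn as nn

class InputAttention(nn.Module):
    """Encoder input attention, roughly formula (8)-(9) of the DA-RNN paper.

    n = number of driving series, T = window length, m = encoder hidden size.
    """
    def __init__(self, window_len, hidden_size):
        super().__init__()
        self.W_e = nn.Linear(2 * hidden_size, window_len, bias=False)
        self.U_e = nn.Linear(window_len, window_len, bias=False)
        self.v_e = nn.Linear(window_len, 1, bias=False)

    def forward(self, x, h_prev, s_prev):
        # x: (batch, T, n) driving series; h_prev, s_prev: (batch, m)
        x_k = x.permute(0, 2, 1)                             # (batch, n, T), one row per series
        hs = torch.cat([h_prev, s_prev], dim=1)              # (batch, 2m)
        hs = hs.unsqueeze(1).expand(-1, x_k.size(1), -1)     # (batch, n, 2m)
        # formula (8): e_t^k = v_e^T tanh(W_e [h_{t-1}; s_{t-1}] + U_e x^k)
        e = self.v_e(torch.tanh(self.W_e(hs) + self.U_e(x_k)))  # (batch, n, 1)
        alpha = torch.softmax(e.squeeze(-1), dim=1)          # formula (9): attention weights
        return alpha                                         # (batch, n)
```

Whether dropping the tanh in the encoder actually matters in practice is exactly what the plots above try to show.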