dvgodoy / PyTorchStepByStep

Official repository of my book: "Deep Learning with PyTorch Step-by-Step: A Beginner's Guide"
https://pytorchstepbystep.com
MIT License

Why do we have detach() in self.alphas = alphas.detach() in the Attention class in Chapter 9? #23

Closed eisthf closed 3 years ago

eisthf commented 3 years ago

I wonder why alphas is detach()-ed before being saved to self.alphas in the Attention class. I tried self.alphas = alphas, that is, without detach(), and trained the model. There was no difference in performance, so I believe the reason lies elsewhere.

Thank you for your great teaching in your great book!

dvgodoy commented 3 years ago

Hi,

Thank you for supporting my work, and for your kind words :-)

Regarding the "detachment" of the alphas, the main idea is to prevent unintentional changes to the dynamic computation graph. If you don't detach the alphas, it shouldn't change anything in the training process, as you already noticed.

But let's say you pause training and decide to take a peek at the alphas. You may end up performing an operation on them, and, since the graph keeps track of every operation performed on gradient-requiring tensors and their dependencies, it will impact the graph. That may be an issue if you resume training afterward.
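
Here's a toy example of that scenario (a simplified stand-in, not the actual Attention class from the book): if the stored alphas are still attached to the graph, "peeking" at them with a backward call sends gradients into the model; once detached, they are completely cut off.

```python
import torch
import torch.nn as nn

# Simplified stand-in for an attention module that stores its weights
# for later inspection (hypothetical, not the book's Attention class)
class ToyAttention(nn.Module):
    def __init__(self, detach_alphas=True):
        super().__init__()
        self.linear = nn.Linear(4, 4)
        self.detach_alphas = detach_alphas
        self.alphas = None

    def forward(self, x):
        scores = self.linear(x)                 # (N, L, 4)
        alphas = torch.softmax(scores, dim=-1)  # (N, L, 4)
        # store either a detached copy or the graph-connected tensor
        self.alphas = alphas.detach() if self.detach_alphas else alphas
        return alphas * x                       # toy "context" output

torch.manual_seed(42)
x = torch.randn(2, 3, 4)

# Without detach: the stored alphas are still part of the graph, so
# "peeking" at them with a backward pass sends gradients into the model
attn = ToyAttention(detach_alphas=False)
attn(x)
attn.alphas.sum().backward()
print(attn.linear.weight.grad is None)  # False - gradients were polluted

# With detach: the stored alphas are cut off from the graph
attn = ToyAttention(detach_alphas=True)
attn(x)
print(attn.alphas.requires_grad)        # False - safe to play with
```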

In other circumstances, like the validation loop, we wrap the operations in a no_grad context manager to prevent potential problems. The same goes for the detachment of the alphas - it's there as a safeguard, to make sure it's totally safe to play with the values in self.alphas. It's also convenient, because you'd need to detach them anyway if you wanted to convert the alphas to NumPy arrays.
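
On the convenience side, another quick check (again, not the book's code): PyTorch refuses to convert a graph-connected tensor to NumPy, so the detachment has to happen at some point anyway.

```python
import torch

alphas = torch.softmax(torch.randn(1, 3, requires_grad=True), dim=-1)

try:
    alphas.numpy()   # fails for a tensor that requires grad
except RuntimeError as e:
    print(e)         # "Can't call numpy() on Tensor that requires grad..."

# Detached (as self.alphas is stored), the conversion is straightforward
print(alphas.detach().numpy())
```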

Hope it helps :-)

eisthf commented 3 years ago

Thank you so much! :-))