Open williamFalcon opened 5 years ago
The error is coming from PyTorch internals. This is strange, because the casts Amp inserts are autograd-exposed, so they should be exactly reversed in backward. In other words, if the forward pass succeeds, the backward pass should succeed as well.
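To illustrate what I mean by autograd-exposed (a minimal sketch, not tied to your model): the cast is recorded in the graph, and the gradient that flows back through it comes out in the original dtype.

import torch

x = torch.randn(4, requires_grad=True)   # float32 leaf
y = x.half()                             # the cast is recorded by autograd
y.backward(torch.ones_like(y))           # gradient flows back through the cast

print(y.dtype)       # torch.float16
print(x.grad.dtype)  # torch.float32 -- backward reverses the cast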
What are the types of the inputs? Also, does it still fail if you explicitly cast the inputs to float? For unblocking purposes, you can cast the inputs to float and run the function with casting disabled, i.e.
with amp.disable_casts():
    nce_loss(self.float(), r_src.float(), r_trg.float(), mask_mat.float())
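For context, a slightly fuller sketch of that workaround; it assumes the model has already gone through amp.initialize, and nce_loss plus the tensor names below are just stand-ins taken from your snippet:

import torch
from apex import amp

def nce_loss(r_src, r_trg, mask_mat):
    # Stand-in for the real NCE loss; assume it is the numerically
    # sensitive part that should run entirely in float32.
    return ((r_src @ r_trg.t()) * mask_mat).mean()

def compute_loss(r_src, r_trg, mask_mat):
    # Workaround: turn off Amp's patched casts for this region and pass
    # explicit float32 inputs, so nothing inside sees half tensors.
    with amp.disable_casts():
        return nce_loss(r_src.float(), r_trg.float(), mask_mat.float())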
Also, please tell me you're running with opt_level="O1". O2 is a disaster, a legacy from our initial experiments with mixed precision, sadly still necessary to support internal usage. The PyTorch native integration, which is my main task right now, will be O1-like exclusively.
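For reference, the usual O1 setup looks roughly like this (the model, optimizer, and batch below are placeholders):

import torch
import torch.nn.functional as F
from apex import amp

model = torch.nn.Linear(128, 10).cuda()                  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# O1 patches torch functions to cast on the fly; the model's weights stay FP32.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

x = torch.randn(32, 128, device="cuda")                  # placeholder batch
target = torch.randint(0, 10, (32,), device="cuda")

optimizer.zero_grad()
loss = F.cross_entropy(model(x), target)

# Scale the loss so FP16 gradients don't underflow, then backprop and step.
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()

Under O1 the parameters stay FP32 and only whitelisted ops run in half, which is why it tends to be more robust to dtype mismatches like this one.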
In general, debugging the backward pass is hard. You can try to catch the exception to at least figure out the name of the op that's failing on the C++ side:
$ gdb python
...
(gdb) catch throw
(gdb) run script.py args
... gdb will halt when the exception is thrown
(gdb) bt
... C++-side backtrace that will tell you the name of the op that's failing
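Another rough option, from the Python side, is autograd's anomaly detection: when backward fails it also prints the traceback of the forward op that created the failing node (the model and inputs below are placeholders):

import torch

# Anomaly detection makes a failure inside backward() also report the
# forward-pass traceback of the op whose backward raised, which usually
# pinpoints where in the model the dtype mismatch originates. It slows
# training down, so only enable it while debugging.
with torch.autograd.detect_anomaly():
    loss = model(inputs)      # placeholder forward pass producing a scalar loss
    loss.backward()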
Any updates on this topic? I encountered the same issue. My PyTorch version is 1.2.0.
Upgrading to PyTorch 1.3.1 worked for me.
What is the best way to figure out where this issue is happening in the graph? The message is also unclear: what exactly is expecting the Half? It would be helpful to print the particular node in the graph where this breaks.
Sorry, I don't really have a super clean implementation which can reproduce this.
The piece of code that likely has the issue is here: