Open desync-securax opened 4 years ago
@desync-securax — thanks for reporting. I'm still ironing out operators in which we're doing implicit casts (and making these explicit) given some changes to the AMP code. I'm hoping to commit all of these changes in the next week or two. Note that these will differ from the original changes from mtmd in that operators will be strongly typed at forward time, meaning you won't get cryptic errors like this when calling backward.
Let me know if you have any other questions. I'll leave this issue open to track this as I start committing those changes. For now, you can continue using mtmd's branches as needed.
Bug Description
When starting training with AMP enabled, the following runtime error occurs:
Epoch 1 started! terminate called after throwing an instance of 'std::invalid_argument' what(): Variable::addGrad: attempted to add child gradient of type f16 to a Variable of type f32. You might be performing an operation with two inputs of different types.
Note that the project compiles without issues.
NB: This bug is reproducible only on differential revision D21371575 by @jacobkahn
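To illustrate the failure mode, here is a minimal, self-contained sketch of how an implicit cast in a forward operator can surface only later as a gradient type mismatch, and how a type check at forward time would catch it earlier. It does not use the library's code: `ToyVariable`, `OpResult`, `looseAdd`, `strictAdd`, and `backward` are hypothetical names invented for this example, and the real autograd logic is likely more involved.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for illustration only; not the library's real types.
enum class DType { f16, f32 };

inline std::string typeName(DType t) {
  return t == DType::f16 ? "f16" : "f32";
}

struct ToyVariable {
  DType type;
  bool hasGrad = false;

  explicit ToyVariable(DType t) : type(t) {}

  // Mirrors the reported check: an incoming gradient must match the
  // variable's own type, otherwise std::invalid_argument is thrown.
  void addGrad(DType gradType) {
    if (gradType != type) {
      throw std::invalid_argument(
          "addGrad: attempted to add child gradient of type " +
          typeName(gradType) + " to a Variable of type " + typeName(type));
    }
    hasGrad = true;
  }
};

// Result of a binary operator, remembering its inputs for the backward pass.
struct OpResult {
  DType type;
  ToyVariable* lhs;
  ToyVariable* rhs;
};

// Loosely typed operator: mixed inputs are accepted and the result silently
// takes the lower precision (an implicit cast). Nothing fails here.
OpResult looseAdd(ToyVariable& a, ToyVariable& b) {
  DType out = (a.type == DType::f16 || b.type == DType::f16) ? DType::f16
                                                             : DType::f32;
  return {out, &a, &b};
}

// Strongly typed operator: mismatched inputs are rejected at forward time,
// so the error points at the offending operation.
OpResult strictAdd(ToyVariable& a, ToyVariable& b) {
  if (a.type != b.type) {
    throw std::invalid_argument("add: inputs have different types (" +
                                typeName(a.type) + " vs " + typeName(b.type) +
                                ")");
  }
  return {a.type, &a, &b};
}

// Backward pass: the gradient has the type of the operator's output and is
// pushed to both inputs.
void backward(const OpResult& r) {
  r.lhs->addGrad(r.type);
  r.rhs->addGrad(r.type);
}
```

In this sketch, `looseAdd` on an f32 and an f16 input succeeds and the exception only appears inside `backward`, far from the operation that caused it, whereas `strictAdd` throws immediately. Until operators are checked this way, casting one input explicitly so both share a type before the operation avoids the mismatch.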
Reproduction Steps
Platform and Hardware
Hardware: 36-core Intel(R) Xeon(R) CPU E5-2696 v3 @ 2.30GHz / 192GB RAM / 8x NVidia GTX 2080Ti 12GB VRAM
OS: Linux Ubuntu 18.04 LTS
Compiler: GNU C and C++ compilers 7.5.0
Additional Context