flashlight / wav2letter

Facebook AI Research's Automatic Speech Recognition Toolkit
https://github.com/facebookresearch/wav2letter/wiki

Mixed precision run-time error (Differential revision D21371575) #759

Open desync-securax opened 4 years ago

desync-securax commented 4 years ago

Bug Description

While trying to start training with AMP (automatic mixed precision), the following run-time error is raised:

Epoch 1 started!
terminate called after throwing an instance of 'std::invalid_argument'
  what(): Variable::addGrad: attempted to add child gradient of type f16 to a Variable of type f32. You might be performing an operation with two inputs of different types.

Note that the project compiles without issues.

NB: This bug is reproducible only on differential revision D21371575 by @jacobkahn.
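
For readers unfamiliar with this class of error, here is a minimal toy sketch of how a type mismatch can stay hidden during the forward pass and only surface when backward is called. It does not use flashlight at all: `ToyVariable`, `ToyHalfOp`, and `DType` are hypothetical stand-ins invented for illustration, and the check in `addGrad` is only assumed to resemble what the error message describes.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for illustration only; these are NOT flashlight types.
enum class DType { f16, f32 };

inline std::string dtypeName(DType t) {
  return t == DType::f16 ? "f16" : "f32";
}

struct ToyVariable {
  DType type;    // element type of the forward value
  bool hasGrad;  // whether a gradient has been accumulated

  explicit ToyVariable(DType t) : type(t), hasGrad(false) {}

  // Assumed to mimic the check named in the error message: a gradient is
  // accepted only if its type matches the type of the variable.
  void addGrad(DType gradType) {
    if (gradType != type) {
      throw std::invalid_argument(
          "addGrad: attempted to add child gradient of type " +
          dtypeName(gradType) + " to a Variable of type " + dtypeName(type));
    }
    hasGrad = true;
  }
};

// A toy operator whose forward pass implicitly casts its input to f16.
// Forward succeeds; the mismatch is detected only in backward.
struct ToyHalfOp {
  ToyVariable* input = nullptr;

  ToyVariable forward(ToyVariable& in) {
    input = &in;
    return ToyVariable(DType::f16);  // implicit cast, no check here
  }

  void backward(const ToyVariable& out) {
    // The incoming gradient carries the type of the output (f16).
    input->addGrad(out.type);
  }
};
```

Calling `forward` on an f32 `ToyVariable` succeeds, and the subsequent `backward` call throws `std::invalid_argument`, which mirrors the timing of the failure reported above (after "Epoch 1 started!").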

Reproduction Steps

Platform and Hardware

Hardware: 36-core Intel(R) Xeon(R) CPU E5-2696 v3 @ 2.30GHz / 192GB RAM / 8x NVidia GTX 2080Ti 12GB VRAM
OS: Linux Ubuntu 18.04 LTS
Compiler: GNU C and C++ compilers 7.5.0

Additional Context

jacobkahn commented 4 years ago

@desync-securax — thanks for reporting. I'm still ironing out operators in which we're doing implicit casts (and making these explicit) given some changes to the AMP code. I'm hoping to commit all of these changes in the next week or two. Note that these will differ from the original changes from mtmd in that operators will be strongly typed at forward time, meaning you won't get cryptic errors like this when calling backward.

Let me know if you have any other questions. I'll leave this issue open to track this as I start committing those changes. For now, you can continue using mtmd's branches as needed.
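
To illustrate what "strongly typed at forward time" could look like in principle, here is a small self-contained sketch. It is not the planned flashlight implementation: `ElemType`, `TypedValue`, `castTo`, and `add` are hypothetical names made up for this example. The idea shown is that a binary operator rejects mixed input types immediately, so the caller has to insert an explicit cast instead of hitting an error later in backward.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical types for illustration only; not part of flashlight.
enum class ElemType { f16, f32 };

struct TypedValue {
  ElemType type;  // element type tag
  float value;    // payload (stored as float in this toy, whatever the tag)
};

// Explicit cast: the caller states the target type on purpose.
inline TypedValue castTo(const TypedValue& v, ElemType target) {
  return TypedValue{target, v.value};
}

// A binary operator that checks its input types in the forward pass and
// fails right away, rather than deferring the failure to backward.
inline TypedValue add(const TypedValue& a, const TypedValue& b) {
  if (a.type != b.type) {
    throw std::invalid_argument(
        "add: inputs have different types; cast one of them explicitly");
  }
  return TypedValue{a.type, a.value + b.value};
}
```

With this design, `add(a, b)` on an f32 and an f16 value throws at the call site, while `add(a, castTo(b, ElemType::f32))` succeeds. The trade-off is more verbose call sites in exchange for errors that point at the operator that actually mixed the types.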