THCudaCheck FAIL file=apex/contrib/csrc/optimizers/fused_adam_cuda_kernel.cu line=226 error=98 : unrecognized error code
Traceback (most recent call last):
File "finetune.py", line 261, in <module>
optimizer.step()
File "/home/ai/anaconda3/lib/python3.6/site-packages/apex/contrib/optimizers/fp16_optimizer.py", line 154, in step
grad_norms=norm_groups)
File "/home/ai/anaconda3/lib/python3.6/site-packages/apex/contrib/optimizers/fused_adam.py", line 180, in step
group['weight_decay'])
RuntimeError: cuda runtime error (98) : unrecognized error code at apex/contrib/csrc/optimizers/fused_adam_cuda_kernel.cu:226
If you suspect this is an IPython bug, please report it at:
https://github.com/ipython/ipython/issues
or send an email to the mailing list at ipython-dev@python.org
You can print a more detailed traceback right now with "%tb", or use "%debug"
to interactively debug it.
Extra-detailed tracebacks for bug-reporting purposes can be enabled via:
%config Application.verbose_crash=True
Segmentation fault (core dumped)
I need to train one model entirely in FP16 on multiple GPUs, but amp fails with DataParallel at opt levels 'O2' and 'O3', so I have no choice but to use FP16_Optimizer from apex.contrib. When I use FusedLayerNorm, it causes a segmentation fault too.
Environment: CUDA 10.0, PyTorch 1.2.0, GCC 5.4.0
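In case it helps whoever triages this, here is a small script for collecting those versions programmatically. It is only a sketch: `collect_env` is a made-up helper name, not part of apex or PyTorch, and it assumes `nvcc` on the PATH is the same compiler that built the apex extensions, which may not be true on every machine. It reports the CUDA version PyTorch was built against next to the local `nvcc` release so the two can be compared.

```python
import importlib
import shutil
import subprocess
import sys


def collect_env():
    """Return a dict of version strings relevant to an apex/CUDA bug report.

    Hypothetical helper for illustration; not part of apex or PyTorch.
    """
    info = {"python": sys.version.split()[0]}

    try:
        torch = importlib.import_module("torch")
    except ImportError:
        info["torch"] = "not installed"
    else:
        info["torch"] = str(torch.__version__)
        # CUDA version PyTorch itself was built against (None for CPU-only builds).
        info["torch_cuda"] = str(torch.version.cuda)
        info["cuda_available"] = str(torch.cuda.is_available())

    # Assumption: the nvcc found on PATH is the one used to compile apex's extensions.
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        info["nvcc"] = "not found on PATH"
    else:
        result = subprocess.run(
            [nvcc, "--version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
        lines = result.stdout.strip().splitlines()
        # Keep only the line that names the toolkit release, if there is one.
        info["nvcc"] = next((line.strip() for line in lines if "release" in line), "unknown")

    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value}")
```

If `torch_cuda` and the `nvcc` release differ, that mismatch is worth mentioning in the report, though I have not confirmed it is the cause of the error above.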
The apex is installed with