Closed: Hyperparticle closed this issue 5 years ago.
Upgraded to CUDA 10.0 and PyTorch 1.0.1, now I get a segmentation fault with Apex enabled.
I also have this error (not on pytorch-bert). Same setup (CUDA 10 and latest PyTorch 1.0.1).
Me too - PyTorch 1.0.1, CUDA 10. It's not specific to pytorch-pretrained-BERT, the script below is enough for me:
import torch
import apex
input = torch.rand(3, 10).cuda()
fln = apex.normalization.FusedLayerNorm(10).cuda()
fln(input)
I got this example to fail on a V100 too. I've now also tested on a k80 and this example works well with CUDA 10 and pytorch 1.0.1.post2 🤔
@geniki @thomwolf Strange, I don't get any errors with the script above, but I still get the runtime error when running pytorch-pretrained-BERT (using a Titan RTX).
@geniki
Me too - PyTorch 1.0.1, CUDA 10. It's not specific to pytorch-pretrained-BERT, the script below is enough for me:
import torch
import apex
input = torch.rand(3, 10).cuda()
fln = apex.normalization.FusedLayerNorm(10).cuda()
fln(input)
When I run this^ I get:
ModuleNotFoundError: No module named 'fused_layer_norm_cuda'
Also getting it on a pytorch-pretrained-BERT experiment.
Not sure if these issues (mine and the one originally posted) are related though...
@mrdbourke I think you may have compiled apex without cuda support. You need to compile it with python setup.py install --cpp_ext --cuda_ext.
Thank you, just realised I didn't use the extension... my bad.
This fixed it.
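For anyone else hitting the same ModuleNotFoundError: a quick sanity check (just a sketch; fused_layer_norm_cuda is the compiled module a --cuda_ext build installs) is to probe for the extension without importing apex at all:

```python
import importlib.util

def apex_cuda_ext_available() -> bool:
    """True if apex's compiled fused kernels (built with --cuda_ext)
    are importable, without actually loading them."""
    return importlib.util.find_spec("fused_layer_norm_cuda") is not None

if not apex_cuda_ext_available():
    # A plain `pip install apex` or a build without --cuda_ext lands here.
    print("apex was likely built without --cuda_ext; reinstall with the extensions")
```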
@mcarilli any hint on a possible source of error from you guys?
Sorry for the delayed response, my bandwidth right now is completely consumed cleaning up the mixed precision API (https://github.com/NVIDIA/apex/compare/api_refactor?expand=1).** I didn't write FusedLayerNorm (it came in from our MLPerf efforts) and I haven't had time to debug it. @thorjohnsen is currently using it in our own implementation of BERT.
@geniki Thank you for the minimal repro. @Hyperparticle @thomwolf When you say "I get a segmentation fault with Apex enabled" in https://github.com/NVIDIA/apex/issues/156#issuecomment-464115433, do you mean the segmentation fault occurs specifically when you try to use FusedLayerNorm, or at some other point?
**Unrelated, but useful: I'll be presenting a preview of the new API in a webinar tomorrow. It's working, but I don't have documentation or examples yet. I will add it to master by next week.
@mcarilli I can confirm that I do get the segmentation fault when calling the FusedLayerNorm code, but I haven't investigated exactly where. I don't get one when I use regular LayerNorm.
@Hyperparticle @thomwolf @geniki While I wait for the results of Thor's runs, one thing that occurs to me is that when you upgraded PyTorch, the existing (installed) Apex binaries may no longer have been compatible, causing your segfault. Try a full pip uninstall apex, then cd apex_repo_dir; rm -rf build; pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" . and see if the segfault persists.
@mcarilli Thanks, that fixed the segfault. But now I still get the same FusedLayerNorm error.
@geniki The mini repro runs fine on my setup (cuda 10 with v100)
@Hyperparticle Can you provide some more information on how to repro this issue? which pretrained model are you using?
A script (if possible) with the repro would be of great help.
Thanks @mcarilli. This fixed it for me - at least the snippet I posted above. @Hyperparticle does the snippet above run for you?
@geniki @jjsjann123 The snippet works, but I'm still seeing an error for my use-case. I'm running the tutorial code from this section in pytorch-pretrained-BERT with apex enabled. I'll try to debug it and get a minimal code snippet extracted with the tensor operation.
Thanks a lot. We are having a hard time reproducing the bug. Having a repro script would make it much faster for us to debug the problem. Looking forward to your update.
@jjsjann123 This is basically what the code is doing:
import torch
import apex
import importlib
fused_layer_norm_cuda = importlib.import_module("fused_layer_norm_cuda")
input_ = torch.rand([32, 63, 768]).cuda()
weight_ = torch.rand(768).cuda()
bias_ = torch.rand(768).cuda()
normalized_shape = weight_.size()
eps = 1e-12
output, mean, invvar = fused_layer_norm_cuda.forward_affine(input_, normalized_shape, weight_, bias_, eps)
My GPU is now unavailable, so I can't verify if this causes the problem. If not, then it could either be the values in the tensors that are the problem (which I will have to save and upload somewhere), or some other extraneous property of the tensors.
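If the values (not just the shapes) turn out to matter, one hedged way to capture them is to serialize the failing inputs so the crash can be replayed elsewhere. This is only a sketch using pickle on plain Python data; at the real call site you would convert the CUDA tensors first (e.g. input_.cpu().tolist()), or use torch.save directly:

```python
import pickle

# Sketch: record the exact inputs that trigger the crash, so the repro
# is not limited to fresh random tensors of the same shape.
failing_case = {
    "input_shape": [32, 63, 768],
    "normalized_shape": [768],
    "eps": 1e-12,
    # "input_values": input_.cpu().tolist(),  # add this at the call site
}

blob = pickle.dumps(failing_case)   # or torch.save(failing_case, "repro.pt")
restored = pickle.loads(blob)
assert restored == failing_case
```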
root@d0c3981dfbe3:/workspace# cat repro.py
import torch
import apex
import importlib
fused_layer_norm_cuda = importlib.import_module("fused_layer_norm_cuda")
input_ = torch.rand([32, 63, 768]).cuda()
weight_ = torch.rand(768).cuda()
bias_ = torch.rand(768).cuda()
normalized_shape = weight_.size()
eps = 1e-12
output, mean, invvar = fused_layer_norm_cuda.forward_affine(input_, normalized_shape, weight_, bias_, eps)
torch.cuda.synchronize()
root@d0c3981dfbe3:/workspace# python repro.py
root@d0c3981dfbe3:/workspace#
This is working fine for me as well :(
Seems like recompiling apex cleanly, as @mcarilli indicated, fixed the problem for me too! Both @geniki's and @Hyperparticle's examples work on my machine (as does my current project). Thanks a lot!
@thomwolf Well that sounds like a relief. As for me, I'll have to see if the old code is still lingering somewhere on my system. I'll have to test it in a couple days. @jjsjann123 If it works for others, then you can close this issue.
I'll close the issue and feel free to open a new one and ping me on that if things don't work out for you @Hyperparticle
Whew, this is a useful gotcha to know about. good old emergency repair procedure number one: turn it off and on again. Glad people seem to be happy, especially since as I said, I don't have the bandwidth to do a deep dive debug right this second.
Note to self: make the setup.py smarter to avoid such cases in the future.
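A hypothetical sketch of that kind of setup.py guard (this is not apex's actual code, and the file name and helpers are made up for illustration): record the torch version the extensions were built against, and flag a mismatch at import time instead of segfaulting later.

```python
import json
import os

BUILD_RECORD = "apex_build_env.json"  # hypothetical marker file

def record_build_env(torch_version, path=BUILD_RECORD):
    """Build time: remember which torch version we compiled against."""
    with open(path, "w") as f:
        json.dump({"torch_version": torch_version}, f)

def check_build_env(current_torch_version, path=BUILD_RECORD):
    """Import time: return 'ok', 'stale', or 'unknown'."""
    if not os.path.exists(path):
        return "unknown"
    with open(path) as f:
        built_against = json.load(f)["torch_version"]
    return "ok" if built_against == current_torch_version else "stale"

record_build_env("1.0.0")
print(check_build_env("1.0.1"))  # binaries built under 1.0.0 are "stale" on 1.0.1
```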
@mrdbourke I think you may have compiled apex without cuda support. You need to compile it with python setup.py install --cpp_ext --cuda_ext.
I couldn't install apex with plain pip, so I used your method:
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .
But even after installing this way I still get the segmentation fault.
@wyx518 Do you get the seg fault while running some python script using apex/amp, or during the install?
Either way, could you post the complete error message with the stack trace so that we can have a look?
@ptrblck First, I ran my own demo using pytorch-pretrained-BERT and got run.sh: line 3: 21713 Segmentation fault (core dumped). Then I ran the code @jjsjann123 offered and got the same segmentation fault.
I solved the problem: it was the GCC version. It should be 4.9+, but Ubuntu 14.04 ships 4.8.
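For anyone wanting to check this up front, here is a small sketch (parse_gcc_version is a hypothetical helper, not part of apex) that parses `gcc --version` output and compares it against the 4.9 minimum mentioned above. In practice you would feed it the output of subprocess.check_output(["gcc", "--version"], text=True):

```python
import re

def parse_gcc_version(version_output):
    """Extract (major, minor, patch) from `gcc --version` output.
    The banner format varies by distro, so this is best-effort."""
    m = re.search(r"(\d+)\.(\d+)\.(\d+)", version_output)
    if m is None:
        raise ValueError("could not parse gcc version")
    return tuple(int(x) for x in m.groups())

# Ubuntu 14.04's default gcc is too old for apex's CUDA extensions:
old = parse_gcc_version("gcc (Ubuntu 4.8.4-2ubuntu1~14.04.4) 4.8.4")
print(old >= (4, 9, 0))  # False
```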
@mcarilli Could you please tell me how to find "apex_repo_dir" so I can cd apex_repo_dir? I've been looking for it but can't figure it out. Thanks.
Upgraded to CUDA 10.0 and PyTorch 1.0.1, now I get a segmentation fault with Apex enabled.
I also get a segmentation fault with Apex enabled, CUDA 9.0 and PyTorch 1.1.0.
Running fp16 models via fairseq and getting a segmentation fault with pytorch 1.4.0, gcc/6.3.0, cuda/10.1.105
Is there a way to install apex on a Windows machine with "--cpp_ext" and "--cuda_ext"? At the moment I can't, and as far as I can tell that's a general issue on Windows?
After installing apex with the cuda extensions and running pytorch-pretrained-BERT, I get the following error in FusedLayerNormAffineFunction, apex/normalization/fused_layer_norm.py (line 21). Here are the shapes of my tensors:
I'm not sure if it's a problem with pytorch-pretrained-BERT calling it incorrectly or a bug in apex. Any idea? I've also created an issue here. I'm running Ubuntu with CUDA 9, PyTorch 0.4.1.
Full stacktrace below.