Closed delucca closed 1 year ago
Hi! That's certainly frustrating. At least it looks like the model got trained well, considering you can run it locally, so we just need to figure out what's going on with the cloud environment. Have you compared the output of `pip list` between your local machine & the cloud env, and seen whether anything jumps out at you? I would start with comparing versions of `spacy`, `torch`, `numpy`, CUDA, and `cupy-cuda`.
Interestingly, if the stack trace is to be believed, it looks like the segmentation fault occurs during the very first `torch` import. Just to rule this out: can you run a quick script on the VM which doesn't use `spacy` but imports `torch`, and see if that succeeds?
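A minimal sketch of both checks (file names here are just placeholders):

```shell
# Dump the installed packages on each machine and diff the two lists.
pip list > local-packages.txt   # run on your local machine
pip list > vm-packages.txt      # run on the cloud VM
diff local-packages.txt vm-packages.txt

# Quick "does a bare torch import succeed?" check on the VM:
python -c "import torch; print(torch.__version__)"
```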
Thanks for your reply 😄
Yeah, after I found that stack trace I noticed it was related to Torch. The machine I was using was a GCP Deep Learning VM, and I noticed it already had Torch installed (v1.3).
So, since I wasn't using Docker (the idea was just to test the model remotely), I installed all the dependencies directly on the VM, and maybe having two Torch versions caused this.
Either way, I created a Docker image, and with that image I was able to run the model on the same machine.
Feel free to close the issue, or, if you want, I can provide additional context for further debugging.
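For anyone hitting the same thing, a quick way to check which Torch installation Python would actually pick up (a sketch; it also handles the case where `torch` isn't importable at all):

```python
# Report where "torch" would be imported from. With two copies installed
# (e.g. the VM image's preinstalled v1.3 plus a later pip install),
# Python silently picks whichever comes first on sys.path, and a
# mismatched pair can crash at import time.
import importlib.util

spec = importlib.util.find_spec("torch")
if spec is None:
    print("torch is not importable in this environment")
else:
    print("torch would be imported from:", spec.origin)
```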
Ah cool, thanks for reporting back, and happy to hear you got things resolved! I'll go ahead and close this :-)
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
What is happening
I've trained a large NER model using a GPU (using the `trf` model as base). The model's folder is ~600 MB in size after training. The training happened in a cloud VM. I've downloaded the final model to my machine and I'm able to execute it locally. But when I send it to a cloud VM for predictions and load the model, I receive the following message: `Segmentation fault`.
I've already tried to re-upload it many times (in case some files were corrupted), but nothing works.
This is the faulthandler message we get from it:
Your Environment