Closed delucca closed 1 year ago
Hi! That's certainly frustrating. At least it looks like the model got trained well, considering you can run it locally, so we just need to figure out what's going on with the cloud environment. Have you compared the output of `pip list` between your local machine & the cloud env, and seen whether anything jumps out at you? I would start with comparing versions of `spacy`, `torch`, `numpy`, CUDA, and `cupy-cuda`.
Interestingly, if the stack trace is to be believed, it looks like the segmentation fault occurs during the very first `torch` import. Just to rule this out: can you run a quick script on the VM which doesn't use `spacy` but imports `torch`, and see if that succeeds?
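A minimal sketch of both checks (file names here are just placeholders):

```shell
# Dump the installed packages on each machine and diff the two lists.
pip list > local-packages.txt   # run on your local machine
pip list > vm-packages.txt      # run on the cloud VM
diff local-packages.txt vm-packages.txt

# Quick "does a bare torch import succeed?" check on the VM:
python -c "import torch; print(torch.__version__)"
```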
Thanks for your reply 😄
Yeah, after I found that stack trace I noticed it was related to Torch. The machine I was using was a GCP Deep Learning VM, and I noticed it already had Torch installed (v1.3).
So, since I wasn't using Docker (the idea was just to test the model remotely), I installed all the dependencies directly on the VM, and maybe having two Torch versions caused this.
Either way, I created a Docker image, and with that image I was able to run the model on the same machine.
Feel free to close the issue, or, if you want, I can provide additional context for further debugging.
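For anyone hitting the same thing, a quick way to check which Torch installation Python would actually pick up (a sketch; it also handles the case where `torch` isn't importable at all):

```python
# Report where "torch" would be imported from. With two copies installed
# (e.g. the VM image's preinstalled v1.3 plus a later pip install),
# Python silently picks whichever comes first on sys.path, and a
# mismatched pair can crash at import time.
import importlib.util

spec = importlib.util.find_spec("torch")
if spec is None:
    print("torch is not importable in this environment")
else:
    print("torch would be imported from:", spec.origin)
```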
Ah cool, thanks for reporting back, and happy to hear you got things resolved! I'll go ahead and close this :-)
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
What is happening
I've trained a large NER model using a GPU (using the `trf` model as base). The model's folder is ~600 MB in size after training. The training happened in a cloud VM. I've downloaded the final model to my machine and I'm able to execute it locally. But when I send it to a cloud VM for predictions and load the model, I receive the following message: `Segmentation fault`.
I've already tried to re-upload it many times (in case some files were corrupted), but nothing works.
This is the faulthandler message we get from it:
Your Environment