explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Random 'Segmentation fault (core dumped)' error when training for long spancat #13026

Open belalsalih opened 1 year ago

belalsalih commented 1 year ago

Hi, I am getting a 'Segmentation fault (core dumped)' error when training a model for long SpanCat spans. I know this error can be related to OOM issues, but that does not seem to be the case here. I reduced [nlp] batch_size and [training.batcher.size], as shown in the attached config file, and used a VM with a very large amount of RAM to make sure we are not running out of memory. During training, the VM's memory usage never goes above 40%. Even when I reduce the [components.spancat.suggester] min_size and max_size, memory usage does not exceed 20%, but training still exits with 'Segmentation fault (core dumped)'.

Note: when training with low [components.spancat.suggester] values, the training completes, but F, P and R are all zero.
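As a rough illustration of why the suggester range matters here, the sketch below assumes an n-gram range style suggester that proposes every contiguous token span whose length lies between min_size and max_size. The function names and the example numbers are hypothetical and are not taken from the attached config. It counts how many candidate spans one document produces, and what fraction of gold spans the range can cover at all. Low coverage is one possible reason for all-zero scores with a small range, though the thread does not confirm it.

```python
from typing import Iterable


def count_ngram_candidates(n_tokens: int, min_size: int, max_size: int) -> int:
    """Count contiguous token spans with a length in [min_size, max_size]."""
    total = 0
    for size in range(min_size, max_size + 1):
        # A document of n tokens contains n - size + 1 spans of a given size
        # (and none at all once size exceeds n).
        total += max(0, n_tokens - size + 1)
    return total


def gold_coverage(gold_lengths: Iterable[int], min_size: int, max_size: int) -> float:
    """Fraction of gold spans whose token length falls inside the suggester range."""
    lengths = list(gold_lengths)
    if not lengths:
        return 0.0
    covered = sum(1 for n in lengths if min_size <= n <= max_size)
    return covered / len(lengths)


# Hypothetical numbers: a 1000-token document and a wide suggester range.
print(count_ngram_candidates(1000, 1, 100))
# Hypothetical gold span lengths checked against a narrow range of 1..10 tokens.
print(gold_coverage([2, 5, 40, 60], 1, 10))
```

The candidate count grows roughly with document length times the width of the range, so widening max_size to fit long spans multiplies the number of spans scored per document.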

This is the command I am using for training:

python -m spacy train config_spn.cfg --output ./output_v3_lg_1.3 --paths.train ./spacy_models_v3/train_data.spacy --paths.dev ./spacy_models_v3/test_data.spacy --code functions.py -V

This is the training output:

[2023-09-28 09:25:08,461] [DEBUG] Config overrides from CLI: ['paths.train', 'paths.dev']
ℹ Saving to output directory: output_v3_lg_1.3
ℹ Using CPU

=========================== Initializing pipeline ===========================
[2023-09-28 09:25:08,610] [INFO] Set up nlp object from config
[2023-09-28 09:25:08,618] [DEBUG] Loading corpus from path: spacy_models_v3/test_data.spacy
[2023-09-28 09:25:08,618] [DEBUG] Loading corpus from path: spacy_models_v3/train_data.spacy
[2023-09-28 09:25:08,619] [INFO] Pipeline: ['tok2vec', 'spancat']
[2023-09-28 09:25:08,621] [INFO] Created vocabulary
[2023-09-28 09:25:09,450] [INFO] Added vectors: en_core_web_lg
[2023-09-28 09:25:09,450] [INFO] Finished initializing nlp object
[2023-09-28 09:25:16,150] [INFO] Initialized pipeline components: ['tok2vec', 'spancat']
✔ Initialized pipeline

============================= Training pipeline =============================
[2023-09-28 09:25:16,158] [DEBUG] Loading corpus from path: spacy_models_v3/test_data.spacy
[2023-09-28 09:25:16,159] [DEBUG] Loading corpus from path: spacy_models_v3/train_data.spacy
ℹ Pipeline: ['tok2vec', 'spancat']
ℹ Initial learn rate: 0.001

E    #     LOSS TOK2VEC  LOSS SPANCAT  SPANS_SC_F  SPANS_SC_P  SPANS_SC_R  SCORE
0    0         98109.47      19535.08        0.00        0.00        4.58   0.00
0    200         528.73        781.51        0.00        0.00        3.75   0.00
Segmentation fault (core dumped)

Environment:

Operating System: Ubuntu 20.04.6 LTS
Python Version Used: 3.8.10
spaCy Version Used: 3.6.0

config_spn.cfg.txt

Thanks in advance!

shadeMe commented 1 year ago

A segmentation fault shouldn't be happening under any circumstances. Could you post the output of the following command?

pip list

Furthermore, I'd appreciate it if you could try the following for me:

belalsalih commented 1 year ago

Thanks for the reply. I created a new venv with only spaCy, but I still get the same error, so this is not related to the installed pip packages. I am using a small data sample (300 docs) for training and validation.

One thing I noticed: changing the width in [components.tok2vec.model.encode] from the default 96 to 128 lets the training command complete one iteration before crashing, while changing the value back to 96 makes the command fail without completing any iterations.

Attached debug data output FYR. debug_data.txt

pip list output:

Package             Version
------------------- ---------
attrs               23.1.0
azure-core          1.28.0
azure-storage-blob  12.17.0
blis                0.7.9
catalogue           2.0.8
certifi             2023.5.7
cffi                1.15.1
charset-normalizer  3.2.0
click               8.1.5
confection          0.1.0
contourpy           1.1.0
cryptography        41.0.2
cycler              0.11.0
cymem               2.0.7
en-core-web-lg      3.6.0
en-core-web-sm      3.6.0
fonttools           4.41.1
fuzzysearch         0.7.3
fuzzywuzzy          0.18.0
idna                3.4
importlib-resources 6.0.0
isodate             0.6.1
Jinja2              3.1.2
joblib              1.3.1
kiwisolver          1.4.4
langcodes           3.3.0
Levenshtein         0.21.1
MarkupSafe          2.1.3
matplotlib          3.7.2
murmurhash          1.0.9
numpy               1.24.4
packaging           23.1
pandas              2.0.3
pathy               0.10.2
Pillow              10.0.0
pip                 23.2
pkg_resources       0.0.0
preshed             3.0.8
pycparser           2.21
pydantic            1.10.11
pyodbc              4.0.39
pyparsing           3.0.9
python-dateutil     2.8.2
python-Levenshtein  0.21.1
pytz                2023.3
rapidfuzz           3.2.0
regex               2023.8.8
requests            2.31.0
scikit-learn        1.3.0
scipy               1.10.1
setuptools          68.0.0
six                 1.16.0
sklearn             0.0.post7
smart-open          6.3.0
spacy               3.6.0
spacy-legacy        3.0.12
spacy-loggers       1.0.4
srsly               2.4.6
thefuzz             0.20.0
thinc               8.1.10
threadpoolctl       3.2.0
tqdm                4.65.0
typer               0.9.0
typing_extensions   4.7.1
tzdata              2023.3
urllib3             2.0.3
wasabi              1.1.2
wheel               0.40.0
zipp                3.16.2

shadeMe commented 1 year ago

Thanks for the info - We'll investigate.

belalsalih commented 11 months ago

To anyone facing this issue: I used NER instead of SpanCat and had no issues. For overlapping spans, I trained one model to extract the high-level details and separate models to extract sub-details from complex data. I still believe SpanCat would be the right way to do it if it worked as intended.

Regards.

shadeMe commented 11 months ago

Hi, can you share the training/dev data and the custom code you were using to train the SpanCat model? We'd need that to reproduce the crash and debug the issue.

belalsalih commented 11 months ago

Hi, I ran into this issue while building a CV parser for our clients, so unfortunately we cannot share the data, since it consists of live applicant data. We are not using any custom code to train the model; we generate the training data and save training.spacy/dev.spacy on the fly. The same data that causes this error works fine with NER instead of SpanCat, so I don't think this is a data issue, as you can see in the debug data shared earlier. You can also check the discussion thread related to this issue, #13012.

Regards.

shadeMe commented 11 months ago

That's understandable. The issue is likely a bug in the SpanCat component's code, but we still need to consistently reproduce the crash in order to identify the cause and fix it. If you run into this issue in the future where you can share the data that triggers the crash, please let us know.
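When the data cannot be shared, a Python-level traceback captured at the moment of the crash may still help narrow down where it happens. The sketch below is not something the maintainers asked for in this thread. It only assumes the standard `-X faulthandler` interpreter option, which dumps the Python traceback to stderr on a fatal signal such as SIGSEGV. `build_train_command` is a hypothetical helper, and the paths are reused from the training command quoted earlier.

```python
import sys
from typing import List, Sequence


def build_train_command(config_path: str, extra_args: Sequence[str] = ()) -> List[str]:
    """Build a `spacy train` invocation with Python's faulthandler enabled.

    `-X faulthandler` makes the interpreter write the Python traceback to
    stderr when the process receives a fatal signal such as SIGSEGV.
    """
    return [
        sys.executable, "-X", "faulthandler",
        "-m", "spacy", "train", config_path,
        *extra_args,
    ]


# Hypothetical usage, reusing the paths from the command earlier in the thread.
command = build_train_command(
    "config_spn.cfg",
    [
        "--output", "./output_v3_lg_1.3",
        "--paths.train", "./spacy_models_v3/train_data.spacy",
        "--paths.dev", "./spacy_models_v3/test_data.spacy",
    ],
)
print(" ".join(command))
# To actually launch training: subprocess.run(command, check=False)
```

The traceback shows only the Python frames that were active at the time of the crash, so it points at the calling code rather than at the faulting native routine itself.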