explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Segmentation fault if text too long or doc parsed twice #1826

Closed sanjeeku closed 5 years ago

sanjeeku commented 6 years ago

I just installed spaCy 2.0.5 on Python 3.6.4 (via Anaconda). I also installed the default model ('en'). spaCy segfaults when I try to process my text file (it is about 2 MB in size).

Here's the code that reproduces it:

Python 3.6.4 |Anaconda, Inc.| (default, Dec 21 2017, 21:42:08)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import spacy
>>> nlp = spacy.load('en')
>>> with open('wf.txt') as f:
...     text = f.read()
...     doc = nlp(text)
...
Segmentation fault (core dumped)

I tried with another text file (slightly larger though) with the same result.

Info about spaCy

Is there any other information I can provide to troubleshoot this seg fault?

sanjeeku commented 6 years ago

An update: I tried on a new/clean aws instance where I installed Spacy differently (using conda forge). I still got the same seg fault.

Here's the environment info:

(py3) ubuntu@ip-172-31-16-211:~$ python -m spacy info --markdown

Info about spaCy

sanjeeku commented 6 years ago

Further update: I had an old conda env with spaCy 1.9.0 installed, and both text files parsed without problems there. So the segfault only affects spaCy 2.0 or later (I have tested 2.0.4 and 2.0.5).

godelstheory commented 6 years ago

I am experiencing a similar issue, though in my case it occurs when calling the English language model's parse method. It affects fewer than 1% of the documents in a corpus of 250K, and I have yet to determine the root cause. An example paragraph is shown below.

Similarly, the problem occurs in 2.0.5, but is not present in 1.9.0. I have reproduced this across multiple machines.

import spacy

nlp  = spacy.load('en')
text = u'''This will be a prospective study in patients aged between 18 and 70 years old who have already been screened and planned for elective bariatric surgery. In bariatric surgery, a large portion of the stomach will be removed. Pneumoperitoneum is also known as the abdominal pressure which will be the experimental aspect in this study. Laparoscopy surgery will be performed by introducing the camera (optical trocar) after making an incision at the belly button (umbilicus), and carbon dioxide which will be given at a rate of 5 L/min until the intra abdominal pressure of either 8 10 mmHg (low pressure group) or 12 15 mmHg (standard pressure group) is achieved. The remaining three standard ports will be placed and the laparoscopic sleeve gastrectomy will be performed at an insufflation rate of 15 L/min. The greater omentum will be divided at the greater curvature of the stomach using an ultrasonic dissector, beginning from the proximal antrum until the fundus. The omentum will be divided close to the stomach wall hence preserving the gastro epiploic vessels. Short gastric vessels will be divided entirely from the stomach and this dissection will continue until the left crus of the diaphragm are exposed. Endoscopic staplers will then be used to staple and divide the stomach until the angle of His. A 39Fr gastric calibration tube will be placed along the lesser curvature of the stomach, acts as a guide during the division of the stomach. Finally, the divided stomach will be removed through a 12mm port site and the incision will be closed with sutures. Towards the end of the surgery, all residual pneumoperitoneum will be evacuated by keeping the trocar valves open under direct telescopic vision. The duration of surgery or any intraoperative complications will be recorded. The starting of surgery will be regarded after the induction of anaesthesia and the end of surgery is regarded when the end of skin closure. 
Operating field or also known as surgical view is defined as the view of the intra abdomen. A clear operating field allows a good working space for the surgeon. Numeric rating score will be used to access the operating field during the surgery. Post operative pain will be rated on a Visual Analog Scale at rest and with movement.'''

doc = nlp(text) # works
nlp.parse(doc) # Segmentation fault/core dump
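
When a crash hits only a small fraction of a large corpus, one way to find the offending documents is to parse each one in its own child process and check how that process exited. The following is a minimal sketch under stated assumptions, not something from this thread: `crashed_with_segfault` and `worker_argv` are hypothetical helper names, the script passed to `worker_argv` is a hypothetical file that loads spaCy and parses the document at the given path, and the return-code check assumes a POSIX system.

```python
import signal
import subprocess
import sys
from typing import List


def crashed_with_segfault(argv: List[str], timeout: float = 60.0) -> bool:
    """Run `argv` as a child process and report whether it died from SIGSEGV."""
    # stdout is discarded; stderr is inherited so any crash traceback stays visible.
    result = subprocess.run(argv, timeout=timeout, stdout=subprocess.DEVNULL)
    # On POSIX, a return code of -N means the child was killed by signal N.
    return result.returncode == -signal.SIGSEGV


def worker_argv(script: str, path: str) -> List[str]:
    """Build the command for one document.

    `script` is a hypothetical parsing script and `path` the file it should read.
    `-X faulthandler` asks the interpreter to dump a Python traceback on a crash.
    """
    return [sys.executable, "-X", "faulthandler", script, path]
```

Looping over the corpus and collecting every path for which `crashed_with_segfault(worker_argv(script, path))` is true would give a list of reproducers without the parent process itself going down.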

sanjeeku commented 6 years ago

Confirming that this bug continues to exist in spaCy v2.0.7.

@honnibal - I am happy to privately send you the text files on which it is bombing. Please let me know where to send them.

honnibal commented 6 years ago

@sanjeeku Thanks, could you mail to matt@explosion.ai ?

sanjeeku commented 6 years ago

@honnibal -- Just emailed the text file.

honnibal commented 6 years ago

@sanjeeku Thanks for the text, finally got to this.

Your issue is simply that the text is too long. This is rather frustrating: I wish we used less temporary memory per word than the neural networks currently do. However, I don't see a way around this without significantly impacting performance.

I've added an error message and an option on the Language class to note the problem. In your case, the solution is very simple: just process each line individually.
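
The line-by-line workaround suggested above could look roughly like the sketch below. The helper names `iter_nonempty_lines` and `process_by_line` are hypothetical and not part of spaCy; `nlp` is assumed to be any callable that turns a string into a processed document, such as the object returned by `spacy.load('en')`.

```python
from typing import Callable, Iterator, TypeVar

D = TypeVar("D")


def iter_nonempty_lines(text: str) -> Iterator[str]:
    """Yield each non-blank line of `text`, stripped of surrounding whitespace."""
    for line in text.splitlines():
        line = line.strip()
        if line:
            yield line


def process_by_line(nlp: Callable[[str], D], text: str) -> Iterator[D]:
    """Run `nlp` on one line at a time so no single call sees the whole file."""
    for line in iter_nonempty_lines(text):
        yield nlp(line)
```

With a loaded model this would be used as `docs = list(process_by_line(nlp, text))`. The trade-off is that each line becomes its own document, so the results have to be combined by the caller, and a sentence that wraps across a line break is split in two.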

@godelstheory Your problem is different. I think it comes from parsing the text twice. This shouldn't cause a segfault, but as a workaround, can you avoid doing that for now? You can verify that double-parsing is the problem by changing the first line to doc = nlp.make_doc(text).
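
To make the "parse only once" idea concrete, here is a minimal sketch that tokenizes with `make_doc` and then applies each pipeline component a single time. `parse_once` is a hypothetical helper name, and the code assumes that `nlp.pipeline` is a list of `(name, component)` pairs whose components take a document and return it, as in spaCy 2.x.

```python
from typing import Any


def parse_once(nlp: Any, text: str) -> Any:
    """Tokenize `text`, then apply each pipeline component exactly once."""
    # make_doc only tokenizes; it does not run the tagger, parser or NER.
    doc = nlp.make_doc(text)
    # Assumption: nlp.pipeline is a list of (name, component) pairs (spaCy 2.x).
    for _name, component in nlp.pipeline:
        doc = component(doc)
    return doc
```

Keeping tokenization and the statistical components in separate steps like this makes it harder to pass an already parsed document through the parser a second time by accident.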
