explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Processing a text causes spaCy to hang #4193

Closed dhwani2410 closed 5 years ago

dhwani2410 commented 5 years ago

How to reproduce the behaviour

import re
import spacy

nlp = spacy.load("en_core_web_sm")
doc_temp = data_dict[pmid]  # data_dict and pmid are defined elsewhere in the original script
doc1 = re.sub('[^A-Za-z0-9 *-,]+', '', doc_temp)
doc = nlp(str(doc1))

This doesn't work. There are also many other cases where it does not work.

doc1 'A placebo controlled observer blind immunocytochemical and histologic study of epithelium adjacent to anogenital warts in patients treated with systemic interferon alpha in combination with cryotherapy or cryotherapy alone OBJECTIVETo examine biopsy specimens of tissue immediately adjacent to anogenital AG warts which had been treated with either cryotherapy plus subcutaneous interferon IFN alpha 2a or cryotherapy alone, for histological features of a human papilloma virus HPV infection b localised cellular immune responses, to further characterise any cellular immune infiltrates with tissue immunocytochemistry, and to relate any histological, immunocytochemical findings to the treatment response of nearby AG wartsDESIGNA randomised placebo controlled observer blind studySETTINGGenitourinary Medicine clinic, Department of Immunopathology, Royal Victoria Hospital, Belfast, N IrelandSUBJECTSThirty patients with AG warts 16 treated with IFN alpha 2a plus cryotherapy, and 14 treated with cryotherapy aloneOUTCOME MEASURES1 Light microscopic features associated with HPV infection and local cellular immune responses 2 Indirect immunofluorescence detection of the following cell surface markers HLA DR, alpha one antitrypsin, CD1, CD3, CD4, CD8, CD22 3 Clinical response of AG warts to treatmentRESULTSIn pretreatment biopsies only non specific indicators of HPV infection acanthosis, 2930 biopsies, and hyperkeratosis, 730 biopsies were seen on light microscopy Mononuclear cells were seen both throughout the upper dermis and centred around dermal blood vessels in 1930 633 biopsies, and infiltrating into the epidermis in 1230 40 biopsies On indirect immunofluorescence CD3, CD8, CD4 antigen was detected on the surface of cells throughout the upper dermis in 2429 827, 1529 517, and 329 103, of biopsy specimens respectively CD3 antigen, CD8 antigen and CD4 antigen was detected on the surface of cells infiltrating into the epidermis in 1829 62, 729 241, and 629 207 of biopsy 
specimens respectively CD1 antigen was seen on the surface of dendritic cells throughout the epidermis in all specimens CD1 positive cells infiltrated into the upper dermis in 529 172 HLA DR was detected on the surface of dendritic cells throughout the epidermis in 2229 759 of specimens, and on the surface of cells scattered both diffusely throughout the upper dermis and centred around dermal blood vessels in all specimens Alpha one antitrypsin A1AT antigen was seen on the surface of cells in the upper dermis in 629 207 of biopsy specimens no cells expressing CD22 surface antigen were seen The nature of this local cellular immune response was not altered by treatment of nearby warts with either cryotherapy alone or cryotherapy plus systemic IFN alpha 2a, or related to the therapeutic outcome of these wartsCONCLUSIONS1 No convincing histological evidence of HPV infection was seen in epithelium surrounding AG warts 2 A predominantly T cellmediated immune response the target of which is uncertain was seen in this perilesional epithelium 3 In the dosage regimens used in this study, treatment of AG warts with either systemic IFN alpha 2a plus cryotherapy or cryotherapy alone did not appear to augment localised cellular immune responses against any presumed subclinical HPV infection in epithelium surrounding AG warts'
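As a side note on the preprocessing step, here is a minimal sketch of what that character class keeps and drops. It is not necessarily related to the hang. Inside the brackets, `*-,` is read as a range from `*` to `,` (so `*`, `+` and `,` are kept), while periods, colons, newlines and hyphens are removed, which plausibly explains fused tokens such as "wartsDESIGNA" and "cellmediated" in the text above. The helper name `clean` is hypothetical.

```python
import re

# Same pattern as in the snippet above. Inside the character class,
# "*-," is a range covering "*", "+" and ",", not three literals
# that include a hyphen.
PATTERN = re.compile(r'[^A-Za-z0-9 *-,]+')


def clean(text):
    """Drop every character that the pattern does not whitelist."""
    return PATTERN.sub('', text)


# Period, newline and colon are removed, so the words on either side fuse.
print(clean("warts.\nDESIGN: A"))
# "+", "*" and "," survive; the hyphen is removed.
print(clean("a+b*c,d-e"))
```

If sentence boundaries matter for the parser, replacing the unwanted characters with a space instead of an empty string would keep the words apart.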

Info about spaCy

BreakBB commented 5 years ago

Could you provide some more information about your issue? Do you get any error message? What are you trying to achieve?

dhwani2410 commented 5 years ago

There are no errors, but it produces no output even if I run it for an hour. On the contrary, if I just remove the last line (which doesn't seem any different to me), I get an instant result.

The same thing happened with a few more examples. Can you try running it and see if it works?

ines commented 5 years ago

I'm still not 100% sure I understand the question. So when you call nlp on your text, it gets stuck?

I also noticed you're using an alpha pre-release version, 2.1.0a13. That also makes it kind of unpredictable. Could you try upgrading to the latest version of spaCy?

dhwani2410 commented 5 years ago

I have updated my spaCy:

(base) [dhwani.dholakia@hpc ~]$ python -m spacy info

============================== Info about spaCy ==============================

spaCy version    2.1.8
Location         /home/dhwani.dholakia/anaconda3/lib/python3.7/site-packages/spacy
Platform         Linux-2.6.32-279.el6.x86_64-x86_64-with-redhat-6.7-Santiago
Python version   3.7.3
Models

Still, it doesn't work. Are there any limitations on the number of lines that can be input to spaCy? If I take out the last three or four lines, it works.

When I say it doesn't work, I mean that it doesn't throw any error; it just hangs, running indefinitely with no output.

ines commented 5 years ago

> when I say it doesn't work means it doesn't throw any error it just gets kind of hanged in running for an infinite time with no output.

Yes, that's exactly what I was asking – thanks for the clarification. Providing information like this is really important. If you just post that something "doesn't work", it's very difficult for people to help.

> still, it doesn't work. Are there any limitations in the no of lines that can be input to spacy?

Not really, no – there is currently a limit of 1 million characters per Doc to prevent the neural network models from running out of memory. But your examples are nowhere near close to that.
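For texts that do approach that limit, a minimal sketch of splitting the input on whitespace before parsing might look like the following. The helper `chunk_text` is hypothetical and not part of spaCy; the default of 1,000,000 mirrors the limit mentioned above, which spaCy exposes as `nlp.max_length`.

```python
def chunk_text(text, max_chars=1_000_000):
    """Split text on whitespace into chunks of at most max_chars characters.

    A single word longer than max_chars is kept whole, so that one chunk
    may exceed the limit.
    """
    chunks, current, size = [], [], 0
    for word in text.split():
        # One extra character for the joining space, except for the first word.
        extra = len(word) + (1 if current else 0)
        if current and size + extra > max_chars:
            chunks.append(" ".join(current))
            current, size = [word], len(word)
        else:
            current.append(word)
            size += extra
    if current:
        chunks.append(" ".join(current))
    return chunks


print(chunk_text("aa bb cc", max_chars=5))
```

Each chunk could then be passed to `nlp` separately. Splitting on whitespace can cut a sentence in half, so splitting on paragraph boundaries is usually preferable when the text has them.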

I just tried it by calling nlp on the text you shared above and I can't reproduce the issue. It takes under a second for me to parse the text. But you can confirm that calling nlp(text) using that text causes spaCy to hang, yes?
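One way to confirm a hang without waiting for an hour is to run the call in a separate Python process with a timeout. The sketch below uses only the standard library; the helper name `finishes_within` and the example snippets passed to it are hypothetical.

```python
import subprocess
import sys


def finishes_within(code, timeout_s):
    """Run `code` in a fresh Python process.

    Returns True if the process exits successfully within timeout_s seconds,
    and False if it had to be killed because it ran longer than that.
    A non-zero exit status raises subprocess.CalledProcessError instead.
    """
    try:
        subprocess.run(
            [sys.executable, "-c", code],
            timeout=timeout_s,
            check=True,
            capture_output=True,
        )
    except subprocess.TimeoutExpired:
        return False
    return True


# Stand-in snippets: one that returns immediately and one that "hangs".
print(finishes_within("print('ok')", timeout_s=30))
print(finishes_within("import time; time.sleep(60)", timeout_s=1))
```

To test the actual case, the code string would load the model and call `nlp` on the problematic text, with a generous timeout such as 60 seconds.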

dhwani2410 commented 5 years ago

This is my system memory:

(base) [dhwani.dholakia@hpc Final_Scripts]$ free -h
                    total   used   free   shared   buffers   cached
Mem:                  62G    32G    30G      50M      1.4G      20G
-/+ buffers/cache:           10G    51G
Swap:                127G   296M   127G

Could you please specify any system requirements for this to run? When I run it on your website (https://spacy.io/) in the "Edit the code & try spaCy" widget (spaCy v2.1.8 · Python 3 · via Binder), it works, but not on my system.

Is it possible to have memory issue if it doesn't throw an error?

What is strange is that this works if I just remove the last line:

nlp('A placebo controlled observer blind immunocytochemical and histologic study of epithelium adjacent to anogenital warts in patients treated with systemic interferon alpha in combination with cryotherapy or cryotherapy alone OBJECTIVETo examine biopsy specimens of tissue immediately adjacent to anogenital AG warts which had been treated with either cryotherapy plus subcutaneous interferon IFN alpha 2a or cryotherapy alone, for histological features of a human papilloma virus HPV infection b localised cellular immune responses, to further characterise any cellular immune infiltrates with tissue immunocytochemistry, and to relate any histological, immunocytochemical findings to the treatment response of nearby AG wartsDESIGNA randomised placebo controlled observer blind studySETTINGGenitourinary Medicine clinic, Department of Immunopathology, Royal Victoria Hospital, Belfast, N IrelandSUBJECTSThirty patients with AG warts 16 treated with IFN alpha 2a plus cryotherapy, and 14 treated with cryotherapy aloneOUTCOME MEASURES1 Light microscopic features associated with HPV infection and local cellular immune responses 2 Indirect immunofluorescence detection of the following cell surface markers HLA DR, alpha one antitrypsin, CD1, CD3, CD4, CD8, CD22 3 Clinical response of AG warts to treatmentRESULTSIn pretreatment biopsies only non specific indicators of HPV infection acanthosis, 2930 biopsies, and hyperkeratosis, 730 biopsies were seen on light microscopy Mononuclear cells were seen both throughout the upper dermis and centred around dermal blood vessels in 1930 633 biopsies, and infiltrating into the epidermis in 1230 40 biopsies On indirect immunofluorescence CD3, CD8, CD4 antigen was detected on the surface of cells throughout the upper dermis in 2429 827, 1529 517, and 329 103, of biopsy specimens respectively CD3 antigen, CD8 antigen and CD4 antigen was detected on the surface of cells infiltrating into 
the epidermis in 1829 62, 729 241, and 629 207 of biopsy specimens respectively CD1 antigen was seen on the surface of dendritic cells throughout the epidermis in all specimens CD1 positive cells infiltrated into the upper dermis in 529 172 HLA DR was detected on the surface of dendritic cells throughout the epidermis in 2229 759 of specimens, and on the surface of cells scattered both diffusely throughout the upper dermis and centred around dermal blood vessels in all specimens Alpha one antitrypsin A1AT antigen was seen on the surface of cells in the upper dermis in 629 207 of biopsy specimens no cells expressing CD22 surface antigen were seen The nature of this local cellular immune response was not altered by treatment of nearby warts with either cryotherapy alone or cryotherapy plus systemic IFN alpha 2a, or related to the therapeutic outcome of these wartsCONCLUSIONS1 No convincing histological evidence of HPV infection was seen in epithelium surrounding AG warts 2 A predominantly T cellmediated immune response the target of which is uncertain was seen in this perilesional epithelium 3 In the dosage regimens used in this study, treatment of AG warts with either systemic IFN')

However, if I add more words to this, it doesn't work.

ines commented 5 years ago

Your specs seem fine. And if the test case runs on the Binder/hosted Jupyter kernel used by the demos on the site, there's definitely no problem with it.

> Is it possible to have memory issue if it doesn't throw an error?

It's unlikely you're running out of memory – at least, if that's what's happening, it'd be pretty obvious, because the rest of your machine would be running out of memory, too.

Very basic suggestion, but have you tried setting up a fresh environment from scratch?

Also, could you try setting export MKL_NUM_THREADS=1? In case you're hitting #3820.
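If exporting the variable in the shell is inconvenient (for example on a shared cluster), a minimal sketch of the same setting from inside Python follows. It assumes the variable is read when MKL is first loaded, so it has to be set before NumPy or spaCy is imported in that process; the helper name `limit_mkl_threads` is hypothetical.

```python
import os


def limit_mkl_threads(n=1):
    """Set MKL_NUM_THREADS for this process; call before importing spaCy."""
    os.environ["MKL_NUM_THREADS"] = str(n)


limit_mkl_threads(1)

# Import spaCy only after the variable is set, e.g.:
# import spacy
# nlp = spacy.load("en_core_web_sm")
print(os.environ["MKL_NUM_THREADS"])
```

Setting it in the shell with `export MKL_NUM_THREADS=1`, as suggested above, has the same effect and also covers any child processes.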

no-response[bot] commented 5 years ago

This issue has been automatically closed because there has been no response to a request for more information from the original author. With only the information that is currently in the issue, there's not enough information to take action. If you're the original author, feel free to reopen the issue if you have or find the answers needed to investigate further.
