allenai / scispacy

A full spaCy pipeline and models for scientific/biomedical documents.
https://allenai.github.io/scispacy/
Apache License 2.0

Chunking seems to not be working properly #326

Open elinantonsson opened 3 years ago

elinantonsson commented 3 years ago

Hello and thank you for creating this tool!

I have been trying to use noun_chunks with your pipelines, but it does not seem to be working correctly. I have tried en_core_sci_sm, en_core_sci_md and en_core_sci_lg. For example, when I input the sentence "CCR5(+) and CXCR3(+) T cells are increased in multiple sclerosis and their ligands MIP-1alpha and IP-10 are expressed in demyelinating brain lesions." I only get "CCR5(+", "CXCR3(+" and "T cells" as chunks, and I would expect more. Using spaCy's en_core_web_trf, for instance, I get "CCR5(+) and CXCR3(+) T cells", "multiple sclerosis", "their ligands", "MIP-1alpha", "IP-10" and "brain lesions".

Is the chunking supposed to work the same way as in spaCy's pipelines, or have I misinterpreted something?

Thank you in advance!

Best regards

dakinggg commented 3 years ago

I think this is due to differences in the dependency parse. Our dependency parser is more accurate on biomedical data, but its output differs from spaCy's, and spaCy's noun chunker is defined here (https://github.com/explosion/spaCy/blob/a59f3fcf5dab3acf5570483cc314b47cc5833f39/spacy/lang/en/syntax_iterators.py#L8) in terms of specific dependency relations. See the difference for your sentence below: for example, our parser labels "sclerosis" and "lesions" as nmod, while spaCy's labels them pobj, which is one of the relations the chunker looks for. Perhaps we should write our own noun chunker based on our dependency parser, but I am really not an expert in linguistics. In the meantime, you might get some mileage from adapting spaCy's noun chunker to the patterns you observe in our parser's output. Also, @DeNeutoy, do you have any thoughts about this?

In [14]: [(t.text, t.pos_, t.dep_) for t in sci_doc]
Out[14]: 
[('CCR5(+', 'NOUN', 'nsubjpass'),
 (')', 'PUNCT', 'punct'),
 ('and', 'CCONJ', 'cc'),
 ('CXCR3(+', 'NOUN', 'compound'),
 (')', 'PUNCT', 'punct'),
 ('T', 'NOUN', 'compound'),
 ('cells', 'NOUN', 'conj'),
 ('are', 'VERB', 'auxpass'),
 ('increased', 'VERB', 'ROOT'),
 ('in', 'ADP', 'case'),
 ('multiple', 'ADJ', 'amod'),
 ('sclerosis', 'NOUN', 'nmod'),
 ('and', 'CCONJ', 'cc'),
 ('their', 'PRON', 'nmod:poss'),
 ('ligands', 'NOUN', 'conj'),
 ('MIP-1alpha', 'NOUN', 'dep'),
 ('and', 'CCONJ', 'cc'),
 ('IP-10', 'NOUN', 'conj'),
 ('are', 'VERB', 'auxpass'),
 ('expressed', 'VERB', 'conj'),
 ('in', 'ADP', 'case'),
 ('demyelinating', 'VERB', 'amod'),
 ('brain', 'NOUN', 'compound'),
 ('lesions', 'NOUN', 'nmod'),
 ('.', 'PUNCT', 'punct')]

In [15]: [(t.text, t.pos_, t.dep_) for t in web_doc]
Out[15]: 
[('CCR5(+', 'NOUN', 'ROOT'),
 (')', 'PUNCT', 'punct'),
 ('and', 'CCONJ', 'cc'),
 ('CXCR3(+', 'PROPN', 'npadvmod'),
 (')', 'PUNCT', 'punct'),
 ('T', 'NOUN', 'compound'),
 ('cells', 'NOUN', 'nsubjpass'),
 ('are', 'AUX', 'auxpass'),
 ('increased', 'VERB', 'conj'),
 ('in', 'ADP', 'prep'),
 ('multiple', 'ADJ', 'amod'),
 ('sclerosis', 'NOUN', 'pobj'),
 ('and', 'CCONJ', 'cc'),
 ('their', 'PRON', 'poss'),
 ('ligands', 'NOUN', 'conj'),
 ('MIP-1alpha', 'PROPN', 'appos'),
 ('and', 'CCONJ', 'cc'),
 ('IP-10', 'NUM', 'conj'),
 ('are', 'AUX', 'auxpass'),
 ('expressed', 'VERB', 'conj'),
 ('in', 'ADP', 'prep'),
 ('demyelinating', 'VERB', 'amod'),
 ('brain', 'NOUN', 'compound'),
 ('lesions', 'NOUN', 'pobj'),
 ('.', 'PUNCT', 'punct')]
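
To make the difference concrete, here is a rough sketch in plain Python (no spaCy needed) that checks which nominal tokens in each parse carry a label the linked noun chunker accepts as a noun-phrase head. The triples are copied from the two outputs above, with punctuation and "and" left out. The names `NP_LABELS`, `NOMINAL_POS`, `candidate_heads` and `unmatched_labels` are made up for this illustration and are not part of spaCy or scispacy. It is also a simplification: the real chunker follows "conj" links back to a head, which needs head indices that the outputs above do not show, so "conj" tokens are simply counted as unmatched here.

```python
# Labels the linked syntax_iterators.py revision accepts on a noun-phrase head.
# Assumption: "conj" is ignored here because following it requires head indices.
NP_LABELS = {"nsubj", "dobj", "nsubjpass", "pcomp", "pobj",
             "dative", "appos", "attr", "ROOT"}
NOMINAL_POS = {"NOUN", "PROPN", "PRON"}

# (text, pos, dep) triples from Out[14] (scispacy parse).
SCI_PARSE = [
    ("CCR5(+", "NOUN", "nsubjpass"), ("CXCR3(+", "NOUN", "compound"),
    ("T", "NOUN", "compound"), ("cells", "NOUN", "conj"),
    ("are", "VERB", "auxpass"), ("increased", "VERB", "ROOT"),
    ("in", "ADP", "case"), ("multiple", "ADJ", "amod"),
    ("sclerosis", "NOUN", "nmod"), ("their", "PRON", "nmod:poss"),
    ("ligands", "NOUN", "conj"), ("MIP-1alpha", "NOUN", "dep"),
    ("IP-10", "NOUN", "conj"), ("are", "VERB", "auxpass"),
    ("expressed", "VERB", "conj"), ("in", "ADP", "case"),
    ("demyelinating", "VERB", "amod"), ("brain", "NOUN", "compound"),
    ("lesions", "NOUN", "nmod"),
]

# (text, pos, dep) triples from Out[15] (spaCy web model parse).
WEB_PARSE = [
    ("CCR5(+", "NOUN", "ROOT"), ("CXCR3(+", "PROPN", "npadvmod"),
    ("T", "NOUN", "compound"), ("cells", "NOUN", "nsubjpass"),
    ("are", "AUX", "auxpass"), ("increased", "VERB", "conj"),
    ("in", "ADP", "prep"), ("multiple", "ADJ", "amod"),
    ("sclerosis", "NOUN", "pobj"), ("their", "PRON", "poss"),
    ("ligands", "NOUN", "conj"), ("MIP-1alpha", "PROPN", "appos"),
    ("IP-10", "NUM", "conj"), ("are", "AUX", "auxpass"),
    ("expressed", "VERB", "conj"), ("in", "ADP", "prep"),
    ("demyelinating", "VERB", "amod"), ("brain", "NOUN", "compound"),
    ("lesions", "NOUN", "pobj"),
]


def candidate_heads(parse):
    """Nominal tokens whose dependency label is in NP_LABELS."""
    return [text for text, pos, dep in parse
            if pos in NOMINAL_POS and dep in NP_LABELS]


def unmatched_labels(parse):
    """Labels seen on nominal tokens that NP_LABELS does not cover."""
    return sorted({dep for _, pos, dep in parse
                   if pos in NOMINAL_POS and dep not in NP_LABELS})


if __name__ == "__main__":
    print("scispacy heads:", candidate_heads(SCI_PARSE))
    print("spaCy web heads:", candidate_heads(WEB_PARSE))
    print("scispacy labels not covered:", unmatched_labels(SCI_PARSE))
```

With the scispacy parse, only "CCR5(+" qualifies directly, while the spaCy parse yields "CCR5(+", "cells", "sclerosis", "MIP-1alpha" and "lesions". The uncovered scispacy labels (compound, conj, dep, nmod, nmod:poss) are a reasonable starting list when deciding which relations an adapted chunker might need to handle, with nmod being the obvious first candidate.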

dakinggg commented 3 years ago

Did you have any luck adapting the noun chunker?

elinantonsson commented 3 years ago

Sorry for my late response. Your answer was very helpful! I decided to try a different approach since I did not have enough time in my project to look for these patterns. Thank you very much!

annahmrichardson commented 3 years ago

Has anyone worked on a scispacy noun chunker? Thanks!