Biomedical named entities being treated as Cardinals

BLLIP / bllip-parser

BLLIP reranking parser (also known as Charniak-Johnson parser, Charniak parser, Brown reranking parser). See http://pypi.python.org/pypi/bllipparser/ for the Python module.


I am using the GENIA+PubMed model and parsing biomedical text.

A very frequent issue I have observed is that certain biomedical entities such as microRNAs (miRs) are being tagged as CD rather than NN.

Example sentence: Human micro-RNAs miR-223, miR-26b, miR-221, miR-103-1, miR-185, miR-23b, miR-203, miR-17-5p, miR-23a, and miR-205 were significantly up-regulated in bladder cancers.

(S1 (S (S (NP (NP (NP (JJ Human) (NNS micro-RNAs)) (QP (CD miR-223) (, ,) (CD miR-26b) (, ,) (CD miR-221) (, ,) (CD miR-103-1) (, ,) (CD miR-185) (, ,) (CD miR-23b) (, ,) (CD miR-203))) (, ,) (NP (NP (NN miR-17-5p)) (, ,) (NP (NN miR-23a)) (, ,) (CC and) (NP (NN miR-205)))) (VP (VBD were) (ADVP (RB significantly)) (VP (VBN up-regulated) (PP (IN in) (NP (JJ bladder) (NNS cancers)))))) (. .)))

I have a regular expression for identifying such miR named entities. Is there a way to force these entities to be recognized as NN, and thus as part of an NP instead of a QP?

Note: Since I am using these parse trees generated by BLLIP to get Universal Dependencies via the Stanford Typed Dependency Converter, the dependencies between these entities (incorrectly tagged as CD) are being identified incorrectly.

Yes, this is supported! You'll want to use parse_tagged() instead of parse():

You can also parse text with existing POS tags (these act as soft constraints). In this example, token 0 ('Time') should have tag VB and token 1 ('flies') should have tag NNS:

>>> rrp.parse_tagged(['Time', 'flies'], possible_tags={0 : 'VB', 1 : 'NNS'})[0]
ScoredParse('(S1 (NP (VB Time) (NNS flies)))', parser_score=-54.05083561918019, reranker_score=-15.079632500107973)

You don't need to specify a tag for all words: Here, token 0 ('Time') should have tag VB and token 1 ('flies') is unconstrained:

>>> rrp.parse_tagged(['Time', 'flies'], possible_tags={0 : 'VB'})[0]
ScoredParse('(S1 (S (VP (VB Time) (NP (VBZ flies)))))', parser_score=-54.34977155750189, reranker_score=-16.681734375725263)
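For the miR case in the question, a regular expression can be turned into a possible_tags dictionary along these lines. This is only a rough sketch: MIR_PATTERN and build_possible_tags are hypothetical names made up for illustration, the pattern is a guess at what the existing miR regex might look like, and the token list is a shortened, hand-tokenized version of the example sentence. Since the tags act as soft constraints, the parser may still choose a different analysis.

```python
import re

# Hypothetical pattern for miR names such as miR-223, miR-26b,
# miR-103-1 and miR-17-5p. Replace it with your own regular expression.
MIR_PATTERN = re.compile(r"miR-\d+[a-z]?(?:-\d+[a-z]?)*")


def build_possible_tags(tokens, pattern=MIR_PATTERN, tag="NN"):
    """Map the index of every token that fully matches `pattern` to `tag`."""
    return {i: tag for i, tok in enumerate(tokens) if pattern.fullmatch(tok)}


# Shortened, pre-tokenized version of the sentence from the question.
tokens = ["Human", "micro-RNAs", "miR-223", ",", "miR-26b", ",", "and",
          "miR-205", "were", "significantly", "up-regulated", "in",
          "bladder", "cancers", "."]

possible_tags = build_possible_tags(tokens)
print(possible_tags)

# With a loaded parser (assumed here to be named rrp, as in the examples
# above), the constraints would then be passed like this:
#   from bllipparser import RerankingParser
#   rrp = RerankingParser.fetch_and_load('GENIA+PubMed')
#   best = rrp.parse_tagged(tokens, possible_tags=possible_tags)[0]
```

Using fullmatch rather than search keeps tokens that merely contain a miR-like substring from being constrained by accident.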


Biomedical named entities being treated as Cardinals #54