BLLIP / bllip-parser

BLLIP reranking parser (also known as Charniak-Johnson parser, Charniak parser, Brown reranking parser) See http://pypi.python.org/pypi/bllipparser/ for Python module.
http://bllip.cs.brown.edu/

Biomedical named entities being treated as Cardinals #54

Closed samirgupta closed 7 years ago

samirgupta commented 7 years ago

I am using the GENIA+PubMed model and parsing biomedical text.

A very frequent issue I have observed is that certain biomedical entities such as microRNAs (miRs) are being tagged as CD rather than NN.

Example sentence: Human micro-RNAs miR-223, miR-26b, miR-221, miR-103-1, miR-185, miR-23b,miR-203, miR-17-5p, miR-23a, and miR-205 were significantly up-regulated in bladder cancers.

(S1 (S (S (NP (NP (NP (JJ Human) (NNS micro-RNAs)) (QP (CD miR-223) (, ,) (CD miR-26b) (, ,) (CD miR-221) (, ,) (CD miR-103-1) (, ,) (CD miR-185) (, ,) (CD miR-23b) (, ,) (CD miR-203))) (, ,) (NP (NP (NN miR-17-5p)) (, ,) (NP (NN miR-23a)) (, ,) (CC and) (NP (NN miR-205)))) (VP (VBD were) (ADVP (RB significantly)) (VP (VBN up-regulated) (PP (IN in) (NP (JJ bladder) (NNS cancers)))))) (. .)))

I have a regular expression for identifying such miR named entities. Is there a way to force these entities to be recognized as NN, and thus as part of an NP instead of a QP?

Note: I am using the parse trees generated by BLLIP to get Universal Dependencies with the Stanford Typed Dependency Converter, so the dependencies between these entities (incorrectly tagged as CD) are also being identified incorrectly.

dmcc commented 7 years ago

Yes, this is supported! You'll want to use parse_tagged() instead of parse():

You can also parse text with existing POS tags (these act as soft constraints). In this example, token 0 ('Time') should have tag VB and token 1 ('flies') should have tag NNS:

>>> rrp.parse_tagged(['Time', 'flies'], possible_tags={0 : 'VB', 1 : 'NNS'})[0]
ScoredParse('(S1 (NP (VB Time) (NNS flies)))', parser_score=-54.05083561918019, reranker_score=-15.079632500107973)

You don't need to specify a tag for all words: Here, token 0 ('Time') should have tag VB and token 1 ('flies') is unconstrained:

>>> rrp.parse_tagged(['Time', 'flies'], possible_tags={0 : 'VB'})[0]
ScoredParse('(S1 (S (VP (VB Time) (NP (VBZ flies)))))', parser_score=-54.34977155750189, reranker_score=-16.681734375725263)
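For the miR case in the question, one way to combine an existing regular expression with parse_tagged() might look like the sketch below. The pattern MIR_PATTERN and the helper build_possible_tags() are hypothetical names made up for illustration, and the token list is a shortened, pre-tokenized version of the example sentence. The parse_tagged() call is left as a comment because it needs a loaded model (rrp).

```python
import re

# Hypothetical pattern for microRNA names such as miR-223, miR-26b or
# miR-17-5p; substitute the regular expression you already have.
MIR_PATTERN = re.compile(r"^miR-\d+[a-z]?(-\d+[a-z]?)*$")


def build_possible_tags(tokens, pattern=MIR_PATTERN, tag="NN"):
    """Map the index of every token matching `pattern` to `tag`."""
    return {i: tag for i, tok in enumerate(tokens) if pattern.match(tok)}


# Shortened, already tokenized version of the sentence from the question.
tokens = ["Human", "micro-RNAs", "miR-223", ",", "miR-26b", ",", "and",
          "miR-205", "were", "significantly", "up-regulated", "in",
          "bladder", "cancers", "."]

possible_tags = build_possible_tags(tokens)
print(possible_tags)  # {2: 'NN', 4: 'NN', 7: 'NN'}

# With a loaded RerankingParser instance `rrp` (not created here):
# best = rrp.parse_tagged(tokens, possible_tags=possible_tags)[0]
```

Tokens that do not match the pattern are left out of the dictionary, so they remain unconstrained. Since the tags act as soft constraints, it is still worth checking the resulting tree.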

samirgupta commented 7 years ago

Thanks. This is very helpful.