Closed by samirgupta 7 years ago
Yes, this is supported! You'll want to use parse_tagged() instead of parse():
You can also parse text with existing POS tags (these act as soft constraints). In this example, token 0 ('Time') should have tag VB and token 1 ('flies') should have tag NNS:
>>> rrp.parse_tagged(['Time', 'flies'], possible_tags={0 : 'VB', 1 : 'NNS'})[0]
ScoredParse('(S1 (NP (VB Time) (NNS flies)))', parser_score=-54.05083561918019, reranker_score=-15.079632500107973)
You don't need to specify a tag for all words: Here, token 0 ('Time') should have tag VB and token 1 ('flies') is unconstrained:
>>> rrp.parse_tagged(['Time', 'flies'], possible_tags={0 : 'VB'})[0]
ScoredParse('(S1 (S (VP (VB Time) (NP (VBZ flies)))))', parser_score=-54.34977155750189, reranker_score=-16.681734375725263)
Thanks. This is very helpful.
I am using the GENIA+PubMed model and parsing biomedical text.
A very frequent issue I have observed is that certain biomedical entities such as microRNAs (miRs) are being tagged as CD rather than NN.
Example sentence (the named entities are the miR-* identifiers): Human micro-RNAs miR-223, miR-26b, miR-221, miR-103-1, miR-185, miR-23b, miR-203, miR-17-5p, miR-23a, and miR-205 were significantly up-regulated in bladder cancers.
(S1 (S (S (NP (NP (NP (JJ Human) (NNS micro-RNAs)) (QP (CD miR-223) (, ,) (CD miR-26b) (, ,) (CD miR-221) (, ,) (CD miR-103-1) (, ,) (CD miR-185) (, ,) (CD miR-23b) (, ,) (CD miR-203))) (, ,) (NP (NP (NN miR-17-5p)) (, ,) (NP (NN miR-23a)) (, ,) (CC and) (NP (NN miR-205)))) (VP (VBD were) (ADVP (RB significantly)) (VP (VBN up-regulated) (PP (IN in) (NP (JJ bladder) (NNS cancers)))))) (. .)))
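To see which tokens ended up tagged CD in a tree like the one above, something along these lines may help. It is only a sketch: the helper name `tagged_tokens` and the regular expression are my own (hypothetical, not part of bllipparser), and the input string is a shortened excerpt of the tree above rather than the full parse.

```python
import re

# Matches preterminal nodes of the form "(TAG token)" in a bracketed parse.
# Assumption: neither tags nor tokens contain whitespace or parentheses.
PRETERMINAL = re.compile(r"\(([^\s()]+) ([^\s()]+)\)")

def tagged_tokens(tree):
    """Return (token, tag) pairs in sentence order (hypothetical helper)."""
    return [(tok, tag) for tag, tok in PRETERMINAL.findall(tree)]

# Shortened excerpt of the parse above, for illustration only.
fragment = "(NP (JJ Human) (NNS micro-RNAs)) (QP (CD miR-223) (, ,) (CD miR-26b))"

# Collect miR-like tokens that were tagged as cardinal numbers (CD).
mis_tagged = [tok for tok, tag in tagged_tokens(fragment)
              if tag == "CD" and tok.startswith("miR-")]
print(mis_tagged)
```

Running the same check on the full parse string would list every miR token that landed inside the QP.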
I have a regular expression for identifying such miR named entities. Is there a way to force these entities to be recognized as NN, and thus as part of an NP instead of a QP?
Note: Since I am converting these parse trees generated by BLLIP to Universal Dependencies using the Stanford Typed Dependency Converter, the dependencies between these entities (incorrectly tagged as CD) are also being identified incorrectly.
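One possible way to combine such a regular expression with the possible_tags argument shown earlier in this thread is to build the dictionary from the token list. This is a minimal sketch, not a tested recipe: the pattern `MIR_PATTERN` and the helper `build_possible_tags` are hypothetical names, the pattern only covers identifiers shaped like the ones in the example sentence, and the token list is assumed to be already tokenized.

```python
import re

# Hypothetical pattern for identifiers such as miR-223, miR-26b, miR-103-1, miR-17-5p.
# Assumption: adjust it to whatever naming variants occur in your corpus.
MIR_PATTERN = re.compile(r"^miR-\d+[a-z]?(?:-\d+[a-z]*)*$")

def build_possible_tags(tokens, pattern=MIR_PATTERN, tag="NN"):
    """Map the index of each token matching `pattern` to `tag` (hypothetical helper)."""
    return {i: tag for i, tok in enumerate(tokens) if pattern.match(tok)}

# Pre-tokenized excerpt of the example sentence.
tokens = ["Human", "micro-RNAs", "miR-223", ",", "miR-26b", ",",
          "miR-17-5p", ",", "and", "miR-205", "were", "significantly",
          "up-regulated", "in", "bladder", "cancers", "."]

possible_tags = build_possible_tags(tokens)
print(possible_tags)
```

The resulting dictionary has the same shape as the possible_tags argument in the examples above, so it could then be passed as rrp.parse_tagged(tokens, possible_tags=possible_tags). Since the reply above describes these tags as soft constraints, it would be worth checking the returned tree to confirm the miR tokens actually come out as NN.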