BLLIP / bllip-parser

BLLIP reranking parser (also known as the Charniak-Johnson parser, Charniak parser, or Brown reranking parser). See http://pypi.python.org/pypi/bllipparser/ for the Python module.
http://bllip.cs.brown.edu/

Handling quotes #58

Closed jofatmofn closed 7 years ago

jofatmofn commented 7 years ago

Given the text

John said, "Welcome to the heaven".

rrp.simple_parse gives

(S1 (S (NP (NNP John)) (VP (VBD said) (, ,) (`` ``) (INTJ (UH Welcome) (PP (TO to) (NP (DT the) (NN heaven)))) ('' '')) (. .)))

If I use rrp.parse_tagged with the following tokens and postags

tokens=[u'John', u'said', u',', u'"', u'Welcome', u'to', u'the', u'heaven', u'"', u'.']
postags={0: u'NNP', 1: u'VBD', 2: u',', 3: u'``', 4: u'UH', 5: u'TO', 6: u'DT', 7: u'NN', 8: u"''", 9: u'.'}

it returns an empty list.

Workaround: in tokens, if I change the opening double quote to two backticks and the closing double quote to two apostrophes, as in

tokens=[u'John', u'said', u',', u'``', u'Welcome', u'to', u'the', u'heaven', u"''", u'.']

then it works.

dmcc commented 7 years ago

I think this is happening because parse_tagged() needs pretokenized text and BLLIP's tokenizer replaces quotes with their two-backtick and two-single-quote variants (this is how they're encoded in PTB format).

We could make parse_tagged call tokenize() on its input, but I thought it would be safer for users to call it first to make sure they knew what their sentence would look like after tokenization.


jofatmofn commented 7 years ago

Sure. Thanks. Could you please direct me to any reference (document or code) that describes such replacements? I need to use tokens and postags from another parser, and I can apply these replacements before calling BLLIP.

dmcc commented 7 years ago

I think this more or less covers it:

ftp://ftp.cis.upenn.edu/pub/treebank/public_html/tokenization.html

There's no strict standard and each parser may interpret some edge cases slightly differently, but the main things to note for using rrp.parse_tagged are how quotes, apostrophes, and brackets are handled.
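As a rough illustration of the quote and bracket conventions mentioned above, here is a minimal sketch of a pre-processing step that could be applied to tokens coming from another tokenizer. The helper name to_ptb_tokens and the PTB_BRACKETS table are hypothetical and not part of bllipparser. The sketch assumes that straight double quotes strictly alternate between opening and closing, and that brackets should use the usual PTB escapes (-LRB-, -RRB-, and so on). It does not split contractions or otherwise handle apostrophes, so that part would still need to follow the tokenization document.

```python
# Hypothetical helper (not part of bllipparser): rewrite tokens produced by
# another tokenizer into Penn Treebank (PTB) conventions.

# Standard PTB escapes for brackets.
PTB_BRACKETS = {
    '(': '-LRB-', ')': '-RRB-',
    '[': '-LSB-', ']': '-RSB-',
    '{': '-LCB-', '}': '-RCB-',
}


def to_ptb_tokens(tokens):
    """Return a new token list with quotes and brackets in PTB form."""
    converted = []
    quote_open = False  # True while inside a straight double-quoted span
    for token in tokens:
        if token == '"':
            # Assumption: straight double quotes strictly alternate
            # between opening (``) and closing ('').
            converted.append("''" if quote_open else '``')
            quote_open = not quote_open
        elif token == '\u201c':  # left curly double quote
            converted.append('``')
        elif token == '\u201d':  # right curly double quote
            converted.append("''")
        else:
            # Escape brackets; leave every other token unchanged.
            converted.append(PTB_BRACKETS.get(token, token))
    return converted


tokens = ['John', 'said', ',', '"', 'Welcome', 'to', 'the', 'heaven', '"', '.']
ptb_tokens = to_ptb_tokens(tokens)
print(ptb_tokens)
```

The converted list can then be passed to rrp.parse_tagged together with the postags dictionary from the original report, which already uses the `` and '' tags at positions 3 and 8.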

jofatmofn commented 7 years ago

Thanks.