dmcc / PyStanfordDependencies

Python interface for converting Penn Treebank trees to Stanford Dependencies and Universal Depenencies
https://pypi.python.org/pypi/PyStanfordDependencies
68 stars 17 forks source link

Sentence.from_stanford_dependencies() fails on collapsed (enhanced) dependency strings #14

Closed ccsasuke closed 8 years ago

ccsasuke commented 8 years ago

Below is an example where the function fails at assertion: assert len(matches) == 1 (CoNLL.py, line 209)

Universal dependencies, enhanced nsubj(reach-3, Visitors-1) nsubj(reach-3', Visitors-1) aux(reach-3, can-2) root(ROOT-0, reach-3) conj:and(reach-3, reach-3') dobj(reach-3, it-4) advmod(reach-3, only-5) case(escort-9, under-6) amod(escort-9, strict-7) amod(escort-9, military-8) nmod:under(reach-3, escort-9) cc(reach-3, and-10) case(permission-13, with-11) amod(permission-13, prior-12) nmod:with(reach-3', permission-13) case(Pentagon-16, from-14) det(Pentagon-16, the-15) nmod:from(permission-13, Pentagon-16) case(flights-22, aboard-18) amod(flights-22, special-19) amod(flights-22, small-20) compound(flights-22, shuttle-21) nmod:aboard(reach-3, flights-22) nsubj(reach-24, flights-22) ref(flights-22, that-23) acl:relcl(flights-22, reach-24) det(base-26, the-25) dobj(reach-24, base-26) case(flight-30, by-27) det(flight-30, a-28) amod(flight-30, circuitous-29) nmod:by(reach-24, flight-30) case(States-34, from-31) det(States-34, the-32) compound(States-34, United-33) nmod:from(flight-30, States-34)

My guess is that relations such as nsubj(reach-3', Visitors-1) are not catched by the regex. Am I missing anything? Thanks!

dmcc commented 8 years ago

Thanks for the report! Yeah, the deps_re regex doesn't like the ' in 3'. That said, I'm not sure what the ' means in this case -- I can't tell if that's a bug in UD or some feature of UD/Stanford's UD converter that I'm not familiar with.

Can you send the original PTB tree that was used for conversion?

ccsasuke commented 8 years ago

Hi, the sentence was: Visitors can reach it only under strict military escort and with prior permission from the Pentagon , aboard special small shuttle flights that reach the base by a circuitous flight from the United States .

The online Stanford parser demo (http://nlp.stanford.edu:8080/parser/index.jsp) gave the aforementioned dependency string.

The parse tree is as below: (ROOT (S (NP (NNS Visitors)) (VP (MD can) (VP (VB reach) (NP (PRP it)) (ADVP (RB only)) (PP (PP (IN under) (NP (JJ strict) (JJ military) (NN escort))) (CC and) (PP (IN with) (NP (NP (JJ prior) (NN permission)) (PP (IN from) (NP (DT the) (NNP Pentagon)))))) (, ,) (PP (IN aboard) (NP (NP (JJ special) (JJ small) (NN shuttle) (NNS flights)) (SBAR (WHNP (WDT that)) (S (VP (VBP reach) (NP (DT the) (NN base)) (PP (IN by) (NP (NP (DT a) (JJ circuitous) (NN flight)) (PP (IN from) (NP (DT the) (NNP United) (NNPS States)))))))))))) (. .)))

Thanks!

ccsasuke commented 8 years ago

Oops, the indentation was not kept here. Perhaps you can simply enter the sentence to the Stanford online demo to see the PTB tree and dependency strings.

dmcc commented 8 years ago

Thanks for the tree. I've been able to replicate the bug. I'm curious how the CoreNLP online demo would render this sentence, but unfortunately it looks like it's currently down. Either way, I sent in a report to Stanford about the UD output.

dmcc commented 8 years ago

Aha -- it is a feature, not a bug: https://github.com/stanfordnlp/CoreNLP/issues/81#issuecomment-122018883

Seems that a fix might be to update the regex to allow for 's (but then not might include them in the regex group). Not sure if I'll be able to get this done and properly tested before EMNLP, but it shouldn't be too hard to change.

dmcc commented 8 years ago

This issue should be fixed in the main branch, @ccsasuke. Please reopen if you find cases where it doesn't work. Thanks again!

ccsasuke commented 8 years ago

@dmcc Thanks for the prompt fix!

However, it still fails on some of my sentences. The main two cases I noted were as follows: i) A node can be copied multiple times. See the tree below given my the Stanford online parser demo: (ROOT (S (NP (NP (DT A) (NN total)) (PP (IN of) (NP (NP (QP (CD 17) (CD million)) (JJ metric) (NNS tons)) (PP (IN of) (NP (NNS potatoes)))))) (VP (VBD was) (VP (VBN produced) (, ,) (SBAR (WHNP (WDT which)) (S (VP (VBD was) (ADJP (NP (CD 14) (NN %)) (JJR less) (PP (PP (IN than) (NP (NP (NP (JJ last) (NN year)) (PRN (-LRB- -LRB-) (NP (NP (CD 106) (NNS quintals)) (PP (IN per) (NP (NN hectare)))) (-RRB- -RRB-))) (, ,) (CC and) (NP (NP (QP (CD 5.4) (CD million)) (JJ metric) (NNS tons)) (PP (IN of) (NP (NNS vegetables)))))) (, ,) (CC or) (ADVP (NP (CD 2.2) (NN %)) (RBR more)) (PP (IN than) (PP (IN on) (NP (DT the) (JJ same) (NN date)) (NP (JJ last) (NN year)))))) (PRN (-LRB- -LRB-) (NP (NP (JJ 116) (NNS quintals)) (PP (IN per) (NP (NN hectare)))) (-RRB- -RRB-))))))) (. .)))

Dependencies:

det(total-2, A-1) nsubjpass(produced-11, total-2) case(tons-7, of-3) compound(million-5, 17-4) nummod(tons-7, million-5) amod(tons-7, metric-6) nmod:of(total-2, tons-7) case(potatoes-9, of-8) nmod:of(tons-7, potatoes-9) auxpass(produced-11, was-10) root(ROOT-0, produced-11) nsubj(less-17, which-13) nsubj(less-17', which-13) nsubj(less-17'', which-13) nsubj(less-17''', which-13) nsubj(less-17'''', which-13) cop(less-17, was-14) nummod(%-16, 14-15) nmod:npmod(less-17, %-16) ccomp(produced-11, less-17) ccomp(produced-11, less-17') ccomp(produced-11, less-17'') ccomp(produced-11, less-17''') ccomp(produced-11, less-17'''') conj:and(less-17, less-17''') conj:and(less-17, less-17'''') conj:or(less-17, less-17') conj:or(less-17, less-17'') case(year-20, than-18) amod(year-20, last-19) nmod:than(less-17, year-20) nummod(quintals-23, 106-22) dep(year-20, quintals-23) case(hectare-25, per-24) nmod:per(quintals-23, hectare-25) cc(less-17, and-28) compound(million-30, 5.4-29) nummod(tons-32, million-30) amod(tons-32, metric-31) conj(year-20, tons-32) case(vegetables-34, of-33) nmod:of(tons-32, vegetables-34) cc(less-17, or-36) nummod(%-38, 2.2-37) nmod:npmod(more-39, %-38) conj(year-20, more-39) case(date-44, than-40) case(date-44, on-41) det(date-44, the-42) amod(date-44, same-43) nmod:on(less-17', date-44) amod(year-46, last-45) nmod:tmod(date-44, year-46) amod(quintals-49, 116-48) dep(less-17, quintals-49) case(hectare-51, per-50) nmod:per(quintals-49, hectare-51)

ii) It seems the code is not handling escaped characters well (e.g. \/) Stanford CoreNLP escapes '/' in both dependencies and parse trees, but it seems the code only cleaned up the ones in the tree, leading to assertion failure at line 217 in CoNLL.py For instance, it fails at the following sentence:

(ROOT (NP (NP (NNP Hanoi) (, ,) (NNP May) (CD 13)) (PRN (-LRB- -LRB-) (NP (NNP VNA)) (-RRB- -RRB-)) (: --) (NP (NP (NNP Vietnam)) (SBAR (S (VP (VBZ has) (VP (VBN produced) (NP (NP (DT a) (NN variety)) (PP (IN of) (NP (NNS drugs)))) (S (VP (TO to) (VP (VB control) (NP (NNS HIV\/AIDS)) (PP (IN in) (NP (NP (NNS patients)) (VP (VBG suffering) (PP (IN with) (NP (DT the) (NN disease)))))))))))))) (. .)))

Dependencies:

compound(May-3, Hanoi-1) root(ROOT-0, May-3) nummod(May-3, 13-4) appos(May-3, VNA-6) dep(May-3, Vietnam-9) aux(produced-11, has-10) acl:relcl(Vietnam-9, produced-11) det(variety-13, a-12) dobj(produced-11, variety-13) case(drugs-15, of-14) nmod:of(variety-13, drugs-15) mark(control-17, to-16) advcl(produced-11, control-17) dobj(control-17, HIV\/AIDS-18) case(patients-20, in-19) nmod:in(control-17, patients-20) acl(patients-20, suffering-21) case(disease-24, with-22) det(disease-24, the-23) nmod:with(suffering-21, disease-24)

ccsasuke commented 8 years ago

@dmcc BTW, I couldn't find a way to re-open this issue and I found this rule: "you cannot re-open your own issues if a repo collaborator closed them"

dmcc commented 8 years ago

@ccsasuke Sorry about the issue opening (didn't know about that rule). Reopening this issue to keep everything in the same place (will close the other one).

dmcc commented 8 years ago

@ccsasuke Thanks for the reports. Please check the latest commits (7474297) and see if they address part i).

For part ii), I'm not sure what's causing it -- there is some code to handle this type of (un)escaping and some unit tests. I added the tree from part ii) into the test suite which tests all combinations of (basic, collapsed) x (Subprocess, JPype) x (SD, UD) and couldn't get the assertion to fail (I get / instead of \/). I also tried r quoting it (r'') instead of normal quoting. Can you post a snippet where you get the error?

ccsasuke commented 8 years ago

@dmcc I confirmed that multiple copies are correctly handled now.

For part ii), I guess the reason is that I was trying to use Sentence.from_stanford_dependencies() method to read the Stanford dependency for a given sentence, and the dependency string also escaped '/'. e.g. dobj(control-17, HIV\/AIDS-18) as in the example I gave. Therefore, the assertion failed since in the code only handles the escaping in the parse tree, but not the dependency string. I guess I can solve this at my end if you don't think it's a bug (by changing '\/' to '/' in dep strings). Adding the following line before Line 224 of CoNLL.py solved this problem for me: line = line.replace(r'\/', '/')

Also, another thing is that after pulling the latest version, Line 242 of CoNLL.py (extra = None) caused a problem when comparing two tokens ('NoneType' object has no attribute 'items' at Line 53 of CoNLL.py) I changed Line 242 to extra = {} and now it works for all my sentences!

Thanks for all the help!

dmcc commented 8 years ago

@ccsasuke Thanks again! I checked in some code which (hopefully) fixes both problems -- please close this issue if it works for you.

ccsasuke commented 8 years ago

@dmcc It works! Thanks a lot!