Noahs-ARK / semafor

http://www.ark.cs.cmu.edu/SEMAFOR
GNU General Public License v3.0
95 stars 47 forks source link

ArrayIndexOutOfBoundsException #5

Closed vanatteveldt closed 10 years ago

vanatteveldt commented 10 years ago

On the current git/master version, I get an ArrayIndexOutOfBoundsException

java.lang.ArrayIndexOutOfBoundsException: 20
    at edu.cmu.cs.lti.ark.util.nlp.parse.DependencyParse.getHeuristicHead(DependencyParse.java:383)

I call it with the corenlp parse of the sentence

If I had to rely on a traditional pension, I would not have had the flexibility that I currently enjoy."

The whole session, files, and stack trace are at https://gist.github.com/vanatteveldt/9180176

vanatteveldt commented 10 years ago

Another sentence that gives the error:

But the Russians only responded and to the surprise of the US and some Western allies with such force that left them with mere verbal warnings and second thoughts about favoring Georgia 's bid for the NATO membership .

https://gist.github.com/vanatteveldt/9185658

sammthomson commented 10 years ago

Thanks for the bug report. I'm swamped for the next couple days, but will take a look ASAP

vanatteveldt commented 10 years ago

I am trying to see if I can fix (or at least workaround) this myself. I printed the parse nodes just before the error occurred for the test file in https://gist.github.com/vanatteveldt/9180176#file-tree-conll

Trying to get tokenNum[0]+1=20
tokenNums=[19]
parseNodes=[0:null_word(^=null, 1:If(^=3:had), 2:I(^=5:rely), 3:had(^=15:had), 4:to(^=5:rely), 5:rely(^=3:had), 6:on(^=0:null_word), 7:a(^=9:pension), 8:traditional(^=9:pension), 9:pension(^=5:rely), 10:,(^=15:had), 11:I(^=15:had), 12:would(^=15:had), 13:not(^=15:had), 14:have(^=15:had), 15:had(^=0:null_word), 22:.(^=15:had), 23:''(^=15:had)]
java.lang.ArrayIndexOutOfBoundsException: 20

The problem seems to be that there is a cycle in the input;

16  the _   DT  DT  _   17  det _   _
17  flexibility _   NN  NN  _   21  dobj    _   _
18  that    _   IN  IN  _   21  mark    _   _
19  I   _   PRP PRP _   21  nsubj   _   _
20  currently   _   RB  RB  _   21  advmod  _   _
21  enjoy   _   VBP VBP _   17  dep _   _

The requested node, 20, is child of 20-21-17-21. So, if the dependency parse is constructed by recursively getting the child nodes from the root, it will not find these nodes, and these nodes are indeed missing from the parseNodes array.

This means that either corenlp is producing illegal conll output (since the dependency tree is in fact not a tree), or that semafor is not parsing the conll correctly.

This is also the case with the other sentence (https://gist.github.com/vanatteveldt/9185658, cycle is 19-21-19). It seems that corenlp produces this output for 'that VP' clauses.

I can work around this in the client by checking and culling cycles (and accepting that I get no frames on them, or submitting them as separate trees, but if it is legal conll to have a multitree then it might be good to change the parsing code as well?

For reference, the code producing the output:

        if(tokenNums.length==1)
        {
            try {
            return parseNodes[tokenNums[0]+1];
            } catch (java.lang.ArrayIndexOutOfBoundsException e) {
            System.err.println("Trying to get tokenNum[0]+1="+(tokenNums[0]+1));
            System.err.println("tokenNums="+Arrays.toString(tokenNums));
            System.err.println("parseNodes="+Arrays.toString(parseNodes));
            throw(e);
            }
        }

(edit: see https://mailman.stanford.edu/pipermail/java-nlp-user/2014-May/005639.html)

vanatteveldt commented 10 years ago

Reply from corenlp:

John Bauer <horatio@gmail.com>
10:18 PM (15 minutes ago) to Wouter, java-nlp-user 
We've made quite a few dependency converter changes since then, and
our current codebase does not seem to have these circular
dependencies.  If you want, it's available in our github account, or
you could wait for the next release, which should be within a month.

https://github.com/stanfordnlp/CoreNLP

John

I'm closing this issue as it is probably a CoreNLP problem rather than semafor's fault.