clulab / processors

Natural Language Processors
https://clulab.github.io/processors/
Apache License 2.0
418 stars 101 forks source link

CluProcessor.parseSentenceWithEisner seems to be crashing #652

Closed kwalcock closed 2 years ago

kwalcock commented 2 years ago

This seems to crash:

    Utils.initializeDyNet()
    val proc:Processor = new CluProcessor()

    val doc = proc.annotate("contribution , by delegation , to the promotion of the season dry cold ( ssf ) 2019/20 contribution , by speculation , to the enhancement of the ssf 2019/20 podor 27% matam 5% bakel 2% dagana 18% apple earth 10% okra 5% yam 7% but 4% others 26% tomato 10% onion 38% guiers lake 48% valley : 20,992 ha saed/ddac/dsg/bdd , cold dry season 19/20 valley : 20,992 ha saed/ddac/dsg/bdd , cold dry season 2019_2020 1 others : cassava , melon , peanuts , eggplant , watermelon , chili , carrot , cabbage , jaxatu , beans , bissap , banana plantation , fruit trees .")
kwalcock commented 2 years ago

Specifically

[error] java.lang.ArrayIndexOutOfBoundsException: 185
[error]         at org.clulab.dynet.Eisner.$anonfun$generateOutput$2(Eisner.scala:248)
[error]         at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
[error]         at org.clulab.dynet.Eisner.generateOutput(Eisner.scala:244)
[error]         at org.clulab.dynet.Eisner.ensembleParser(Eisner.scala:395)
[error]         at org.clulab.processors.clu.CluProcessor.parseSentenceWithEisner(CluProcessor.scala:458)
[error]         at org.clulab.processors.clu.CluProcessor.$anonfun$parse$1(CluProcessor.scala:773)
[error]         at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
[error]         at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
[error]         at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
[error]         at org.clulab.processors.clu.CluProcessor.parse(CluProcessor.scala:772)
[error]         at org.clulab.processors.clu.CluProcessor.$anonfun$annotate$1(CluProcessor.scala:234)
[error]         at org.clulab.utils.BeforeAndAfter.perform(BeforeAndAfter.scala:10)
[error]         at org.clulab.utils.BeforeAndAfter.perform$(BeforeAndAfter.scala:7)
[error]         at org.clulab.processors.clu.GivenConstEmbeddingsAttachment.perform(CluProcessor.scala:1009)
[error]         at org.clulab.processors.clu.CluProcessor.annotate(CluProcessor.scala:226)
[error]         at org.clulab.processors.Processor.annotate(Processor.scala:65)
[error]         at org.clulab.processors.Processor.annotate$(Processor.scala:62)
[error]         at org.clulab.processors.clu.CluProcessor.annotate(CluProcessor.scala:222)
[error]         at org.clulab.processors.examples.ProcessorExample$.main(ProcessorExample.scala:35)
[error]         at org.clulab.processors.examples.ProcessorExample.main(ProcessorExample.scala)
[error]         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[error]         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
[error]         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[error]         at java.lang.reflect.Method.invoke(Method.java:498)
kwalcock commented 2 years ago

In this code in Eisner.generateOutput

      for(i <- scores.indices) {
        val relativeHead = scores(i).maxBy(_._2)._1.toInt
        val depMod = i + 1
        val depHead = if (relativeHead == 0) 0 else depMod + relativeHead
        if (depMod >= dependencies.length || depHead >= dependencies(depMod).length) // added
          println("It is out of bounds!") // added
        val label = dependencies(depMod)(depHead).label

I'm not sure there is any guarantee that the depMod + relativeHead will be within bounds. It might be that the relativeHead needs to be chosen not just with maxBy, but such that ._1 will result in a depHead value within bounds.

MihaiSurdeanu commented 2 years ago

Thanks Keith! I'll take a look.

On Tue, Jul 5, 2022 at 22:02 Keith Alcock @.***> wrote:

In this code in Eisner.generateOutput

  for(i <- scores.indices) {
    val relativeHead = scores(i).maxBy(_._2)._1.toInt
    val depMod = i + 1
    val depHead = if (relativeHead == 0) 0 else depMod + relativeHead
    if (depMod >= dependencies.length || depHead >= dependencies(depMod).length) // added
      println("It is out of bounds!") // added
    val label = dependencies(depMod)(depHead).label

I'm not sure there is any guarantee that the depMod + relativeHead will be within bounds. It might be that the relativeHead needs to be chosen not just with maxBy, but such that ._1 will result in a depHead value within bounds.

— Reply to this email directly, view it on GitHub https://github.com/clulab/processors/issues/652#issuecomment-1175401009, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAI75TWM66T3L7IVJ6I73U3VSSBFHANCNFSM5Z5HUS7A . You are receiving this because you are subscribed to this thread.Message ID: @.***>

MihaiSurdeanu commented 2 years ago

I think you're right! There is a check for these bounds in the part of the code that handles the common algorithm, but not in this error handling case...

I think this in line:

val depHead = if (relativeHead == 0) 0 else depMod + relativeHead

depHead should be bound to be between [0, sentence length) Thanks!

kwalcock commented 2 years ago

There is code in Eisner.scala that handles the eventuality of the index being out of range:

https://github.com/clulab/processors/blob/master/main/src/main/scala/org/clulab/dynet/Eisner.scala#L282

kwalcock commented 2 years ago

The version in toDependencyTable does not necessarily fill in every value in the Array that it returns, the Array[Array[Dependency]]. The assignment dependencies(mod)(head) = Dependency(mod, head, score, j) is only performed conditionally. That seems to be accounted for later in ensembleParser by val dep = startingDependencies(i)(j); if(dep != null) {. However, it does not seem to be checked consistently, because later there is

        startingDependencies(mod + 1)(head + 1).score =
          lambda * startingDependencies(mod + 1)(head + 1).score + (1 - lambda) * topLabelAndScore._2

that might result in null.score (AFAICT).

This generateOutput is also not checking for incoming nulls, but probably should. It returns an IndexedSeq[(Int, String)]. I think there should be no nulls there because the output goes to parserPostProcessing where it does _._1 and _._2 without checking for null. To prevent the nulls, I need to fill in a value, even when new code like this runs:

          val depHead =
              if (dependencies(depMod).indices.contains(oldDepHead)) oldDepHead
              else 0 // There should be a better value

0 is not a good answer, because later

val label = dependencies(depMod)(depHead).label

and the dependency there could be null (and is). What value (for head) and label should be used?

Some cyclomatic complexity calculator would flag this code as overly complex, but I'm not in a good position to change that.

MihaiSurdeanu commented 2 years ago

Thanks @kwalcock !

On the first issue:

 startingDependencies(mod + 1)(head + 1).score =
          lambda * startingDependencies(mod + 1)(head + 1).score + (1 - lambda) * topLabelAndScore._2

This is fine. The parent for loop goes over labelTopScores, which are aligned with startingDependencies. That is, we will never "hit" a null cell in this loop.

On the second issue: yes, in the else block ("Eisner failed to produce a complete tree; revert to the greedy inference") we do need to check for nulls. In case a null is present, we need to fill in some value. I propose to use: 0/"root" - for the if(generateRelativeHeads) condition, and -1/"root" - for the else condition. That is, all these words will be headed by the virtual root token in the sentence.

kwalcock commented 2 years ago

Addressed with #654.