clulab / processors

Natural Language Processors
https://clulab.github.io/processors/
Apache License 2.0

Integrating SRL #52

Closed aolney closed 8 years ago

aolney commented 8 years ago

I was in the middle of adding SRL annotations from the LTH SRL parser when I noticed some existing code, particularly Reader.scala in swirl2, that might be suited to this.

Is Reader.scala ready for this? I'm particularly interested in combining this information with the discourse tree.

MihaiSurdeanu commented 8 years ago

Yes, it is. It reads a CoNLL formatted file and converts it into a DirectedGraph representation of semantic roles.

I am in the process of adding an in-house SRL package (swirl2), as well. However, the LTH parser is very good, so I would welcome the addition.

aolney commented 8 years ago

There seems to be a discrepancy between the output LTH produces and what Reader.scala expects. Specifically, there is an assert that the number of columns is >= 14, but LTH output can have fewer.

Here's an example with one predicate; LTH has 12 columns:

1   It  _   _   PRP It  _   PRP 2   SBJ _   A1
2   is  be  _   VBZ is  be  VBZ 0   ROOT    _   _
3   hollow  hollow  _   JJ  hollow  hollow  JJ  2   PRD _   _
4   and _   _   CC  and _   CC  2   COORD   _   _
5   can can _   MD  can can MD  4   CONJ    _   AM-MOD
6   fill    fill    _   VB  fill    fill    VB  5   VC  fill.01 _
7   up  _   _   RP  up  _   RP  6   PRT _   _
8   with    _   _   IN  with    _   IN  6   ADV _   A2
9   blood   blood   _   NN  blood   blood   NN  8   PMOD    _   _
10  .   _   _   .   .   _   .   2   P   _   _

Here's an example with no predicates (according to LTH); it has 11 columns:

1   The _   _   DT  The _   DT  2   NMOD    _
2   heart   heart   _   NN  heart   heart   NN  3   SBJ _
3   is  be  _   VBZ is  be  VBZ 0   ROOT    _
4   a   _   _   DT  a   _   DT  5   NMOD    _
5   pump    pump    _   NN  pump    pump    NN  3   PRD _
6   .   _   _   .   .   _   .   3   P   _
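
The column layout in the two examples above can be sketched as a small parser. This is only an illustration, not code from processors: `LthRowSketch`, `LthRow`, and `parse` are hypothetical names, and the 0-based layout (word at 1, POS at 4, predicate sense at 10, one argument column per predicate from 11 on) is an assumption read off the sample rows. It splits on any whitespace so that it works on the rows as pasted here; real LTH files are presumably tab-separated.

```scala
// Hypothetical helper, not part of processors: splits one LTH output row
// using the column layout visible in the two examples above.
object LthRowSketch {
  final case class LthRow(word: String, pos: String, predicate: Option[String], args: Vector[String])

  // Assumed 0-based layout: 1 = word, 4 = POS, 10 = predicate sense,
  // 11 and up = one argument column per predicate in the sentence.
  val PredColumn = 10

  def parse(line: String): LthRow = {
    // Split on runs of whitespace; assumes no field is empty or contains spaces.
    val bits = line.trim.split("\\s+")
    require(bits.length > PredColumn,
      s"expected at least ${PredColumn + 1} columns, got ${bits.length}")
    val predicate = if (bits(PredColumn) == "_") None else Some(bits(PredColumn))
    LthRow(bits(1), bits(4), predicate, bits.drop(PredColumn + 1).toVector)
  }
}
```

Under these assumptions, a row from a sentence with no predicates yields an empty `args` vector, which is why a fixed minimum of 14 columns cannot hold for this output.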

aolney commented 8 years ago

I can confirm that tweaking some of the indices solved the problem. Below is some code that shows this. Basically I tweaked the Reader class so that SRL could be added to the doc created by an existing processor.

The way I'm calling the code is like this:

// any processor works here! Try FastNLPProcessor or BioNLPProcessor.
val proc:Processor = new CoreNLPProcessor(withDiscourse = true)

// the actual work is done here
val doc = proc.annotate(inputText)

//new SRL annotator (basically Reader)
val srlAnnotator = new SRLAnnotator
val srlDoc = srlAnnotator.readCoNLL("/z/aolney/repos/lth_srl/lth_parse.txt", doc, verbose = true)

Here's the SRLAnnotator class:

import java.io.{FileReader, BufferedReader, PrintWriter, File}

import edu.arizona.sista.processors.fastnlp.FastNLPProcessor
import edu.arizona.sista.processors.{DependencyMap, DocumentSerializer, Document, Processor}
import edu.arizona.sista.struct.DirectedGraph
import org.slf4j.LoggerFactory

import scala.collection.mutable
import scala.collection.mutable.{ListBuffer, ArrayBuffer}
import scala.io.Source

//import Reader._

/**
 * Simplified the CoNLL reader -- aolney
 * Reads a CoNLL formatted file and converts it to our own representation
 * User: mihais
 * Date: 5/5/15
 */
class SRLAnnotator {
  class CoNLLToken(val word:String, val pos:String, val pred:Int, val frameBits:Array[String]) {
    override def toString:String = word + "/" + pos + "/" + pred
  }

  val logger = LoggerFactory.getLogger(classOf[SRLAnnotator])

  val USE_CONLL_TOKENIZATION = true
  val USE_GOLD_SYNTAX = true

  var argConflictCount = 0
  var multiPredCount = 0
  var argCount = 0
  var predCount = 0

  def readCoNLL(filePath:String, document:Document, verbose:Boolean = false):Document = {
    val file = new File(filePath)
    val source = Source.fromFile(file)
    val sentences = new ArrayBuffer[Array[CoNLLToken]]
    var sentence = new ArrayBuffer[CoNLLToken]

    argConflictCount = 0
    multiPredCount = 0
    argCount = 0
    predCount = 0
    var tokenCount = 0
    var sentCount = 0
    var hyphCount = 0

    //
    // read all sentences
    // also, collapse hyphenated phrases, which were brutally tokenized in CoNLL
    //
    for(l <- source.getLines()) {
      val line = l.trim
      if(line.length > 0) {
        val bits = l.split("\\t")
        assert(bits.size >= 11) // >=14
        val token = mkToken(bits)
        sentence += token
        tokenCount += 1
        if(token.pos == "HYPH") hyphCount += 1
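      } else if(sentence.nonEmpty) {
        // end of sentence (the guard skips repeated blank lines)
        sentences += collapseHyphens(sentence.toArray, verbose)
        sentence = new ArrayBuffer[CoNLLToken]()
        sentCount += 1
      }
    }
    // flush the last sentence in case the file does not end with a blank line
    if(sentence.nonEmpty) {
      sentences += collapseHyphens(sentence.toArray, verbose)
      sentCount += 1
    }
    source.close()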
    logger.debug(s"Read $tokenCount tokens, grouped in $sentCount sentences.")
    logger.debug(s"Found $hyphCount hyphens.")
    logger.debug(s"In hyphenated phrases, found $multiPredCount multi predicates and $argConflictCount argument conflicts.")

    //
    // construct the semantic roles from CoNLL tokens
    //
    val semDependencies = new ArrayBuffer[DirectedGraph[String]]()
    for(sent <- sentences) {
      semDependencies += mkSemanticDependencies(sent)
    }

    //
    // Using the doc passed in, assign the semantic roles to sentence
    //
    assert(document.sentences.length == semDependencies.size)
    for(i <- 0 until document.sentences.length) {
      document.sentences(i).setDependencies(DependencyMap.SEMANTIC_ROLES, semDependencies(i))
    }

    logger.debug(s"Found a total of $predCount predicates with $argCount arguments.")

    document
  }

  def mkSemanticDependencies(sentence:Array[CoNLLToken]):DirectedGraph[String] = {
    val edges = new ListBuffer[(Int, Int, String)]
    val heads = new mutable.HashSet[Int]()
    val modifiers = new mutable.HashSet[Int]()

    var columnOffset = -1
    for(p <- 0 until sentence.length) {
      if(sentence(p).pred > 0) { // found a head
        val head = p
        heads += head
        predCount += 1
        columnOffset += sentence(p).pred // in case of multiple predicates squished in one token, use the last
        for(i <- 0 until sentence.length) {
          if(sentence(i).frameBits(columnOffset) != "_") {
            val modifier = i
            val label = sentence(i).frameBits(columnOffset)
            edges += new Tuple3(head, modifier, label)
            modifiers += modifier
            argCount += 1
          }
        }
      }
    }

    val roots = new mutable.HashSet[Int]()
    for(h <- heads) {
      if(! modifiers.contains(h)) {
        roots += h
      }
    }

    new DirectedGraph[String](edges.toList, roots.toSet)
  }

  def mkToken(bits:Array[String]):CoNLLToken = {
    val word = bits(1)
    val pos = bits(4) // predicted POS; in the LTH rows shown earlier, column 3 is always "_"
    val isPred = bits(10) match { //13
      case "_" => 0
      case _ => 1
    }
    val frameBits =  bits.slice(11, bits.length)//14
    new CoNLLToken(word, pos, isPred, frameBits)
  }

  /**
   * Merges tokens that were separated around dashes in CoNLL, to bring tokenization closer to the usual Treebank one
   * We need this because most parsers behave horribly if hyphenated words are tokenized around dashes
   */
  def collapseHyphens(origSentence:Array[CoNLLToken], verbose:Boolean):Array[CoNLLToken] = {
    val sent = new ArrayBuffer[CoNLLToken]()

    var start = 0
    while(start < origSentence.length) {
      val end = findEnd(origSentence, start)
      if(end > start + 1) {
        val token = mergeTokens(origSentence, start, end, verbose)
        sent += token
      } else {
        sent += origSentence(start)
      }
      start = end
    }

    sent.toArray
  }

  def findEnd(sent:Array[CoNLLToken], start:Int):Int = {
    var end = start + 1
    while(end < sent.length) {
      if(sent(end).pos != "HYPH") return end
      else end = end + 2
    }
    sent.length
  }

  def mergeTokens(sent:Array[CoNLLToken], start:Int, end:Int, verbose:Boolean):CoNLLToken = {
    val phrase = sent.slice(start, end)
    val word = phrase.map(_.word).mkString("")
    val pos = phrase.last.pos // this one doesn't really matter; we retag the entire data with our Processor anyway...
    val pred = mergePredicates(phrase, verbose)
    val frameBits = mergeFrames(phrase, verbose)

    if(verbose) {
      //logger.debug("Merging tokens: " + phrase.mkString(" ") + " as: " + word + "/" + isPred)
    }

    new CoNLLToken(word, pos, pred, frameBits)
  }

  def mergePredicates(phrase:Array[CoNLLToken], verbose:Boolean):Int = {
    val l = phrase.map(_.pred).sum

    if(l > 0) {
      if(l > 1) {
        if(verbose) logger.debug("Found MULTI PREDICATE in hyphenated phrase: " + phrase.mkString(" "))
        multiPredCount += 1
      }
      if(verbose) {
        // logger.info("Found hyphenated predicate: " + phrase.mkString(" "))
      }
    }

    l
  }

  def mergeFrames(phrase:Array[CoNLLToken], verbose:Boolean):Array[String] = {
    val frameBits = new Array[String](phrase(0).frameBits.length)
    for(i <- 0 until frameBits.length) {
      frameBits(i) = mergeFrame(phrase, i, verbose)
    }
    frameBits
  }

  def mergeFrame(phrase:Array[CoNLLToken], position:Int, verbose:Boolean):String = {
    // pick the right-most argument assignment
    // for example, if the tokens have: "A1 _ A0" we would pick A0
    // of course, the above scenario is HIGHLY unlikely. normally, there will be a single argument, e.g.: "_ _ A0"

    var arg = "_"
    var count = 0
    for(i <- phrase.length - 1 to 0 by -1) {
      if(phrase(i).frameBits(position) != "_") {
        if(arg == "_") arg = phrase(i).frameBits(position)
        count += 1
      }
    }
    if(count > 1) {
      if(verbose) logger.debug("Found ARGUMENT CONFLICT " + phrase.map(_.frameBits(position)).mkString(" ") + " in hyphenated phrase: " + phrase.mkString(" "))
      argConflictCount += 1
    }

    arg
  }
}
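
The core of mkSemanticDependencies is the pairing of predicates with argument columns: the k-th predicate in a sentence owns the k-th argument column. A stripped-down sketch of that idea, independent of processors, might look like the following. `SrlEdgesSketch` and `edges` are hypothetical names, each token is reduced to its predicate field plus its argument columns, and the sketch assumes at most one predicate per token (so it ignores the merged-hyphen case the class above handles) and that every token has one argument column per predicate.

```scala
// Hypothetical, simplified illustration of predicate/argument-column pairing.
object SrlEdgesSketch {
  // Each token is (predicate field, argument columns); "_" marks an empty cell.
  // Returns (head index, modifier index, role label) triples.
  def edges(rows: Seq[(String, Seq[String])]): List[(Int, Int, String)] = {
    // Token indices that carry a predicate, in sentence order.
    val predicateRows = rows.indices.filter(i => rows(i)._1 != "_")
    (for {
      (head, column) <- predicateRows.zipWithIndex // k-th predicate uses column k
      modifier <- rows.indices
      label = rows(modifier)._2(column)
      if label != "_"
    } yield (head, modifier, label)).toList
  }
}
```

For the "It is hollow and can fill up with blood." example earlier in the thread, the only predicate is token index 5 (fill.01), and the non-empty cells in its column sit on "It", "can", and "with", so the sketch would produce three edges headed at index 5.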

MihaiSurdeanu commented 8 years ago

Sounds almost good. Additionally, you'll have to make sure that the tokenization in the CoNLL frames aligns with the tokenization in the corresponding Processor. In general, it does not. For example, for CoNLL, we tokenized around dashes, so "student-taught course" is actually 4 tokens. CoreNLP breaks this into 2 tokens.
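
One rough way to check the alignment described above is to merge the CoNLL tokens around dashes and compare the result with the Processor's tokens. The sketch below works on plain word strings and is only an assumption about how such a check could look: `TokenAlignmentSketch`, `collapseHyphens`, and `aligned` are hypothetical names, and it keys on a literal "-" token rather than on the HYPH POS tag that the class posted above uses.

```scala
// Hypothetical alignment check on plain word sequences; not part of processors.
object TokenAlignmentSketch {
  // Glue a "-" token, and the token that follows it, onto the previous token:
  // "student", "-", "taught" becomes "student-taught".
  def collapseHyphens(words: Seq[String]): Vector[String] =
    words.foldLeft(Vector.empty[String]) { (acc, w) =>
      if (acc.nonEmpty && (w == "-" || acc.last.endsWith("-"))) acc.init :+ (acc.last + w)
      else acc :+ w
    }

  // True when the collapsed CoNLL tokens match the Processor tokens one to one.
  def aligned(conllWords: Seq[String], processorWords: Seq[String]): Boolean =
    collapseHyphens(conllWords) == processorWords.toVector
}
```

With the example from this comment, the four CoNLL tokens collapse to "student-taught" and "course", which matches the two-token split. A word that legitimately ends in "-" would be merged with its successor, so this is a coarse check rather than a full aligner.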

For the record, the Reader class was designed for the CoNLL 2008 format. I think that is nearly identical to the 2009 format for English. Here's the first sentence in the training corpus in the 2008 format:

1 In in in IN IN 43 20 LOC ADV AM-LOC
2 an an an DT DT 5 5 NMOD NMOD
3 Oct. oct. oct. NNP NNP 4 4 NMOD NMOD
4 19 19 19 CD CD 5 5 NMOD NMOD AM-TMP
5 review review review NN NN 1 1 PMOD PMOD Y review.01
6 of of of IN IN 5 5 NMOD NMOD A1
7 `` ` 9 6 P P
8 The the the DT DT 9 9 NMOD NMOD
9 Misanthrope misanthrope misanthrope NN NN 6 6 PMOD PMOD
10 '' '' '' '' '' 9 5 P P
11 at at at IN IN 9 5 LOC LOC
12 Chicago chicago chicago NNP NNP 15 15 NMOD NMOD
13 's 's 's POS POS 12 12 SUFFIX SUFFIX
14 Goodman goodman goodman NNP NNP 15 15 NAME NAME
15 Theatre theatre theatre NNP NNP 11 11 PMOD PMOD
16 ( -lrb- -lrb- ( ( 20 20 P P
17` `` `` 20 19 P P
18 Revitalized revitalize revitalize VBN VBN 19 19 NMOD NMOD Y revitalize.01
19 Classics classics classics NNS NNS 20 20 SBJ SBJ A1 A0 A1
20 Take take take VBP VB 5 43 PRN OBJ Y take.01
21 the the the DT DT 22 22 NMOD NMOD
22 Stage stage stage NN NNP 20 20 OBJ OBJ Y stage.02 A1
23 in in in IN IN 20 22 LOC LOC AM-LOC
24 Windy windy windy NNP NNP 25 25 NAME NAME
25 City city city NNP NNP 23 23 PMOD PMOD
26 , , , , , 20 43 P P
27 '' '' '' '' '' 20 43 P P
28 Leisure leisure leisure NNP NNP 30 30 NAME NAME
29 & & & CC CC 30 30 NAME NAME
30 Arts arts arts NNS NNS 20 34 TMP NMOD
31 ) -rrb- -rrb- ) ) 20 30 P P
32 , , , , , 43 34 P P
33 the the the DT DT 34 34 NMOD NMOD
34 role role role NN NN 43 43 SBJ SBJ Y role.01 A1 A1
35 of of of IN IN 34 34 NMOD NMOD A1
36 Celimene celimene celimene NNP NNP 35 35 PMOD PMOD
37 , , , , , 34 34 P P
38 played play play VBN VBN 34 34 APPO APPO Y play.02
39 by by by IN IN 38 38 LGS LGS A0
40 Kim kim kim NNP NNP 41 41 NAME NAME
41 Cattrall cattrall cattrall NNP NNP 39 39 PMOD PMOD A0
42 , , , , , 34 34 P P
43 was be be VBD VBD 0 0 ROOT ROOT
44 mistakenly mistakenly mistakenly RB RB 45 45 MNR AMOD AM-MNR
45 attributed attribute attribute VBN VBN 43 43 VC PRD Y attribute.01
46 to to to TO TO 45 45 ADV AMOD A2
47 Christina christina christina NNP NNP 48 48 NAME NAME
48 Haag haag haag NNP NNP 46 46 PMOD PMOD
49 . . . . . 43 43 P P

aolney commented 8 years ago

Thanks :) There are 3 versions of LTH code I think:

I'm using the 2008 version, and it seems to be working. I will closely check the hyphenation issue.

aolney commented 8 years ago

I'm specifically looking at

https://github.com/microth/mateplus

at this point

MihaiSurdeanu commented 8 years ago

Hi Andrew, Mate+ is a solid tool. It would be great to have it seamlessly integrated in processors. But, at this point, we have limited resources for the job. Maybe I'll get an MS student to help, but it's not guaranteed. If you have already integrated it, and are willing to contribute it, it would be awesome. Mihai

aolney commented 8 years ago

Happy to contribute. I'll do a fork and submit a PR when I get something working.

MihaiSurdeanu commented 8 years ago

Thanks!


aolney commented 7 years ago

@CraigKelly has created a wrapper for MATE+

https://github.com/CraigKelly/mateplus-poc

CraigKelly commented 7 years ago

@aolney, apologies for the confusion. The repo has moved to its new home under an organizational account:

https://github.com/memphis-iis/mateplus-poc