Closed aolney closed 8 years ago
Yes, it is. It reads a CoNLL-formatted file and converts it into a DirectedGraph representation of semantic roles.
I am in the process of adding an in-house SRL package (swirl2), as well. However, the LTH parser is very good, so I would welcome the addition.
There seems to be a discrepancy between the output LTH produces and what Reader.scala expects. Specifically, Reader.scala asserts that each line has at least 14 columns, but LTH output can have fewer.
Here's an example with one predicate; LTH has 12 columns:
1 It _ _ PRP It _ PRP 2 SBJ _ A1
2 is be _ VBZ is be VBZ 0 ROOT _ _
3 hollow hollow _ JJ hollow hollow JJ 2 PRD _ _
4 and _ _ CC and _ CC 2 COORD _ _
5 can can _ MD can can MD 4 CONJ _ AM-MOD
6 fill fill _ VB fill fill VB 5 VC fill.01 _
7 up _ _ RP up _ RP 6 PRT _ _
8 with _ _ IN with _ IN 6 ADV _ A2
9 blood blood _ NN blood blood NN 8 PMOD _ _
10 . _ _ . . _ . 2 P _ _
Here's an example with no predicates (according to LTH); it has 11 columns:
1 The _ _ DT The _ DT 2 NMOD _
2 heart heart _ NN heart heart NN 3 SBJ _
3 is be _ VBZ is be VBZ 0 ROOT _
4 a _ _ DT a _ DT 5 NMOD _
5 pump pump _ NN pump pump NN 3 PRD _
6 . _ _ . . _ . 3 P _
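To make the mismatch concrete, here is a rough sketch that parses one row under two column layouts. `Layout`, `Row`, and `parseRow` are hypothetical names, not part of processors. The indices 4/13/14 and 3/10/11 are the ones that appear in the tweaked reader's code and comments in this thread, and the row is the `fill` token from the 12-column example, assumed to be tab-separated in the actual file.

```scala
object ColumnLayoutSketch {
  // Hypothetical helper: column indices are parameters, so the same parsing
  // code can serve both the 14+ column layout and the shorter LTH output.
  case class Layout(posCol: Int, predCol: Int, firstArgCol: Int) {
    // The predicate column must exist; argument columns are optional.
    def minColumns: Int = predCol + 1
  }

  case class Row(word: String, pos: String, isPred: Boolean, args: Vector[String])

  // Indices that appear as the commented-out values in the tweaked reader.
  val readerDefault: Layout = Layout(posCol = 4, predCol = 13, firstArgCol = 14)
  // Indices the tweaked reader uses for the LTH output.
  val lthOutput: Layout = Layout(posCol = 3, predCol = 10, firstArgCol = 11)

  // Returns None when the line is too short for the layout instead of
  // failing an assert.
  def parseRow(line: String, layout: Layout): Option[Row] = {
    val bits = line.split("\t")
    if (bits.length < layout.minColumns) None
    else Some(Row(
      word = bits(1),
      pos = bits(layout.posCol),
      isPred = bits(layout.predCol) != "_",
      args = bits.slice(layout.firstArgCol, bits.length).toVector
    ))
  }

  // Row 6 of the 12-column example, assumed tab-separated in the file.
  val fillRow: String =
    Seq("6", "fill", "fill", "_", "VB", "fill", "fill", "VB", "5", "VC", "fill.01", "_").mkString("\t")
}
```

With `readerDefault` the row is rejected because it has only 12 columns; with `lthOutput` it parses as a predicate with a single argument column. Note that in the row as rendered above, index 3 holds `_` and the tag `VB` sits at index 4, so the POS index may deserve a second look if the tag matters downstream (for example for the `HYPH` check).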
I can confirm that tweaking some of the indices solved the problem. Below is some code that shows this. Basically I tweaked the Reader class so that SRL could be added to the doc created by an existing processor.
The way I'm calling the code is like this:
// any processor works here! Try FastNLPProcessor or BioNLPProcessor.
val proc:Processor = new CoreNLPProcessor(withDiscourse = true)
// the actual work is done here
val doc = proc.annotate(inputText)
//new SRL annotator (basically Reader)
val srlAnnotator = new SRLAnnotator
val srlDoc = srlAnnotator.readCoNLL("/z/aolney/repos/lth_srl/lth_parse.txt", doc, verbose = true)
Here's the SRLAnnotator class:
import java.io.{FileReader, BufferedReader, PrintWriter, File}
import edu.arizona.sista.processors.fastnlp.FastNLPProcessor
import edu.arizona.sista.processors.{DependencyMap, DocumentSerializer, Document, Processor}
import edu.arizona.sista.struct.DirectedGraph
import org.slf4j.LoggerFactory
import scala.collection.mutable
import scala.collection.mutable.{ListBuffer, ArrayBuffer}
import scala.io.Source
//import Reader._
/**
* Simplified the CoNLL reader -- aolney
* Reads a CoNLL formatted file and converts it to our own representation
* User: mihais
* Date: 5/5/15
*/
class SRLAnnotator {
  class CoNLLToken(val word:String, val pos:String, val pred:Int, val frameBits:Array[String]) {
    override def toString:String = word + "/" + pos + "/" + pred
  }

  val logger = LoggerFactory.getLogger(classOf[SRLAnnotator])

  val USE_CONLL_TOKENIZATION = true
  val USE_GOLD_SYNTAX = true

  var argConflictCount = 0
  var multiPredCount = 0
  var argCount = 0
  var predCount = 0
  def readCoNLL(filePath:String, document:Document, verbose:Boolean = false):Document = {
    val file = new File(filePath)
    val source = Source.fromFile(file)
    val sentences = new ArrayBuffer[Array[CoNLLToken]]
    var sentence = new ArrayBuffer[CoNLLToken]

    argConflictCount = 0
    multiPredCount = 0
    argCount = 0
    predCount = 0
    var tokenCount = 0
    var sentCount = 0
    var hyphCount = 0

    //
    // read all sentences
    // also, collapse hyphenated phrases, which were brutally tokenized in CoNLL
    //
    for(l <- source.getLines()) {
      val line = l.trim
      if(line.length > 0) {
        val bits = l.split("\\t")
        assert(bits.size >= 11) // >=14
        val token = mkToken(bits)
        sentence += token
        tokenCount += 1
        if(token.pos == "HYPH") hyphCount += 1
      } else {
        // end of sentence
        sentences += collapseHyphens(sentence.toArray, verbose)
        sentence = new ArrayBuffer[CoNLLToken]()
        sentCount += 1
      }
    }
    // flush the last sentence if the file does not end with a blank line
    if(sentence.nonEmpty) {
      sentences += collapseHyphens(sentence.toArray, verbose)
      sentCount += 1
    }
    source.close()
    logger.debug(s"Read $tokenCount tokens, grouped in $sentCount sentences.")
    logger.debug(s"Found $hyphCount hyphens.")
    logger.debug(s"In hyphenated phrases, found $multiPredCount multi predicates and $argConflictCount argument conflicts.")

    //
    // construct the semantic roles from CoNLL tokens
    //
    val semDependencies = new ArrayBuffer[DirectedGraph[String]]()
    for(sent <- sentences) {
      semDependencies += mkSemanticDependencies(sent)
    }

    //
    // Using the doc passed in, assign the semantic roles to each sentence
    //
    assert(document.sentences.length == semDependencies.size)
    for(i <- 0 until document.sentences.length) {
      document.sentences(i).setDependencies(DependencyMap.SEMANTIC_ROLES, semDependencies(i))
    }
    logger.debug(s"Found a total of $predCount predicates with $argCount arguments.")
    document
  }
  def mkSemanticDependencies(sentence:Array[CoNLLToken]):DirectedGraph[String] = {
    val edges = new ListBuffer[(Int, Int, String)]
    val heads = new mutable.HashSet[Int]()
    val modifiers = new mutable.HashSet[Int]()

    var columnOffset = -1
    for(p <- 0 until sentence.length) {
      if(sentence(p).pred > 0) { // found a head
        val head = p
        heads += head
        predCount += 1
        columnOffset += sentence(p).pred // in case of multiple predicates squished in one token, use the last
        for(i <- 0 until sentence.length) {
          if(sentence(i).frameBits(columnOffset) != "_") {
            val modifier = i
            val label = sentence(i).frameBits(columnOffset)
            edges += new Tuple3(head, modifier, label)
            modifiers += modifier
            argCount += 1
          }
        }
      }
    }

    val roots = new mutable.HashSet[Int]()
    for(h <- heads) {
      if(! modifiers.contains(h)) {
        roots += h
      }
    }

    new DirectedGraph[String](edges.toList, roots.toSet)
  }
  def mkToken(bits:Array[String]):CoNLLToken = {
    val word = bits(1)
    val pos = bits(3) //4
    val isPred = bits(10) match { //13
      case "_" => 0
      case _ => 1
    }
    val frameBits = bits.slice(11, bits.length) //14
    new CoNLLToken(word, pos, isPred, frameBits)
  }
  /**
   * Merges tokens that were separated around dashes in CoNLL, to bring tokenization closer to the usual Treebank one
   * We need this because most parsers behave horribly if hyphenated words are tokenized around dashes
   */
  def collapseHyphens(origSentence:Array[CoNLLToken], verbose:Boolean):Array[CoNLLToken] = {
    val sent = new ArrayBuffer[CoNLLToken]()
    var start = 0
    while(start < origSentence.length) {
      val end = findEnd(origSentence, start)
      if(end > start + 1) {
        val token = mergeTokens(origSentence, start, end, verbose)
        sent += token
      } else {
        sent += origSentence(start)
      }
      start = end
    }
    sent.toArray
  }

  def findEnd(sent:Array[CoNLLToken], start:Int):Int = {
    var end = start + 1
    while(end < sent.length) {
      if(sent(end).pos != "HYPH") return end
      else end = end + 2
    }
    sent.length
  }
  def mergeTokens(sent:Array[CoNLLToken], start:Int, end:Int, verbose:Boolean):CoNLLToken = {
    val phrase = sent.slice(start, end)
    val word = phrase.map(_.word).mkString("")
    val pos = phrase.last.pos // this one doesn't really matter; we retag the entire data with our Processor anyway...
    val pred = mergePredicates(phrase, verbose)
    val frameBits = mergeFrames(phrase, verbose)
    if(verbose) {
      //logger.debug("Merging tokens: " + phrase.mkString(" ") + " as: " + word + "/" + isPred)
    }
    new CoNLLToken(word, pos, pred, frameBits)
  }

  def mergePredicates(phrase:Array[CoNLLToken], verbose:Boolean):Int = {
    val l = phrase.map(_.pred).sum
    if(l > 0) {
      if(l > 1) {
        if(verbose) logger.debug("Found MULTI PREDICATE in hyphenated phrase: " + phrase.mkString(" "))
        multiPredCount += 1
      }
      if(verbose) {
        // logger.info("Found hyphenated predicate: " + phrase.mkString(" "))
      }
    }
    l
  }
  def mergeFrames(phrase:Array[CoNLLToken], verbose:Boolean):Array[String] = {
    val frameBits = new Array[String](phrase(0).frameBits.length)
    for(i <- 0 until frameBits.length) {
      frameBits(i) = mergeFrame(phrase, i, verbose)
    }
    frameBits
  }

  def mergeFrame(phrase:Array[CoNLLToken], position:Int, verbose:Boolean):String = {
    // pick the right-most argument assignment
    // for example, if the tokens have: "A1 _ A0" we would pick A0
    // of course, the above scenario is HIGHLY unlikely. normally, there will be a single argument, e.g.: "_ _ A0"
    var arg = "_"
    var count = 0
    for(i <- phrase.length - 1 to 0 by -1) {
      if(phrase(i).frameBits(position) != "_") {
        if(arg == "_") arg = phrase(i).frameBits(position)
        count += 1
      }
    }
    if(count > 1) {
      if(verbose) logger.debug("Found ARGUMENT CONFLICT " + phrase.map(_.frameBits(position)).mkString(" ") + " in hyphenated phrase: " + phrase.mkString(" "))
      argConflictCount += 1
    }
    arg
  }
}
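For readers following along, the predicate-to-argument-column mapping in mkSemanticDependencies can be tried in isolation. The sketch below is a standalone simplification, not the class above: `Tok` and `edges` are hypothetical names, it has no dependency on processors, and the data is the one-predicate LTH example from earlier in the thread.

```scala
object SemanticEdgesSketch {
  // Simplified stand-in for CoNLLToken: pred counts the predicates on this
  // token, frameBits holds one argument label per predicate in the sentence.
  case class Tok(word: String, pred: Int, frameBits: Vector[String])

  // Returns (head, modifier, label) triples with 0-based token indices.
  def edges(sentence: Vector[Tok]): Vector[(Int, Int, String)] = {
    var columnOffset = -1
    val out = Vector.newBuilder[(Int, Int, String)]
    for (p <- sentence.indices if sentence(p).pred > 0) {
      // The n-th predicate in the sentence owns the n-th argument column.
      columnOffset += sentence(p).pred
      for (i <- sentence.indices) {
        val label = sentence(i).frameBits(columnOffset)
        if (label != "_") out += ((p, i, label))
      }
    }
    out.result()
  }

  // The 12-column example: one predicate ("fill" at index 5), so every token
  // carries exactly one argument column.
  val example: Vector[Tok] = Vector(
    Tok("It", 0, Vector("A1")),
    Tok("is", 0, Vector("_")),
    Tok("hollow", 0, Vector("_")),
    Tok("and", 0, Vector("_")),
    Tok("can", 0, Vector("AM-MOD")),
    Tok("fill", 1, Vector("_")),
    Tok("up", 0, Vector("_")),
    Tok("with", 0, Vector("A2")),
    Tok("blood", 0, Vector("_")),
    Tok(".", 0, Vector("_"))
  )
}
```

Calling `SemanticEdgesSketch.edges(SemanticEdgesSketch.example)` yields three edges, all headed by token 5 (`fill`): to token 0 with label A1, to token 4 with AM-MOD, and to token 7 with A2.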
Sounds almost good. Additionally, you'll have to make sure that the tokenization in the CoNLL frames aligns with the tokenization in the corresponding Processor. In general, it does not. For example, for CoNLL, we tokenized around dashes, so "student-taught course" is actually 4 tokens. CoreNLP breaks this into 2 tokens.
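One way to catch this early is to compare token counts per sentence before attaching the semantic roles. This is only a sketch under simple assumptions: both sides are plain sequences of token strings, `misaligned` is a hypothetical helper rather than anything in processors, and it compares lengths only, since the edge indices depend on token positions, not on the surface forms.

```scala
object AlignmentSketch {
  // Returns the 0-based indices of sentences whose token counts differ.
  // Sentences beyond the shorter of the two documents are not compared,
  // so the sentence counts should be checked separately.
  def misaligned(conll: Seq[Seq[String]], processor: Seq[Seq[String]]): Seq[Int] =
    conll.zip(processor).zipWithIndex.collect {
      case ((c, p), i) if c.length != p.length => i
    }

  // The example from the comment above: 4 CoNLL tokens vs. 2 processor tokens.
  val conllSide: Seq[Seq[String]] = Seq(
    Seq("student", "-", "taught", "course"),
    Seq("The", "heart", "is", "a", "pump", ".")
  )
  val processorSide: Seq[Seq[String]] = Seq(
    Seq("student-taught", "course"),
    Seq("The", "heart", "is", "a", "pump", ".")
  )
}
```

Here `AlignmentSketch.misaligned(AlignmentSketch.conllSide, AlignmentSketch.processorSide)` flags only sentence 0. A flagged sentence means the (head, modifier) indices from the CoNLL file would point at the wrong tokens in the processor's document.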
For the record, the Reader class was designed for the CoNLL 2008 format. I think that is nearly identical to the 2009 format for English. Here's the first sentence in the training corpus in the 2008 format:
1 In in in IN IN 43 20 LOC ADV AM-LOC
2 an an an DT DT 5 5 NMOD NMOD
3 Oct. oct. oct. NNP NNP 4 4 NMOD NMOD
4 19 19 19 CD CD 5 5 NMOD NMOD AM-TMP
5 review review review NN NN 1 1 PMOD PMOD Y review.01
6 of of of IN IN 5 5 NMOD NMOD A1
7 `` `` `` `` `` 9 6 P P
8 The the the DT DT 9 9 NMOD NMOD
9 Misanthrope misanthrope misanthrope NN NN 6 6 PMOD PMOD
10 '' '' '' '' '' 9 5 P P
11 at at at IN IN 9 5 LOC LOC
12 Chicago chicago chicago NNP NNP 15 15 NMOD NMOD
13 's 's 's POS POS 12 12 SUFFIX SUFFIX
14 Goodman goodman goodman NNP NNP 15 15 NAME NAME
15 Theatre theatre theatre NNP NNP 11 11 PMOD PMOD
16 ( -lrb- -lrb- ( ( 20 20 P P
17 `` `` `` `` `` 20 19 P P
18 Revitalized revitalize revitalize VBN VBN 19 19 NMOD NMOD Y revitalize.01
19 Classics classics classics NNS NNS 20 20 SBJ SBJ A1 A0 A1
20 Take take take VBP VB 5 43 PRN OBJ Y take.01
21 the the the DT DT 22 22 NMOD NMOD
22 Stage stage stage NN NNP 20 20 OBJ OBJ Y stage.02 A1
23 in in in IN IN 20 22 LOC LOC AM-LOC
24 Windy windy windy NNP NNP 25 25 NAME NAME
25 City city city NNP NNP 23 23 PMOD PMOD
26 , , , , , 20 43 P P
27 '' '' '' '' '' 20 43 P P
28 Leisure leisure leisure NNP NNP 30 30 NAME NAME
29 & & & CC CC 30 30 NAME NAME
30 Arts arts arts NNS NNS 20 34 TMP NMOD
31 ) -rrb- -rrb- ) ) 20 30 P P
32 , , , , , 43 34 P P
33 the the the DT DT 34 34 NMOD NMOD
34 role role role NN NN 43 43 SBJ SBJ Y role.01 A1 A1
35 of of of IN IN 34 34 NMOD NMOD A1
36 Celimene celimene celimene NNP NNP 35 35 PMOD PMOD
37 , , , , , 34 34 P P
38 played play play VBN VBN 34 34 APPO APPO Y play.02
39 by by by IN IN 38 38 LGS LGS A0
40 Kim kim kim NNP NNP 41 41 NAME NAME
41 Cattrall cattrall cattrall NNP NNP 39 39 PMOD PMOD A0
42 , , , , , 34 34 P P
43 was be be VBD VBD 0 0 ROOT ROOT
44 mistakenly mistakenly mistakenly RB RB 45 45 MNR AMOD AM-MNR
45 attributed attribute attribute VBN VBN 43 43 VC PRD Y attribute.01
46 to to to TO TO 45 45 ADV AMOD A2
47 Christina christina christina NNP NNP 48 48 NAME NAME
48 Haag haag haag NNP NNP 46 46 PMOD PMOD
49 . . . . . 43 43 P P
Thanks :) There are 3 versions of LTH code I think:
I'm using the 2008, and it seems to be working. I will closely check the hyphenation issue.
Hi Andrew, Mate+ is a solid tool. It would be great to have it seamlessly integrated in processors. But, at this point, we have limited resources for the job. Maybe I'll get an MS student to help, but it's not guaranteed. If you have already integrated it, and are willing to contribute it, it would be awesome. Mihai
Happy to contribute. I'll do a fork and submit a PR when I get something working
Thanks!
@CraigKelly has created a wrapper for MATE+
@aolney, apologies for the confusion: the repo has moved to its new home under an organizational account:
I was in the middle of adding SRL annotations from the LTH SRL parser when I noticed some existing code, particularly Reader.scala in swirl2, that might be suited to this.
Is Reader.scala ready for this? I'm particularly interested in combining this information with the discourse tree.