Notice: SEMAFOR is no longer being maintained. Please see open-SESAME for a newer, more accurate frame-semantic parsing system.

SEMAFOR (a frame-semantic parser for English)
Copyright (C) 2012
Dipanjan Das, Andre Martins, Nathan Schneider, Desai Chen, Sam Thomson &
Noah A. Smith
Language Technologies Institute, Carnegie Mellon University
<http://www.ark.cs.cmu.edu/SEMAFOR>

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program.  If not, see <http://www.gnu.org/licenses/>.

SEMAFOR: Semantic Analysis of Frame Representations

SEMAFOR is a tool for automatic analysis of the frame-semantic structure of English text.

FrameNet is a lexical resource that groups predicates in a hierarchy of structured concepts, known as frames. Each frame in the lexicon in turn defines several named roles corresponding to aspects of that concept (e.g. participants in an event).

This tool attempts to find which words in text evoke which semantic frames, and to find and label each frame's arguments (portions of the sentence that fill a role associated with the frame). It takes as input a file with English sentences, one per line, and performs the following steps:

  1. Preprocessing: The sentences are lemmatized, part-of-speech tagged, and syntactically parsed.

  2. Target identification: Frame-evoking words and phrases ("targets") are heuristically identified in each sentence.

  3. Frame identification: A log-linear model, trained on FrameNet 1.5 data with full-text frame annotations, produces for each target a probability distribution over frames in the FrameNet lexicon (optionally constrained by a semi-supervised filter). The target is then labeled with the highest-scoring frame.

  4. Argument identification: A second log-linear model, trained on the same data, considers every role of each labeled frame instance and identifies a span of words in the sentence, or NULL, as filling that role. A subsequent step uses beam search to ensure that none of a frame's overt arguments overlap; an alternative strategy based on AD^3 (Alternating Directions Dual Decomposition) applies two other FrameNet constraints during argument identification.

  5. Output: An XML or JSON file is produced containing the text of the input sentences, augmented with the frame-semantic information (target-frame and argument-role pairings) predicted by the system.

See the papers listed below ("Further Reading") for algorithmic details and experimental evaluation of the components of this system.
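The frame identification and argument overlap steps described above can be illustrated with a toy sketch. This is not SEMAFOR's implementation: the feature names, the weights, and the helper functions `frame_distribution` and `spans_overlap` are all hypothetical, chosen only to show what "a log-linear model followed by picking the highest-scoring frame" and "no overlapping overt arguments" mean in practice.

```python
import math


def frame_distribution(features, weights, candidate_frames):
    """Toy log-linear model: p(frame | target) is proportional to
    exp(sum of the weights of the (frame, feature) pairs that fire)."""
    scores = {
        frame: sum(weights.get((frame, feat), 0.0) for feat in features)
        for frame in candidate_frames
    }
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores.values())
    exps = {frame: math.exp(score - top) for frame, score in scores.items()}
    total = sum(exps.values())
    return {frame: value / total for frame, value in exps.items()}


def spans_overlap(a, b):
    """Spans are (start, end) token offsets with an exclusive end."""
    return a[0] < b[1] and b[0] < a[1]


# Hypothetical weights and features for the target "sold".
weights = {
    ("Commerce_sell", "lemma=sell"): 2.0,
    ("Commerce_sell", "pos=VBD"): 0.5,
    ("Commerce_buy", "pos=VBD"): 0.5,
}
features = ["lemma=sell", "pos=VBD"]

dist = frame_distribution(features, weights, ["Commerce_sell", "Commerce_buy"])
best_frame = max(dist, key=dist.get)  # label the target with the top frame
print(best_frame, dist)

# Two candidate argument spans that share token 2 cannot both be kept.
print(spans_overlap((0, 3), (2, 5)))
```

In the real system the candidate frames, features, and weights come from the trained models, and the overlap constraint is enforced during search rather than by a pairwise check like this one.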

What follows is an overview of the organization of SEMAFOR's directory structure, and how it can be installed and run on new data.

Requirements

Running the SEMAFOR tool requires Java 1.7. It should run on any platform (Windows, Unix, or Mac OS).

Contents

Underneath the root folder, there are the following files and folders:

bin/: Executables for running SEMAFOR
lib/: Java libraries required for this project, as detailed below
scripts/: Executables required for preprocessing raw text and evaluating the performance of SEMAFOR
src/: Source files of the SEMAFOR project
training/: Scripts and data used for training the two models
LICENSE: Text of the GNU General Public License, Version 3
README.md: This file
pom.xml: The Maven project object model

Installation

Downloads

This experimental fork is maintained at https://github.com/Noahs-ARK/semafor. For a more stable version, the latest official release, SEMAFOR v2.1, can be downloaded from https://github.com/Noahs-ARK/semafor-semantic-parser

In preprocessing, SEMAFOR uses MaltParser as the syntactic dependency parser. To use MaltParser, download and unpack the model files for MaltParser and SEMAFOR from here: http://www.ark.cs.cmu.edu/SEMAFOR/semafor_malt_model_20121129.tar.gz (~140MB). The MaltParser model was trained on sections 02-21 of the WSJ portion of the Penn Treebank, and the SEMAFOR models were trained on the FrameNet 1.5 datasets.

Environment Variables

The file bin/config.sh lists a set of variables that should be modified within the file before running SEMAFOR.

Compilation

Compilation is easiest using Maven version >= 3.0 (http://maven.apache.org/).

mvn package

This command compiles and packages Semafor-3.0-alpha-04.jar (including all dependencies) into the target/ directory. Many scripts in bin/ point to Semafor-3.0-alpha-04.jar, so run mvn package immediately after installing, and again after making any changes to the source code.

Running the Frame-Semantic Parser

./bin/runSemafor.sh <absolute-path-to-input-file-with-one-sentence-per-line> <output-file> <number-of-threads>
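The script expects an absolute path to a file with one sentence per line. The following is a minimal sketch of preparing such a file and assembling the command from Python. The helper names (`write_input`, `build_command`, `run_semafor`), the file names `sentences.txt` and `semafor.out`, and the thread count are assumptions made for illustration; passing an absolute output path is a cautious choice, not a documented requirement.

```python
import subprocess
import tempfile
from pathlib import Path

# Script path from the usage line above, relative to the repository root.
RUN_SCRIPT = "./bin/runSemafor.sh"


def write_input(sentences, path):
    """Write one sentence per line and return the absolute path."""
    path = Path(path).resolve()
    # Collapse internal whitespace/newlines so each sentence stays on one line.
    lines = [" ".join(sentence.split()) for sentence in sentences]
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path


def build_command(input_path, output_path, threads):
    """Assemble the argument list: input file, output file, thread count."""
    return [
        RUN_SCRIPT,
        str(Path(input_path).resolve()),
        str(Path(output_path).resolve()),
        str(threads),
    ]


def run_semafor(sentences, workdir, threads=1):
    """Not called below; needs a built SEMAFOR checkout as the working directory."""
    workdir = Path(workdir).resolve()
    input_path = write_input(sentences, workdir / "sentences.txt")
    output_path = workdir / "semafor.out"  # hypothetical output file name
    subprocess.run(build_command(input_path, output_path, threads), check=True)
    return output_path


# Demonstration that only prepares the input and prints the command.
with tempfile.TemporaryDirectory() as tmp:
    demo_input = write_input(
        ["She sold him a book.", "The  cat sat\non the mat."],
        Path(tmp) / "sentences.txt",
    )
    print(demo_input.read_text(encoding="utf-8"), end="")
    print(build_command(demo_input, Path(tmp) / "semafor.out", 2))
```

Calling `run_semafor([...], ".")` from the repository root would then invoke the script, assuming the models are downloaded and bin/config.sh has been edited.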

Some users have reported improved and more consistent runtime behavior when enabling NUMA. If your system is NUMA-capable, you can enable it with the JVM option -XX:+UseNUMA, which requires that the -XX:+UseParallelGC option also be specified.

Server Mode

SEMAFOR can also be run as a TCP socket server. It accepts dependency parses in CoNLL format, and replies with JSON frame-semantic parses. Run the following command:

java -Xms4g -Xmx4g -cp target/Semafor-3.0-alpha-04.jar edu.cmu.cs.lti.ark.fn.SemaforSocketServer model-dir:<directory-of-trained-model> port:<port>

The message: Listening on port: NNNN will appear once the server has loaded the model and is ready to accept connections (where NNNN is the port). You can test that it's working with the following:

cat src/test/resources/fixtures/example.conll | nc localhost NNNN

(where NNNN is again the port).
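The same check can be made from a program instead of nc. The sketch below is a minimal client under stated assumptions: the function names `query_semafor` and `parse_json_lines` are hypothetical, the client assumes the server answers once it has received the input and the client has closed its sending side, and it assumes the reply contains one JSON object per line. None of these details are confirmed by this README, so adjust them to what the server actually does.

```python
import json
import socket


def query_semafor(conll_text, host, port, timeout=60.0):
    """Send CoNLL-formatted parses to the server and return the raw reply text."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(conll_text.encode("utf-8"))
        # Half-close the connection to signal that no more input is coming
        # (assumption: the server treats this as end of input).
        sock.shutdown(socket.SHUT_WR)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closed its side: reply is complete
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8")


def parse_json_lines(reply_text):
    """Parse the reply, assuming one JSON object per non-empty line."""
    return [json.loads(line) for line in reply_text.splitlines() if line.strip()]
```

With a server running, a call such as `parse_json_lines(query_semafor(open("src/test/resources/fixtures/example.conll").read(), "localhost", NNNN))` (with NNNN replaced by the port number) would return the parsed replies as Python dictionaries.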

Retraining SEMAFOR

Please see the training README.

Further Reading

If this parser is used, please cite the following papers, depending on the components used:

  1. An Exact Dual Decomposition Algorithm for Shallow Semantic Parsing with Constraints. Dipanjan Das, André F. T. Martins, and Noah A. Smith. Proceedings of *SEM 2012. (Please cite this paper if you use AD^3 within SEMAFOR.)

  2. Graph-Based Lexicon Expansion with Sparsity-Inducing Penalties. Dipanjan Das and Noah A. Smith. Proceedings of NAACL 2012.

  3. Semi-Supervised Frame-Semantic Parsing for Unknown Predicates. Dipanjan Das and Noah A. Smith. Proceedings of ACL 2011. (Please cite papers 2 and 3 if you use the graph-based filters within SEMAFOR.)

  4. Probabilistic Frame-Semantic Parsing. Dipanjan Das, Nathan Schneider, Desai Chen, and Noah A. Smith. Proceedings of NAACL-HLT 2010. (The first paper describing SEMAFOR.)

For further information, please read:

  1. SEMAFOR 1.0: A Probabilistic Frame-Semantic Parser. Dipanjan Das, Nathan Schneider, Desai Chen, and Noah A. Smith. CMU Technical Report, CMU-LTI-10-001.

  2. Semi-Supervised and Latent Variable Models of Natural Language Semantics. Dipanjan Das. Ph.D. Thesis, Carnegie Mellon University, May 2012.

Details of the training and test sections of the FrameNet 1.5 datasets can be found in paper 3 of the citation list above (Das and Smith, ACL 2011). The supplementary material for that paper lists the names of the test documents and can be found here: http://www.dipanjandas.com/files/acl-hlt2011-suppl-semafor.pdf

Contact

If you find any bugs or have questions, please email Sam Thomson (sthomson@cs.cmu.edu), Dipanjan Das (dipanjan@cs.cmu.edu, dipanjand@gmail.com), or Nathan Schneider (nschneid@cs.cmu.edu, neatnate@gmail.com).

Version History