gangeli / ParsingTime

Parsing Time: Learning to Interpret Time Expressions
http://stanford.edu/~angeli/papers/2012-naacl-temporal.pdf
GNU Lesser General Public License v3.0



Abstract:

We present a probabilistic approach for learning to interpret temporal phrases given only a corpus of utterances and the times they reference. While most approaches to the task have used regular expressions and similar linear pattern interpretation rules, the possibility of phrasal embedding and modification in time expressions motivates our use of a compositional grammar of time expressions. This grammar is used to construct a latent parse which evaluates to the time the phrase would represent, as a logical parse might evaluate to a concrete entity. In this way, we can employ a loosely supervised EM-style bootstrapping approach to learn these latent parses while capturing both syntactic uncertainty and pragmatic ambiguity in a probabilistic framework. We achieve an accuracy of 72% on an adapted TempEval-2 task -- comparable to state of the art systems.


Compilation

Scala 2.9.1
JavaNLP (April 9, 2012)

Most experiments can be run with one of the following entries in the rc file:

  interpret: run the interpretation model (last valid run)
  interpretAux: run the interpretation model (good parameters)
  sutimeInterpret: run SUTime
  heideltimeInterpret: run HeidelTime
  gutimeInterpret: run GUTime

  detect: run the detection model (last valid run)

Models

  dist/time.jar: the program
  dist/interpretModel.ser.gz: serialized temporal interpretation model
  dist/detectModel.ser.gz: serialized temporal detection model (includes interpretation model)
  dist/log.gz: the log from the run producing interpretModel.ser.gz
  dist/options: the command line options from the run
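The .ser.gz suffix suggests a gzip-compressed, Java-serialized object. As a minimal sketch under that assumption (the README does not state how the models were written), such a file could be read as shown below. ModelIO and readGzippedObject are hypothetical names introduced for illustration and are not part of this repository; the model's classes (e.g., from dist/time.jar) would need to be on the classpath for deserialization to succeed.

```scala
import java.io.{FileInputStream, ObjectInputStream}
import java.util.zip.GZIPInputStream

// Hypothetical helper, not part of the repository.
object ModelIO {
  // Read one Java-serialized object from a gzip-compressed file.
  // Assumes the file was written with an ObjectOutputStream wrapped
  // around a GZIPOutputStream; this is an assumption about the format.
  def readGzippedObject[T](path: String): T = {
    val in = new ObjectInputStream(new GZIPInputStream(new FileInputStream(path)))
    try in.readObject().asInstanceOf[T]
    finally in.close()
  }
}
```

For example, ModelIO.readGzippedObject[AnyRef]("dist/interpretModel.ser.gz") would return the deserialized object, which could then be cast to the model class.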


TOUR

Some key classes are explained below. This is not an exhaustive list, but it should give a sense of how the code is organized.

Data.scala: Utilities for working with TempEval data, or the CoreMaps created from them
  class TimeDataset: holds the CoreMaps corresponding to the raw data
  object DataLib: utilities for working with the data

Detect.scala: Code for the detection model
  class CRFFeatures: the definition of the features used
  class CRFDetector: the class which performs and aggregates the detections
  class DetectionDatum: encapsulates the input information for detecting a time during training
  class CoreMapDetectionStore: the dataset used for detection; handles iterating over the data
  class DetectionTask: trains the CRF detection module

Entry.scala: General purpose code and main methods
  class Indexing: handles moving between strings and ints for POS and words
  object G: general utility fields
  object U: general utility functions
  class TimeSent: embodies a sentence sent to the CKY parser
  trait DataStore: interface for a block of data (train/dev/test)
  object Interpret: MAIN for running interpretation or detection for either our system or another system (via a CoreNLP Annotator)
  object Entry: MAIN for training the models; this is the usual entry point for the program, when not evaluating
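To illustrate what "moving between strings and ints" typically involves, here is a minimal sketch of a two-way index. It is not the repository's Indexing class: the name Indexer and its methods are hypothetical, chosen only to show the general idea of assigning each distinct word or POS tag a stable integer ID.

```scala
import scala.collection.mutable

// Hypothetical sketch; not the repository's Indexing class.
class Indexer {
  private val toInt = mutable.HashMap[String, Int]()
  private val toStr = mutable.ArrayBuffer[String]()

  // Return the existing index for `s`, or assign the next free one.
  def index(s: String): Int = toInt.getOrElseUpdate(s, {
    toStr += s
    toStr.length - 1
  })

  // Recover the string for a previously assigned index.
  def lookup(i: Int): String = toStr(i)

  // Number of distinct strings seen so far.
  def size: Int = toStr.length
}

object IndexerDemo extends App {
  val words = new Indexer
  val ids = Seq("next", "week", "next").map(words.index)
  println(ids)             // List(0, 1, 0)
  println(words.lookup(1)) // week
}
```

Working with integer IDs rather than strings keeps grammar rules and feature tables compact, which is a common reason for this kind of class in parsers.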

Gigaword.scala: Old code that attempts to extract more training instances from automatically parsed data

Interpret.scala: Most of the code for the interpretation model lives here
  class Grammar: defines the grammar used in CKY parsing
  object Grammar: creates the actual temporal grammar
  *class TreeTime: MODEL; this is our model, which can be serialized and loaded. It is also a CoreNLP annotator, and does both detection and interpretation.
  class InterpretationTask: the code for training the interpretation model

O.java: All the options for training the model(s). Some of these may no longer be used.

Tempeval2.scala: Code for processing the TempEval data into CoreMaps

Temporal.scala: The temporal representation
  trait Temporal: all temporal representations subclass this
  trait Range: all ranges subclass this (see paper)
  trait Duration: all durations subclass this (see paper)
  trait Sequence: all sequences subclass this (see paper). Note that it is both a Range and a Duration as well
  class Time: a point in time

  class GroundedRange: the conventional begin-to-end range (e.g., 1998)
  class UngroundedRange: a 'floating' range (e.g., today)
  class GroundedDuration: the conventional x-milliseconds type of duration
  class FuzzyDuration: an approximate duration
  class RepeatedRange: the standard range-every-n-milliseconds type of sequence
  class CompositeRange: a complex sequence that we can't evaluate analytically
  class NoTime: returned when an invalid operation is performed and no such time actually exists (e.g., Feb 30)
  class UnkTime: a time that we can't represent

  object Range: utility functions for handling ranges, notably intersecting
  *object Time: MAIN for running an interactive shell for evaluating temporal expressions. This is actually kind of a nifty domain specific language.
  object Lex: the definitions of times used in the grammar
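The relationship between grounded ranges, floating ranges, and NoTime can be pictured with a heavily simplified sketch. The type names mirror those listed for Temporal.scala, but every field, method, and signature here is an assumption made for illustration (including storing times as milliseconds and the ground and intersect methods); the repository's actual definitions are richer and may differ.

```scala
// Hypothetical, simplified sketch; fields and signatures are assumptions,
// not the repository's API.
sealed trait Temporal
trait Range extends Temporal

// A point in time, stored here as milliseconds since the epoch (assumption).
case class Time(millis: Long) extends Temporal

// Stands in for "no such time exists", e.g. an empty intersection.
case object NoTime extends Temporal

// A conventional begin-to-end range; `end` is treated as exclusive.
case class GroundedRange(begin: Time, end: Time) extends Range {
  // Intersect two grounded ranges; NoTime if they do not overlap.
  def intersect(other: GroundedRange): Temporal = {
    val b = math.max(begin.millis, other.begin.millis)
    val e = math.min(end.millis, other.end.millis)
    if (b < e) GroundedRange(Time(b), Time(e)) else NoTime
  }
}

// A "floating" range such as "today": an offset and a length that become
// concrete only once a reference time is supplied.
case class UngroundedRange(offsetMillis: Long, lengthMillis: Long) extends Range {
  def ground(reference: Time): GroundedRange = {
    val b = reference.millis + offsetMillis
    GroundedRange(Time(b), Time(b + lengthMillis))
  }
}
```

In this sketch, an expression like "today" would be an UngroundedRange that is grounded against the document's reference time, after which it can be intersected with other grounded ranges.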


Notes