epfl-dlab / quootstrap

Unsupervised method for extracting quotation-speaker pairs from large news corpora.

Quootstrap

This is the reference implementation of Quootstrap, as described in the paper [PDF]:

Dario Pavllo, Tiziano Piccardi, Robert West. Quootstrap: Scalable Unsupervised Extraction of Quotation-Speaker Pairs from Large News Corpora via Bootstrapping. In Proceedings of the 12th International Conference on Web and Social Media (ICWSM), 2018.

Abstract

We propose Quootstrap, a method for extracting quotations, as well as the names of the speakers who uttered them, from large news corpora. Whereas prior work has addressed this problem primarily with supervised machine learning, our approach follows a bootstrapping paradigm and is therefore fully unsupervised. It leverages the redundancy present in large news corpora, more precisely, the fact that the same quotation often appears across multiple news articles in slightly different contexts. Starting from a few seed patterns, such as ["Q", said S.], our method extracts a set of quotation-speaker pairs (Q,S), which are then used for discovering new patterns expressing the same quotations; the process is then repeated with the larger pattern set. Our algorithm is highly scalable, which we demonstrate by running it on the large ICWSM 2011 Spinn3r corpus. Validating our results against a crowdsourced ground truth, we obtain 90% precision at 40% recall using a single seed pattern, with significantly higher recall values for more frequently reported (and thus likely more interesting) quotations. Finally, we showcase the usefulness of our algorithm's output for computational social science by analyzing the sentiment expressed in our extracted quotations.
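
The bootstrapping loop described in the abstract can be illustrated with a toy sketch in Python. This is not the reference implementation, which runs on Spark and also assigns confidence values to patterns and tuples. The function names, the regex-based matching, the rule that a speaker is a run of capitalized words, and the three-sentence corpus are all assumptions made for illustration.

```python
import re

def pattern_to_regex(pattern):
    """Compile a whitespace-separated pattern into a regex (toy semantics)."""
    parts = []
    for tok in pattern.split():
        if tok == "$Q":
            parts.append(r'"(?P<Q>[^"]+)"')  # text between straight quotes
        elif tok == "$S":
            parts.append(r"(?P<S>[A-Z]\w+(?: [A-Z]\w+)*)")  # run of capitalized words
        else:
            parts.append(re.escape(tok))  # literal token
    return re.compile(" ".join(parts))

def extract(patterns, sentences):
    """Apply every pattern to every sentence and collect (quotation, speaker) pairs."""
    pairs = set()
    for sentence in sentences:
        for pattern in patterns:
            m = pattern_to_regex(pattern).fullmatch(sentence)
            if m:
                pairs.add((m.group("Q"), m.group("S")))
    return pairs

def discover(pairs, sentences):
    """Turn sentences that contain a known pair into new candidate patterns."""
    patterns = set()
    for sentence in sentences:
        for quotation, speaker in pairs:
            quoted = f'"{quotation}"'
            if quoted in sentence and speaker in sentence:
                candidate = sentence.replace(quoted, "$Q").replace(speaker, "$S")
                tokens = candidate.split()
                # Keep only candidates with exactly one $Q and one $S token.
                if tokens.count("$Q") == 1 and tokens.count("$S") == 1:
                    patterns.add(candidate)
    return patterns

def bootstrap(seed_patterns, sentences, iterations=2):
    """Alternate extraction and pattern discovery for a fixed number of rounds."""
    patterns = set(seed_patterns)
    pairs = set()
    for _ in range(iterations):
        pairs |= extract(patterns, sentences)
        patterns |= discover(pairs, sentences)
    return pairs, patterns

# Made-up, pre-tokenized corpus: the first two sentences share a quotation.
corpus = [
    '"We will win" , said Jane Doe .',
    'Jane Doe told reporters : "We will win"',
    'John Roe told reporters : "It is over"',
]
pairs, patterns = bootstrap(["$Q , said $S ."], corpus)
```

In this toy run, the seed pattern only matches the first sentence. Because the same quotation reappears in the second sentence, a second pattern is learned from it, and that pattern then extracts the pair from the third sentence.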

Dataset

We release our dataset of quotations as a JSON-formatted file. This is the output of our algorithm on the ICWSM 2011 Spinn3r dataset, which spans one month (January 13th, 2011 to February 14th, 2011) and covers notable events such as the Egyptian protests, the Tunisian revolution, and Super Bowl XLV. The collection consists of 170k quotation-speaker pairs. For more information about the dataset (such as the row format), refer to the "Exporting results" section. The conditions for using the dataset are described in the License section.

Download URL (25 MB compressed, 140 MB decompressed): https://drive.google.com/file/d/1ybp71ClLUkFnADQEFgcpC2qGuwFx9IEp/view

How to run

Go to the Release section and download the .zip archive, which contains the executable quootstrap.jar as well as all necessary dependencies and configuration files. The archive also includes a script, extraction_quotations.sh, for running the application on a YARN cluster. The script runs this command:

spark-submit --jars spinn3r-client-3.4.05-edit.jar,stanford-corenlp-3.8.0.jar,jsoup-1.10.3.jar,guava-14.0.1.jar \
    --num-executors 8 \
    --executor-cores 38 \
    --driver-memory 192g \
    --executor-memory 192g \
    --conf "spark.yarn.executor.memoryOverhead=32768" \
    --class ch.epfl.dlab.quootstrap.QuotationExtraction \
    --master yarn \
    quootstrap.jar $1

where $1 represents the log/output directory. After tuning the settings to suit your particular configuration, you can run the command as:

./extraction_quotations.sh outputDirectory

Setup

To run our code, you need:

The Spinn3r dataset must be converted to JSON format using the tool that we provide. You can find more details in the README in src\main\java\ch\epfl\dlab\spinn3r\converter. The reason for this requirement is that our format is more suitable for distributed processing, whereas the original dataset is stored in large non-splittable archives.

For reference, this is the command that we used. It creates an output file dataset.json, stored on HDFS and split into 500 gzip-compressed chunks. Only news articles are kept, and duplicates are removed.

spark-submit \
    --jars spinn3r-client-3.4.05-edit.jar,stanford-corenlp-3.8.0.jar,jsoup-1.10.3.jar \
    --num-executors 8 \
    --executor-cores 38 \
    --driver-memory 128g \
    --executor-memory 128g \
    --conf "spark.yarn.executor.memoryOverhead=16384" \
    --class ch.epfl.dlab.spinn3r.converter.ProtoToJson --master yarn \
    quootstrap.jar /datasets/Spinn3r/icwsm2011/*-OTHER.tar.gz dataset.json \
    --compress=GzipCodec --partitions=500 --source-type=MAINSTREAM_NEWS --remove-duplicates

How to build

Clone the repository and import it as an Eclipse project. All dependencies are downloaded through Maven. To build the application, generate a .jar file containing all source files and run it as explained in the "How to run" section. Alternatively, you can use Spark in local mode for experimenting. Instructions on how to extend the project with new functionality (e.g. support for new datasets) are given in later sections.

Configuration

The first configuration file is config.properties. The most important fields in order to get the application running are:

The second configuration file is seedPatterns.txt, which, as the name suggests, contains the seed patterns that are used in the first iteration, one per line.

Evaluation

If either ENABLE_FINAL_EVALUATION or ENABLE_INTERMEDIATE_EVALUATION is enabled in the configuration, the application produces a report by comparing the output against the ground truth. Note that this is a costly operation: if you don't need it, we recommend disabling it (or enabling only the final evaluation). For each iteration X, the following files are generated:

If DEBUG_DUMP_PATTERNS is enabled, three additional files are generated at each iteration:

Note that a valid pattern:

For instance, given the pattern

$Q said $* $* $S .

and the sentence

"We are called here to mourn an unspeakable act of violence" said House Speaker John Boehner.

the resulting tuple is (We are called here to mourn an unspeakable act of violence, John Boehner).
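
The example above can be reproduced with a small Python sketch. It assumes, based only on that example, that $* consumes exactly one token, that $Q is the text between straight quotes, and that $S captures the remaining tokens. The function name match_pattern and the regex translation are hypothetical, and the sketch does not link the speaker to a Freebase ID as the exported results do.

```python
import re

def match_pattern(pattern, sentence):
    """Match a tokenized sentence against a pattern; return (quotation, speaker) or None."""
    parts = []
    for tok in pattern.split():
        if tok == "$Q":
            parts.append(r'"(?P<Q>[^"]+)"')  # quotation between straight quotes
        elif tok == "$S":
            parts.append(r"(?P<S>\S+(?: \S+)*?)")  # one or more tokens (assumption)
        elif tok == "$*":
            parts.append(r"\S+")  # exactly one arbitrary token (assumption)
        else:
            parts.append(re.escape(tok))  # literal token such as "said" or "."
    m = re.fullmatch(" ".join(parts), sentence)
    if m is None:
        return None
    groups = m.groupdict()
    return groups.get("Q"), groups.get("S")

# The sentence is given in tokenized form, so the final period is a separate token.
sentence = ('"We are called here to mourn an unspeakable act of violence" '
            "said House Speaker John Boehner .")
result = match_pattern("$Q said $* $* $S .", sentence)
```

Here the two $* placeholders absorb "House" and "Speaker", which leaves "John Boehner" for $S.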

Exporting results

You can export the retrieved quotation-speaker pairs by setting EXPORT_RESULTS to true and setting EXPORT_PATH to the HDFS output path. Again, this is a costly operation, so we recommend disabling it if it is not needed. The results are saved as an HDFS text file formatted in JSON, with one record per line. Each record contains the full quotation, the full name of the speaker (as reported in the article), the speaker's unique Freebase ID, the confidence value of the tuple, and the occurrences in which the quotation was found. For each occurrence, we report the article ID, an incremental offset within the article (which is useful for linking together split quotations), the pattern that extracted the tuple along with its confidence, the website, and the date the article appeared.

{
  "quotation": "Now, going forward, this moment of volatility has to be turned into a moment of promise. The United States has a close partnership with Egypt and we've cooperated on many issues, including working together to advance a more peaceful region. But we've also been clear that there must be reform -- political, social and economic reforms that meet aspirations of the Egyptian people,",
  "canonicalQuotation": "now going forward this moment of volatility has to be turned into a moment of promise the united states has a close partnership with egypt and weve cooperated on many issues including working together to advance a more peaceful region but weve also been clear that there must be reform political social and economic reforms that meet aspirations of the egyptian people",
  "speaker": "Barack Obama",
  "speakerID": "http://rdf.freebase.com/ns/m.02mjmr",
  "confidence": 1,
  "occurrences": [
    {
      "articleUID": "1296259645046202892",
      "articleOffset": 4,
      "extractedBy": "$Q , $S said",
      "patternConfidence": 1,
      "quotation": "This moment of volatility has to be turned into a moment of promise,",
      "website": "www.daytondailynews.com",
      "date": "2011-01-28T23:58:06Z"
    },
    {
      "articleUID": "1296272315046514180",
      "articleOffset": 0,
      "extractedBy": "$Q , $S said",
      "patternConfidence": 1,
      "quotation": "This moment of volatility has to be turned into a moment of promise,",
      "website": "www.guardian.co.uk",
      "date": "2011-01-29T00:36:25Z"
    },
    {
      "articleUID": "1296282767057311234",
      "articleOffset": 7,
      "extractedBy": "$Q , $S said",
      "patternConfidence": 1,
      "quotation": "Going forward, this moment of volatility has to be turned into a moment of promise,",
      "website": "www.theledger.com",
      "date": "2011-01-29T05:19:02Z"
    }
  ]
}
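
Since each record sits on its own line, the export can be processed one line at a time. The following Python sketch shows one way to do that, using the field names from the record above. The helper names and the made-up sample records are assumptions for illustration and are not part of the project.

```python
import json
from collections import Counter

def load_records(lines):
    """Parse one JSON record per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

def quotations_per_speaker(records, min_confidence=0.0):
    """Count records per speaker, keeping only tuples above a confidence threshold."""
    counts = Counter()
    for record in records:
        if record["confidence"] >= min_confidence:
            counts[record["speaker"]] += 1
    return counts

def source_websites(record):
    """Return the sorted, de-duplicated websites on which a quotation occurred."""
    return sorted({occ["website"] for occ in record["occurrences"]})

# Tiny in-memory stand-in for the exported file (made-up speakers and values).
sample_lines = [
    json.dumps({"quotation": "We will win,", "speaker": "Jane Doe", "confidence": 1,
                "occurrences": [{"website": "b.example"}, {"website": "a.example"}]}),
    json.dumps({"quotation": "It is over,", "speaker": "John Roe", "confidence": 0.4,
                "occurrences": [{"website": "a.example"}]}),
]
records = list(load_records(sample_lines))
print(quotations_per_speaker(records, min_confidence=0.5))
print(source_websites(records[0]))
```

For a real export copied to local disk, an open file object can be passed to load_records in place of sample_lines, since iterating over a text file yields one line at a time.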

Remarks:

Adding support for new datasets/formats

If you want to add support for other datasets/formats, you can provide a concrete implementation of the Java interface DatasetLoader and specify its full class name in the NEWS_DATASET_LOADER field of the configuration. For each article, you must supply a unique ID (int64/long), the website on which it was published, and its content in tokenized format, i.e. as a list of strings. We provide an implementation for our JSON Spinn3r dataset in ch.epfl.dlab.quootstrap.Spinn3rDatasetLoader, and for Parquet dataframes in ch.epfl.dlab.quootstrap.ParquetDatasetLoader.

Replacing the tokenizer

If, for any reason (e.g. license, language other than English), you do not want to depend on Stanford PTBTokenizer, you can provide your own implementation of the ch.epfl.dlab.spinn3r.Tokenizer interface. You only have to implement two methods: tokenize and untokenize. Tokenization is one of the least critical steps in our pipeline, and does not impact the final result significantly.
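
To make the tokenize/untokenize contract concrete, here is a minimal Python sketch of a naive tokenizer. The actual interface is Java and its method signatures are not reproduced here, so the class name, argument types, and splitting rules below are assumptions. A real replacement would have to implement ch.epfl.dlab.spinn3r.Tokenizer in Java.

```python
import re

class NaiveTokenizer:
    """Illustrative stand-in; not the project's Java Tokenizer interface."""

    def tokenize(self, text):
        # Each run of word characters is one token; any other
        # non-space character becomes a token of its own.
        return re.findall(r"\w+|[^\w\s]", text)

    def untokenize(self, tokens):
        # Join with spaces, then reattach common punctuation to the preceding word.
        return re.sub(r" ([,.;:!?])", r"\1", " ".join(tokens))

tokenizer = NaiveTokenizer()
tokens = tokenizer.tokenize("Hello, world.")
restored = tokenizer.untokenize(tokens)
```

This version round-trips simple sentences only. Contractions and quotation marks, for example, are not restored to their original spacing.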

License

We release our work under the MIT license. Third-party components, such as Stanford CoreNLP, are subject to their respective licenses.

If you use our code and/or data in your research, please cite our paper [PDF]:

@inproceedings{quootstrap2018,
  title={Quootstrap: Scalable Unsupervised Extraction of Quotation-Speaker Pairs from Large News Corpora via Bootstrapping},
  author={Pavllo, Dario and Piccardi, Tiziano and West, Robert},
  booktitle={Proceedings of the 12th International Conference on Web and Social Media (ICWSM)},
  year={2018}
}