Citron is an experimental quote extraction and attribution system created by BBC R&D, based on a paper and a dataset developed by the School of Informatics at the University of Edinburgh.
It can be used to extract quotes from text documents, attributing them to the appropriate speaker and resolving pronouns where necessary. It supports direct and indirect quotes (with and without quotation marks respectively) and mixed quotes (which have direct and indirect parts). Note that there can be a significant number of errors and omissions. Extracted quotes should be checked against the input text.
You can run Citron using the pre-trained model or train your own model. You can also evaluate its performance.
Training and evaluating models requires data in Citron's Annotation Format. Citron provides pre-processing scripts to extract suitable data from the PARC 3.0 Corpus of Attribution Relations. Alternatively, you can create your own data using the Citron Annotator app.
Technical details and potential applications are discussed in: "Quote Extraction and Analysis for News".
Citron requires Python 3.7.2 or above. When using the pre-trained model, install the package versions listed in requirements.txt.
git clone git@github.com:bbc/citron.git
Then from the citron root directory:
python3 -m pip install -r requirements.txt
Then, from a python3 session, download the NLTK names corpus:
import nltk
nltk.download("names")
Scripts to run Citron are available in the bin/ directory.
All scripts require the citron root directory in the PYTHONPATH.
$ export PYTHONPATH=$PYTHONPATH:/path/to/citron_root_directory
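When Citron is imported from your own Python code rather than run through the bin/ scripts, a similar effect can be had for the current process only by extending sys.path. The following is a minimal sketch: the helper name add_to_python_path is hypothetical and not part of Citron, and the path in the usage note is a placeholder.

```python
import os
import sys


def add_to_python_path(root_dir):
    """Append root_dir to sys.path for this process, if it is not already there.

    Hypothetical helper for illustration; it is not part of Citron.
    Returns the absolute path that was registered.
    """
    root_dir = os.path.abspath(root_dir)
    if root_dir not in sys.path:
        sys.path.append(root_dir)
    return root_dir
```

For example, calling add_to_python_path("/path/to/citron_root_directory") before "from citron.citron import Citron" should make the package importable without changing the shell environment. Unlike the export command, this does not affect other processes.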
$ citron-server
--model-path Path to Citron model directory
--logfile Path to logfile (Optional)
--port Port for the Citron API (Optional: default is 8080)
-v Verbose mode (Optional)
$ citron-extract
--model-path Path to Citron model directory
--input-file Path to input file (Optional: Otherwise read from stdin)
--output-file Path to output file (Optional: Otherwise write to stdout)
-v Verbose mode (Optional)
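The citron-extract options listed above can also be assembled from Python, which may be convenient when calling the script from another program. This is a sketch under stated assumptions: the helper names build_extract_command and run_extract are hypothetical, the script is assumed to be reachable on PATH with PYTHONPATH set as described earlier, and the output is returned as raw text because its format is not described here.

```python
import subprocess


def build_extract_command(model_path, input_file=None, output_file=None, verbose=False):
    """Build the argument list for citron-extract using the documented options.

    Hypothetical helper; only --model-path is always included.
    """
    command = ["citron-extract", "--model-path", str(model_path)]
    if input_file is not None:
        command += ["--input-file", str(input_file)]    # otherwise stdin is read
    if output_file is not None:
        command += ["--output-file", str(output_file)]  # otherwise stdout is written
    if verbose:
        command.append("-v")
    return command


def run_extract(text, model_path):
    """Pipe text to citron-extract on stdin and return whatever it writes to stdout.

    Assumes citron-extract is on PATH; raises CalledProcessError on failure.
    """
    result = subprocess.run(
        build_extract_command(model_path),
        input=text,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Building the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.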
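from citron.citron import Citron
from citron import utils

model_path = "/path/to/citron_model_directory"  # placeholder: path to a Citron model directory
text = "Text of the input document."            # placeholder: the text to analyse

nlp = utils.get_parser()          # create the parser
citron = Citron(model_path, nlp)  # load the model
doc = nlp(text)                   # parse the input text
quotes = citron.get_quotes(doc)   # extract and attribute quotes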
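When several documents need processing, it is usually preferable to create the parser and the Citron object once and reuse them for each text. The sketch below assumes only the two calls shown above, nlp(text) and citron.get_quotes(doc). The function name get_quotes_for_texts is hypothetical, and nothing is assumed about the structure of the returned quotes.

```python
def get_quotes_for_texts(citron, nlp, texts):
    """Return one get_quotes() result per input text, in input order.

    Hypothetical helper: `citron` and `nlp` are expected to be created once
    by the caller, e.g. nlp = utils.get_parser(); citron = Citron(model_path, nlp).
    """
    results = []
    for text in texts:
        doc = nlp(text)                         # parse each text with the shared parser
        results.append(citron.get_quotes(doc))  # reuse the already loaded model
    return results
```

Passing the objects in as arguments keeps the helper independent of how the model is loaded, and allows it to be exercised with stand-in objects. As noted earlier, the extracted quotes should still be checked against the input text.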
Issues can be reported on the issue tracker and questions can be raised on the discussion board.
Contributions are welcome. Please refer to the contributing guidelines.
Licensed under the Apache License, Version 2.0.
The pre-trained model is separately licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International licence and the VerbNet 3.0 license.
For more information please contact: chris.newell@bbc.co.uk
Copyright 2021 British Broadcasting Corporation.