arne-cl / discoursegraphs

linguistic converter / merging tool for multi-level annotated corpora. graph-based (using Python and NetworkX).
BSD 3-Clause "New" or "Revised" License

DiscourseGraphs

.. image:: http://img.shields.io/pypi/v/discoursegraphs.svg
   :alt: Latest version
   :align: right
   :target: https://pypi.python.org/pypi/discoursegraphs
.. image:: http://img.shields.io/badge/license-BSD-yellow.svg
   :alt: BSD License
   :align: right
   :target: http://opensource.org/licenses/BSD-3-Clause

.. image:: https://travis-ci.org/arne-cl/discoursegraphs.svg?branch=master
   :alt: Build status
   :align: right
   :target: https://travis-ci.org/arne-cl/discoursegraphs
.. image:: https://codecov.io/github/arne-cl/discoursegraphs/coverage.svg?branch=master
   :alt: Test coverage
   :align: right
   :target: https://codecov.io/github/arne-cl/discoursegraphs?branch=master
.. image:: https://www.quantifiedcode.com/api/v1/project/3076854b9ea74bed867f12808d98f437/badge.svg
   :alt: Code Issues
   :align: right
   :target: https://www.quantifiedcode.com/app/project/3076854b9ea74bed867f12808d98f437
.. image:: https://img.shields.io/docker/build/nlpbox/charniak.svg
   :alt: Docker build status
   :align: right
   :target: https://hub.docker.com/r/nlpbox/charniak

This library enables you to process linguistic corpora with multiple levels of annotation by:

  1. converting the different annotation formats into separate graphs,
  2. merging these graphs into a single multidigraph (based on the common tokenization of the annotation layers),
  3. exporting your (merged) graphs into several output formats, and
  4. `visualizing linguistic graphs`_ directly in an `IPython notebook`_

.. _`visualizing linguistic graphs`: http://nbviewer.ipython.org/github/arne-cl/alt-mulig/blob/master/python/discoursegraphs-visualization-examples.ipynb
.. _`IPython notebook`: http://ipython.org/notebook.html
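
The merging step can be pictured with plain NetworkX, without relying on this library's own API. The following is only a minimal sketch: the node IDs (``token:0``, ``syntax:NP``, ``rst:segment1``), the ``layer`` edge attribute and the ``token_layer`` helper are made up for illustration and are not the names discoursegraphs uses internally. Two annotation layers that share the same token nodes are combined with ``networkx.compose``::

    import networkx as nx

    TOKENS = ["Die", "Stadt", "spart"]


    def token_layer(name):
        """Return a MultiDiGraph that contains only the shared token nodes.

        Hypothetical helper: every layer gets the same token node IDs,
        which stands in for "common tokenization".
        """
        graph = nx.MultiDiGraph(name=name)
        for index, token in enumerate(TOKENS):
            graph.add_node("token:%d" % index, token=token)
        return graph


    # Layer 1: a syntax node that dominates the first two tokens.
    syntax = token_layer("syntax")
    syntax.add_node("syntax:NP", cat="NP")
    syntax.add_edge("syntax:NP", "token:0", layer="syntax")
    syntax.add_edge("syntax:NP", "token:1", layer="syntax")

    # Layer 2: a discourse segment that spans all three tokens.
    rst = token_layer("rst")
    rst.add_node("rst:segment1")
    for index in range(len(TOKENS)):
        rst.add_edge("rst:segment1", "token:%d" % index, layer="rst")

    # compose() unites nodes and edges; nodes with identical IDs are merged.
    merged = nx.compose(syntax, rst)

    # Both layers now point at the same token node.
    print(sorted(merged.predecessors("token:0")))

Because both graphs use identical IDs for their token nodes, the tokens act as the anchor between the layers, and each edge keeps a ``layer`` attribute that records where it came from.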

Import formats

So far, the following formats can be imported and merged:

.. _TigerXML: http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/TIGERSearch/doc/html/TigerXML.html
.. _`NeGra Export Format`: http://www.sfs.uni-tuebingen.de/resources/exformat3.ps
.. _RSTTool: http://www.wagsoft.com/RSTTool/
.. _URML: http://www.david-reitter.com/compling/urml/index.html
.. _MMAX2: http://mmax2.sourceforge.net/
.. _`CoNLL 2009`: http://ufal.mff.cuni.cz/conll2009-st/task-description.html
.. _`CoNLL 2010`: http://web.archive.org/web/20130119013221/http://www.inf.u-szeged.hu/rgai/conll2010st
.. _Conano: http://www.ling.uni-potsdam.de/acl-lab/Forsch/pcc/pcc.html

Export formats

discoursegraphs can export graphs into the following formats / for the following tools:

Installation

This should work on both Linux and Mac OS X using `Python 2.7`_ and either pip_ or easy_install.

.. _`Python 2.7`: https://www.python.org/downloads/
.. _pip: https://pip.pypa.io/en/latest/installing.html

Install from PyPI


::

    pip install discoursegraphs # prepend 'sudo' if needed

or, if you're oldschool:

::

    easy_install discoursegraphs # prepend 'sudo' if needed

Install from source

::

    sudo apt-get install python-dev libxml2-dev libxslt-dev pkg-config graphviz-dev libgraphviz-dev -y
    sudo easy_install -U setuptools
    git clone https://github.com/arne-cl/discoursegraphs.git
    cd discoursegraphs
    sudo python setup.py install

Usage

The command line interface of DiscourseGraphs allows you to merge syntax, rhetorical structure, connective and expletive annotation files into one graph and to store this graph in one of several output formats (e.g. the geoff_ format used by the neo4j_ graph database or the dot_ format used by the graphviz plotting tool).

.. _neo4j: http://www.neo4j.org/
.. _dot: http://www.graphviz.org/content/dot-language
.. _geoff: http://www.neo4j.org/develop/python/geoff

::

    discoursegraphs -t syntax/maz-13915.xml -r rst/maz-13915.rs3 -c connectors/maz-13915.xml -a anaphora/tosik/das/maz-13915.txt -o dot
    dot -Tpdf doc.dot > discoursegraph.pdf # generates a PDF from the dot file

If you're interested in working with just one of those layers, you'll have to call the code directly::

    import discoursegraphs as dg
    tiger_docgraph = dg.read_tiger('syntax/doc.xml')
    rst_docgraph = dg.read_rs3('rst/doc.rs3')
    expletives_docgraph = dg.read_anaphoricity('expletives/doc.txt')

All the document graphs generated in this example are derived from the networkx.MultiDiGraph_ class, so you should be able to use all of its methods.

.. _networkx.MultiDiGraph: http://networkx.lanl.gov/reference/classes.multidigraph.html
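
To give an idea of what those inherited methods offer, here is a minimal sketch on a hand-built stand-in graph. The node IDs and the ``label`` values are invented for illustration and do not reflect the attributes that the reader functions actually set; a real document graph should accept the same NetworkX calls::

    import networkx as nx

    # Hand-built stand-in for a document graph (hypothetical IDs and labels).
    docgraph = nx.MultiDiGraph()
    docgraph.add_edge("S", "NP", label="SB")
    docgraph.add_edge("S", "VP", label="HD")
    docgraph.add_edge("NP", "token:0", label="NK")
    docgraph.add_edge("NP", "token:1", label="NK")
    docgraph.add_edge("VP", "token:2", label="HD")
    # A multidigraph may hold parallel edges between the same pair of nodes.
    docgraph.add_edge("NP", "token:1", label="coref")

    # Direct children of a node.
    children = sorted(docgraph.successors("NP"))

    # Everything reachable from the root node.
    dominated = sorted(nx.descendants(docgraph, "S"))

    # Select edges by attribute value.
    head_edges = [(source, target)
                  for source, target, attrs in docgraph.edges(data=True)
                  if attrs.get("label") == "HD"]

    print(children)
    print(dominated)
    print(head_edges)

The parallel ``NP`` to ``token:1`` edges show why a multidigraph is a natural fit here: several annotation layers can relate the same two nodes without overwriting each other.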

Documentation

Source code documentation is available `here <https://pythonhosted.org/pypolibox/>`_, but you can always get an up-to-date local copy using Sphinx_.

You can generate an HTML or PDF version by running these commands in the docs directory::

    make latexpdf

to produce a PDF (docs/_build/latex/discoursegraphs.pdf) and ::

    make html

to produce a set of HTML files (docs/_build/html/index.html).

.. _Sphinx: http://sphinx-doc.org/

Requirements

If you'd like to visualize your graphs, you will also need:

License and Citation

This software is released under a 3-Clause BSD license. If you use discoursegraphs in your academic work, please cite the following paper:

Neumann, A. 2015. discoursegraphs: A graph-based merging tool and converter for multilayer annotated corpora. In Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015), pp. 309-312.

::

    @inproceedings{neumann2015discoursegraphs,
      title={discoursegraphs: A graph-based merging tool and converter for multilayer annotated corpora},
      author={Neumann, Arne},
      booktitle={Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015)},
      pages={309-312},
      year={2015}
    }

Author

Arne Neumann

People who downloaded this also like

.. _SaltNPepper: https://korpling.german.hu-berlin.de/p/projects/saltnpepper/wiki/
.. _educe: https://github.com/irit-melodi/educe
.. _treetools: https://github.com/wmaier/treetools
.. _TCFnetworks: https://github.com/SeNeReKo/TCFnetworks