boberle / corefconversion

Conversion scripts for coreference
Mozilla Public License 2.0

Conversion Scripts for Coreference

This repository contains conversion scripts for coreference.

Here is a "table of contents":

The jsonlines2text.py script

Script to convert a jsonlines file to a text representation of the coreference annotation. The output is HTML. Mentions are surrounded by brackets. Coreference chains are represented by colors (each chain has a specific color) and, if requested by a switch, by an index (1, 2, 3...). Singletons may be hidden, or shown in a specific color (gray by default) without any index.

If your jsonlines file contains several documents, you may show the document name by using the --heading option.

In any case, use the -h or --help switch to get a detailed list of options.

Here are some examples (command, then illustration):

(1) Color without index:

python3 jsonlines2text.py testing/docs.jsonlines -o output.html

(2) Color with index:

python3 jsonlines2text.py testing/docs.jsonlines -i -o output.html

(Note: indices don't start at 1 in the image because it's not the beginning of the text.)

(3) Hide singletons:

python3 jsonlines2text.py testing/docs.jsonlines -i -o output.html --sing-color ""

(4) No color (cm stands for color manager):

python3 jsonlines2text.py testing/docs.jsonlines -i -o output.html --sing-color "" --cm ""

(5) Using common HTML colors (more contrast, but fewer available colors, so several chains may have the same color):

python3 jsonlines2text.py testing/docs.jsonlines -i -o output.html --sing-color "" --cm "common"

(6) Limiting the output to the first N tokens:

python3 jsonlines2text.py testing/docs.jsonlines -i -o output.html -n 100

The jsonlines2conll.py script

Script to convert a jsonlines file to a CoNLL file. Use the -h or --help switch to get detailed help on the options.

Example command (output uses spaces):

python3 jsonlines2conll.py -g testing/singe.jsonlines -o output.conll
#begin document (ge/articleswiki_singe.xml); part 000
Singe   (0)

         Les         (0
      singes         0)
        sont          -
         des         (0
  mammifères          -
          de          -
          l'         (1
       ordre          -
         des          -
          de          -
         les         (2
    primates      1)|2)
...
#end document

Example command (merging coreference information with an existing conll file, for example to add predicted coreference):

python3 jsonlines2conll.py -g testing/singe.jsonlines -o output.conll -c testing/singe.conll
#begin document (ge/articleswiki_singe.xml); part 000
1   Singe   Singe   NOUN   ...

   1            Les             le     DET   ...
   2         singes          singe    NOUN   ...
   3           sont           être     AUX   ...
   4            des             un     DET   ...
   5     mammifères      mammifère    NOUN   ...
   6             de             de     ADP   ...
   7             l'             le     DET   ...
   8          ordre          ordre    NOUN   ...
9-10            des              _       _   ...
   9             de             de     ADP   ...
  10            les             le     DET   ...
  11       primates        primate    NOUN   ...
...
#end document

Example command (merging + output uses tabulation):

python3 jsonlines2conll.py -g testing/singe.jsonlines -o output.conll -c testing/singe.conll -T

The conll2jsonlines.py script

Script to convert a CoNLL-formatted file to a jsonlines-formatted file. Use the -h or --help switch to get detailed help on the options.

For example, to convert from the original CoNLL2012 format into jsonlines format:

python3 conll2jsonlines.py \
  --token-col 3 \
  --speaker-col 9 \
  INPUT_FILE \
  OUTPUT_FILE

To convert from the StanfordNLP format into jsonlines format:

python3 conll2jsonlines.py \
  --skip-singletons \
  --skip-empty-documents \
  --tab \
  --ignore-double-indices 0 \
  --token-col 1 \
  --speaker-col "_" \
  --no-coref \
  INPUT_FILE \
  OUTPUT_FILE

To convert from the Democrat corpus in CoNLL format (with a column for paragraphs at position 11):

python3 conll2jsonlines.py \
  --tab \
  --ignore-double-indices 0 \
  --token-col 1 \
  --speaker-col "_" \
  --par-col 11 \
  testing/singe.conll \
  testing/singe.jsonlines

Note that you may have to change document keys in the CoNLL files before running this script if you want to transform them.

Output sample:

{
   "doc_key": "(ge/articleswiki_singe.xml); part 000",
   "clusters": [[[0, 0], [1, 2], [4, 12]], [[7, 12]], [[11, 12]]],
   "sentences": [["Singe"],
                 ["Les", "singes", "sont", "des", "mammif\u00e8res", "de",
                  "l'", "ordre", "des", "de", "les", "primates", "."]],
   "speakers": [["_"],
                ["_", "_", "_", "_", "_", "_",
                 "_", "_", "_", "_", "_", "_", "_"]],
   "paragraphs": [[0, 0], [1, 13]]
}
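The "clusters" are lists of mentions, each mention being a [start, end] pair of token indices (inclusive) over the flattened document: in the sample above, [0, 0] is "Singe" and [1, 2] is "Les singes". Here is a minimal sketch, using only the standard library, of how to recover the mention strings from such a file:

import json

# Read the first document and print the text of each mention.
with open("testing/singe.jsonlines") as f:
    doc = json.loads(f.readline())

tokens = [tok for sent in doc["sentences"] for tok in sent]
for cluster in doc["clusters"]:
    print([" ".join(tokens[start:end + 1]) for start, end in cluster])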

The sacr2conll.py script

Script to convert from the SACR format to a CoNLL-2012-like format. Note that the CoNLL format produced is tab-separated, with the token index in the first column, the token itself in the second, and the coreference annotation in the last column.

Here is the command to convert the SACR files in the testing directory:

python3 sacr2conll.py -o testing/testing_sacr2conll.conll testing/*.sacr

This will produce a testing/testing_sacr2conll.conll file which contains all the input files specified on the command line, converted into a CoNLL-like format.

Here is an extract:

#begin document (aesop.sacr); part 000
0   A   (0
1   Peasant 0)
2   found   -
3   an  (1
4   Eagle   -
5   captured    -
6   in  -
7   a   (2
8   trap    2)_1)
...
#end document

#begin document (caesar.sacr); part 000
0   Gaius   (0
1   Julius  -
2   Caesar  0)
3   (   -
4   12  (1
5   or  -
6   13  -
7   July    -
8   100 -
9   BC  1)
10  –   -
11  15  (2
12  March   -
13  44  -
14  BC  2)
15  )   -
16  ,   -
17  known   -
18  by  -
19  his (3_(4_(0)
20  nomen   4)
21  and -
22  cognomen    (5)
23  Julius  -
24  Caesar  3)
...

Please refer to the -h option for a complete list of options.

With the --speaker switch, you can add a fourth column, which will be placed before the coreference column. In the SACR file, the speaker can be mentioned in a comment prefixed with #speaker: before each paragraph, like this:

#title: Lucian, Dialogues of the Dead, 4: Hermes and Charon

#speaker: Hermes
Ferryman, what do you say to settling up accounts? It will prevent any
unpleasantness later on.

#speaker: Charon
Very good. It does save trouble to get these things straight.

This will produce a CoNLL file like this:

#begin document (lucian_speakers.sacr); part 000
0   Ferryman    Hermes  -
1   ,   Hermes  -
2   what    Hermes  -
3   do  Hermes  -
4   you Hermes  -
5   say Hermes  -
6   to  Hermes  -
7   settling    Hermes  -
8   up  Hermes  -
9   accounts    Hermes  -
10  ?   Hermes  -

0   It  Hermes  -
1   will    Hermes  -
2   prevent Hermes  -
3   any Hermes  -
4   unpleasantness  Hermes  -
5   later   Hermes  -
6   on  Hermes  -
7   .   Hermes  -

0   Very    Charon  -
1   good    Charon  -
2   .   Charon  -

You can remove the speaker for a paragraph by setting:

#speaker:
... the text of the narrator ...

A test file is available in testing/lucian_speakers.sacr.

To convert a SACR file to a jsonlines file, you will need to run these two commands:

python3 sacr2conll.py -s -o /tmp/lucian_speakers.conll testing/lucian_speakers.sacr
python3 conll2jsonlines.py --token-col 1 --speaker-col 2 /tmp/lucian_speakers.conll /tmp/lucian_speakers.jsonlines

The conll2sacr.py script

The opposite of sacr2conll.py. It converts a CONLL-2012 or CONLL-X file into a SACR file.

The script takes an output directory as a parameter: each document in the CoNLL file is written to a separate file in this directory.

Here is the command to convert back the SACR files converted to CONLL in the previous section:

python3 conll2sacr.py \
   --output-dir testing_conll2sacr \
   --tab \
   --token-col 1 \
   testing/testing_sacr2conll.conll

Note the --tab option (because here the CONLL file is tab separated) and the --token-col option which indicates that the tokens are to be found in the second column (index starts at 0).

If you were to parse a real CoNLL-2012 file (the original format), you would have to drop the --tab option (because the original format is separated by spaces, not tabs) and the --token-col option (or set it to 3, which is the default).

The command produces a series of files in the testing_conll2sacr directory:

_aesop.sacr___part_000
_caesar.sacr___part_000
_cicero.sacr___part_000
_pliny.sacr___part_000
_simple.sacr___part_000

Note that special characters (here the parentheses and spaces) have been replaced by underscores.

Please refer to the -h option for a complete list of options.

text2jsonlines.py

Script to convert plain text to the jsonlines format (used, for example, by cofr).

It tokenizes the text with StanfordNLP. You need to install StanfordNLP via pip and then download the models, for example the French models (use "en" for the English models):

python3 -c "import stanfordnlp; stanfordnlp.download('fr')"


Usage:

python3 text2jsonlines.py <plain.txt> -o <output.jsonlines>

Choose the language with the --lang option (en by default, use fr for French).

Example with the sentence "I eat an apple.":

{
   "doc_key": "ge:input.txt",
   "sentences": [["I", "eat", "an", "apple", "."]],
   "speakers": [["_", "_", "_", "_", "_"]],
   "clusters": [],
   "pos": [["PRON", "VERB", "DET", "NOUN", "PUNCT"]],
   "paragraphs": [[0, 4]]
}

jsonlines2tei.py

Script to convert the jsonlines format into the TEI-URS format used by software such as TXM. See the jsonlines2tei repository.

Function library conll_transform.py

Module containing several functions to manipulate CoNLL data.

This module is available on PyPI. To install it:

pip3 install conll-transform

To use it, just import the functions from conll_transform, for example:

from conll_transform import read_files

documents = read_files("myfile.conll", "myfile2.conll")
print(documents)

Convert a SACR file to Brat Standoff Annotation using sacr2ann.py

The script sacr2ann.py will convert a SACR file to the two files used by BRAT: a .txt file containing the text and an .ann file containing the annotations.

The format is described here.

Only a subset of the BRAT annotations is taken into account for now, namely the text-bound annotations (with a leading T) and the relation annotations (with a leading R).

The type of a text-bound annotation is taken from the SACR property named by --type-property-name. This means that if your SACR schema has a property type containing the type of the annotation, and you want that type to be reflected in your BRAT annotation, you would call sacr2ann.py with the option --type-property-name type. If this option is not specified, or if a mention is missing the property, the default type is Mention.

The type of the relation annotation is Coreference.

If you need more annotations from the BRAT format, don't hesitate to ask me by sending me a message or opening an issue.

Here is an example.

Let's say you have the following SACR file aesop.sacr:

{Peasant:type="Person" A Peasant} found {Eagle:type="Animal" an Eagle captured in {M3:type="Object" a trap}},
and much admiring {Eagle:type="Animal" the bird}, set {Peasant:type="Person" him} free.

Then running:

python3 sacr2ann.py --type-property-name type aesop.sacr

will produce 2 files, aesop.sacr.txt and aesop.sacr.ann. You can specify the output files with the --txt and --ann options.

Here is the .txt file:

A Peasant found an Eagle captured in a trap,and much admiring the bird, set him free.

Here is the .ann file:

T1      Person 0 9      A Peasant
T2      Animal 16 43    an Eagle captured in a trap
T3      Object 37 43    a trap
T4      Animal 62 70    the bird
R1      Coreference Arg1:T2 Arg2:T4
T5      Person 76 79    him
R2      Coreference Arg1:T1 Arg2:T5

Pandas Dataframes and relational databases with Annotable, sacr2annotable.py and sacr2df.py

The Annotable class and its subclasses Corpus, Text, Paragraph, Sentence, Token, Mention and Chain are a great way to transform a corpus into a series of dataframes usable with Pandas.

The script sacr2annotable.py parses SACR files into a Corpus object, which can then be converted into dataframes or CSV files.

The script sacr2df.py takes a series of SACR files as input and outputs a zip file of CSV files. You can use it like this:

python3 sacr2df.py text1.sacr text2.sacr ... -o output_file.zip

You can also use it as a library, for example in a Jupyter notebook (see below for an example):

from sacr2df import convert_sacr_files_to_dataframes
from pathlib import Path

dfs = convert_sacr_files_to_dataframes(
    Path("testing/aesop.sacr"),
    Path("testing/caesar.sacr"),
    Path("testing/cicero.sacr"),
    Path("testing/pliny.sacr"),
)

# then do something with the dfs:
print(dfs.texts.head())
print(dfs.paragraphs.head())
print(dfs.sentences.head())
print(dfs.tokens.head())
print(dfs.text_chains.head())
print(dfs.text_mentions.head())
print(dfs.text_consecutive_relations.head())
print(dfs.text_to_first_relations.head())

Each dataframe contains a series of columns: index information (like the index of the mention in its chain), count information (chain size, token length), strings (the actual string of a mention or a token), metadata (properties for mentions, like part of speech or function, if they are annotated; metadata for texts, as annotated in the file), etc. You will find the whole list below.

For mentions, properties are added as columns of the dataframe or CSV file. For example, with a mention like this:

{chain1:partofspeech="noun",function="subject" John} ...

two columns will be added to the dataframe/CSV: partofspeech and function.
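Assuming the dataframes built with convert_sacr_files_to_dataframes as above, these property columns can then be inspected directly (a sketch; the available columns depend on your annotation scheme):

# partofspeech and function come from the mention properties above
print(dfs.text_mentions[["partofspeech", "function"]].head())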

You can add metadata to each text by adding them like this:

#textid:The Raven

#textmetadata:type=literature
#textmetadata:century=19
#textmetadata:author=Edgar Poe

Once upon a midnight dreary, while I pondered, weak and weary,
Over many a quaint and curious volume of forgotten lore, 
...

Here, the following columns will be added to the texts dataframe/CSV: type, century, and author.

You can add as many metadata fields as you want. If a text is missing a field, None is recorded.

The dataframes/CSV files all have an index, which acts as an id. The dataframes are related by ids, as in a relational database. For example, mentions have a chain_id column that matches the row index of the chains dataframe/CSV. This means that you can import the CSV files into an SQL relational database, for example. You can also join dataframes. For example, if you want to associate the type of the text (literature, science, politics, etc., recorded as text metadata) with each mention, you just perform a join:

joined = dfs.text_mentions.join(dfs.texts, on="text_id", lsuffix="_mention")
joined = joined[["chain_name", "function", "work"]]

You will get something like:

    chain_name  function     work
0   Peasant     s subject    literature
1   Peasant     o object     literature
2   Peasant     o object     literature
3   Peasant     o object     literature
4   M18         a adverbial  science
...

You can also use the CSV files to analyse the corpus with Excel. See an example here, starting at slide 67.

List of tables and columns:

Example of a notebook

(Find the notebook here or the html export here)

You will find a sample of a notebook in the docs directory. Here are some highlights:

Once you have loaded the files and gotten the dataframes as described above, you can perform the usual operations on dataframes, like selecting some rows, as here to show singletons (chains of exactly one mention):
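A sketch of what that selection might look like (the name of the chain-size column is an assumption here; check dfs.text_chains.columns for the actual name):

# Select chains with exactly one mention (singletons).
# The "size" column name is assumed, not guaranteed.
singletons = dfs.text_chains[dfs.text_chains["size"] == 1]
print(singletons.head())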

You can use matplotlib to easily draw graphs (or seaborn, etc.), as here for the part of speech of the first mention of each chain:

or the distribution of sentence lengths:

As mentioned earlier, you can use the dataframes like a relational database, and use the join (or merge) function of pandas, for example to add the type of work (literature, politics, science), recorded as text metadata, to each mention:

And then use a pivot table:

and draw a graph:
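Here is a sketch of those two steps, reusing the joined dataframe from the earlier example (the actual column values depend on your annotations):

import matplotlib.pyplot as plt

# Count mentions per work type and grammatical function, then plot.
pivot = joined.pivot_table(index="work", columns="function",
                           aggfunc="size", fill_value=0)
pivot.plot(kind="bar")
plt.show()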

Convert a SACR file to a pair of GLOZZ files

The script sacr2glozz.pl is a Perl script to convert one SACR file to a pair of GLOZZ files, one .ac containing the text, and one .aa containing the annotations (in XML format).

The basic usage is as follows:

perl sacr2glozz.pl file.sacr glozzfilename

where glozzfilename is the file name of the two GLOZZ files (the extensions .ac and .aa are added automatically).

There are other options:

-m --min VALUE The minimum length of a chain.  If -e AND -p are set, then
               the chains with fewer links have the value specified in -e.
               Otherwise, they are excluded.
               Default is 0 (all links are included).
-e VALUE       Put VALUE in the PROP_NAME property (if the -p option is
               used) for chains with fewer links than -m. (E.g. "" or "SI"
               for SIngleton.)
-p PROP_NAME   Include a property PROP_NAME with the name of the referent.
               If empty string, don't use.
-s --schema    Include schemata.
-K             Don't keep comments.
-e             Explode head property into 'headpos' and 'headstring'.
-f REFNAME     Include only REFNAME (this option can be repeated).
--model        Build a Glozz annotation model (.aam).
--link-name VAL Name of the link (like 'link', 'mention', 'markable', etc.).
               Default is 'MENTION'.

Because it's a Perl file, if you need any assistance, please send me a message!

For this script to run, you will need to install XML::Simple. You can do it via CPAN, or if you are on a Debian based distro (like Ubuntu), then you can run sudo apt install libxml-simple-perl.

Convert a pair of GLOZZ files to a SACR file

The script glozz2sacr.pl is a Perl script to convert a pair of GLOZZ files, one .ac containing the text and one .aa containing the annotations (in XML format), to a SACR file.

The basic usage is as follows:

perl glozz2sacr.pl glozzfile.aa out.sacr

where glozzfile.aa is one of the two GLOZZ files (you could also have given the .ac file). This assumes that both GLOZZ files (.ac and .aa) have the same base name (for example, abc.aa and abc.ac).

There are other options:

--ref-field    Name of the field where the referent is stored (REF, refname,
               etc.). Default is REF.
--unit-type    Type of the unit (maillon, MENTION, etc.). Default is MENTION.
--reset        Get a new name for each referent (useful if the name used in
               the Glozz file contains non-standard characters).

Because it's a Perl file, if you need any assistance, please send me a message!

For this script to run, you will need to install XML::Simple. You can do it via CPAN, or if you are on a Debian based distro (like Ubuntu), then you can run sudo apt install libxml-simple-perl.

Main formats used in automatic coreference resolution

CoNLL format

The CoNLL format is a tabular format: each token is on a separate line, and the annotations for the token are in separate columns. Document boundaries are indicated by specific marks, and sentences are separated by a blank line.

Here is an example:

#begin document <name of the document>
1            Les             le     DET
2         singes          singe    NOUN
3           sont           être     AUX
4            des             un     DET
5     mammifères      mammifère    NOUN
...

1           Bien           bien     ADV
2            que            que   SCONJ
3           leur            son     DET
4   ressemblance   ressemblance    NOUN
5           avec           avec     ADP
6             l'             le     DET
7          Homme          homme    NOUN
...
#end document

The column separator (spaces or tabs) and the number and content of the columns vary according to the specification (CoNLL-2012, CoNLL-U, CoNLL-X, etc.).
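Reading such a file is mostly a matter of splitting on blank lines and then on columns. Here is a minimal sketch, assuming whitespace-separated columns and skipping the #begin/#end markers:

# A minimal sketch of reading a CoNLL-like file into sentences of rows.
def read_conll(path):
    sentences, current = [], []
    with open(path) as f:
        for line in f:
            line = line.rstrip("\n")
            if line.startswith("#"):  # document boundary markers
                continue
            if not line.strip():  # blank line = sentence boundary
                if current:
                    sentences.append(current)
                    current = []
            else:
                current.append(line.split())  # one row of columns per token
    if current:
        sentences.append(current)
    return sentences

sentences = read_conll("testing/singe.conll")
print(sentences[0][0])  # columns of the first token of the first sentence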

Jsonlines format

The jsonlines format stores data for several texts (a corpus). Each line is a valid JSON document, as follows:

{
  "clusters": [],
  "doc_key": "nw:docname",
  "sentences": [["This", "is", "the", "first", "sentence", "."],
                ["This", "is", "the", "second", "."]],
  "speakers":  [["spk1", "spk1", "spk1", "spk1", "spk1", "spk1"],
                ["spk2", "spk2", "spk2", "spk2", "spk2"]]
  "pos":       [["DET", "V", "DET", "ADJ", "NOUN", "PUNCT"],
                ["DET", "V", "DET", "ADJ", "PUNCT"]],
  ...
}
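Since each line is a complete JSON document, a whole corpus can be read with the standard library alone. A minimal sketch (corpus.jsonlines is a placeholder name):

import json

# Each line of a jsonlines file is one document of the corpus.
with open("corpus.jsonlines") as f:
    for line in f:
        doc = json.loads(line)
        print(doc["doc_key"], len(doc["sentences"]), "sentences")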

It is used by some coreference resolution systems, such as cofr (mentioned above).

Brat format

Brat offers a standoff annotation format that stores the text and the annotations separately.

Here is an example of the annotations, usually saved in an .ann file:

T1      Person 0 9      A Peasant
T2      Animal 16 43    an Eagle captured in a trap
T3      Object 37 43    a trap
T4      Animal 62 70    the bird
R1      Coreference Arg1:T2 Arg2:T4
T5      Person 76 79    him
R2      Coreference Arg1:T1 Arg2:T5

The text file is just a plain text file (.txt).
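Here is a minimal sketch of reading the text-bound (T) and relation (R) lines of such an .ann file (Brat separates the id, the type and offsets, and the text with tabs):

# Parse Brat text-bound (T...) and relation (R...) annotations.
mentions, relations = {}, []
with open("aesop.sacr.ann") as f:
    for line in f:
        ann_id, body = line.rstrip("\n").split("\t", 1)
        if ann_id.startswith("T"):
            info, text = body.split("\t")
            ann_type, start, end = info.split()
            mentions[ann_id] = (ann_type, int(start), int(end), text)
        elif ann_id.startswith("R"):
            rel_type, arg1, arg2 = body.split()
            relations.append((rel_type, arg1.split(":")[1], arg2.split(":")[1]))

print(mentions["T1"])  # ('Person', 0, 9, 'A Peasant')
print(relations)  # [('Coreference', 'T2', 'T4'), ('Coreference', 'T1', 'T5')]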

You can find more here.

Glozz format

Glozz is (was?) an annotation platform (on which you can find more here) that uses a URS model (Units, Relations, Schemas, which correspond to Mentions, Relations, Chains). Annotations are standoff and are stored in two files: a text file (.ac) and an XML file (.aa).
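Here is a sketch of what a single unit may look like in the .aa file (an assumption based on the URS model; the MENTION unit type and the REF feature match the defaults used by the scripts above, but the exact structure may vary with your Glozz version):

<!-- A sketch only: element names follow the Glozz URS format;
     your files may differ. -->
<annotations>
  <unit id="1">
    <characterisation>
      <type>MENTION</type>
      <featureSet>
        <feature name="REF">Peasant</feature>
      </featureSet>
    </characterisation>
    <positioning>
      <start><singlePosition index="0"/></start>
      <end><singlePosition index="9"/></end>
    </positioning>
  </unit>
</annotations>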

Code quality

Newer additions are type-checked with mypy and tested with pytest. For example: sacr_parser2.py, annotable.py, sacr2ann.py and sacr2df.py.

flake8 and isort are also used. You may want to run make check-lint or make lint to lint the code (older scripts are not covered: they are neither type-checked nor linted). make test will run the tests.

License

All the scripts are distributed under the terms of the Mozilla Public License 2.0.