xml2ccg

shoeffner commented 5 years ago

Note: This PR relies on #18 and #19 and thus contains the same commits as well. Once those are merged, it will be slightly smaller. I can also rebase/squash etc. for a shorter history.

This PR introduces a script xml2ccg, which is roughly the inverse of ccg2xml. Since the recommended way to edit grammars is not fiddling around with xml files but with a ccg file, the tool should only be seen as a one-off generator of a lost ccg file.

I am looking forward to your review and feedback!

Changelog

Features

xml2ccg.py: a new script to create a ccg file from a directory containing the appropriate grammar xml files. It comes with the same xml2ccg and xml2ccg.bat convenience scripts as ccg2xml. Just like ccg2xml.py, it is copied to the bin directory using the ccg-build process. However, it is not auto-generated.

xml2ccg is tested as follows:

1. Each available ccg grammar (arabic.ccg, tiny.ccg, tinytiny.ccg, grammar_template.ccg, inherit.ccg) was converted to its xml counterpart and put into the test/ccg2xml directory.
   1a. An additional hand-crafted grammar (diaspace, LGPL 2.1+) is used, although no original ccg files exist anymore.
2. test_xml2ccg.py generates a temporary ccg file from each xml directory and then generates a new xml directory from that temporary ccg file.
3. The original xml directory and the new xml directory are compared (except for the properties of the root elements and the grammar.xml's file attributes).
While this works well for all ccg2xml generated grammars, for the hand-crafted grammar a few looser rules are needed:
- The newly generated grammar is allowed to have more entries. It is possible that some implicit macro definitions were not explicitly written by hand, while the ccg2xml generator adds those. This is especially the case for the types.xml, which lists all macro types explicitly when generated via ccg2xml, while the hand-crafted variant only contains ontology types.
- The ccg2xml tool has a few small inconsistencies with the documentation in tiny.ccg for handling certain situations, especially how case<0>: acc0:p-case; is converted. According to tiny.ccg it should become:
```
<macro name="@Acc0">
<fs id="0" attr="case" val="p-case"/>
</macro>
```
  but instead becomes
```
<macro name="@Acc0">
<fs id="0">
<feat attr="case" val="p-case" />
</fs>
</macro>
```
  These two variants, however, carry the same information. So for the final comparison in the test, a macro/fs/feat element with a val != None is treated in the same way as a macro/fs element carrying that attr and val directly.
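The looser comparison rule for macro/fs/feat can be sketched roughly as follows. This is only an illustration, not the code in test_xml2ccg.py; the helper names `normalize_macro_fs` and `canonical` are made up for this example.

```python
import xml.etree.ElementTree as ET


def normalize_macro_fs(macro):
    """Collapse <fs id><feat attr val/></fs> into <fs id attr val/>, in place.

    Hypothetical helper: only the simple case (a single <feat> child that
    carries both attr and val) is collapsed; anything else is left alone.
    """
    for fs in macro.findall("fs"):
        children = list(fs)
        if len(children) != 1 or children[0].tag != "feat":
            continue
        feat = children[0]
        if feat.get("attr") is None or feat.get("val") is None:
            continue
        fs.set("attr", feat.get("attr"))
        fs.set("val", feat.get("val"))
        fs.remove(feat)
    return macro


def canonical(elem):
    """Return a comparable form that ignores whitespace and attribute order."""
    return (
        elem.tag,
        tuple(sorted(elem.attrib.items())),
        (elem.text or "").strip(),
        tuple(canonical(child) for child in elem),
    )


documented = ET.fromstring(
    '<macro name="@Acc0"><fs id="0" attr="case" val="p-case"/></macro>'
)
generated = ET.fromstring(
    '<macro name="@Acc0">\n<fs id="0">\n'
    '<feat attr="case" val="p-case" />\n</fs>\n</macro>'
)
same = canonical(normalize_macro_fs(generated)) == canonical(
    normalize_macro_fs(documented)
)
```

With both variants normalized this way, `same` is true for the `@Acc0` example above, while a feat without a val (for example one wrapping a featvar) is left untouched.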

Fixes & smaller changes

- Multiple entries with similar names were discarded by ccg2xml because some complex xml structures were only shallow-copied. These copies are now deep copies (ccg.ply:785, ccg.ply:1881).
- warning_count was not defined or properly used and was thus removed (ccg.ply).
- The executable file permission is removed from various files (ccg.ply, README, arabic.ccg).
- Indentation and whitespace in build.xml and src/ccg2xml/build.xml are streamlined.
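For context on the shallow-copy fix, here is a small, self-contained illustration of the general failure mode. It does not reproduce the actual code in ccg.ply; the dictionary layout and the `make_entries` helper are invented for the example.

```python
import copy


def make_entries(values, copier):
    """Build one entry per (name, val) pair from a shared nested template."""
    # Hypothetical template: a nested structure similar in spirit to an
    # xml element with children.
    template = {"name": None, "children": [{"attr": "case", "val": None}]}
    entries = []
    for name, val in values:
        entry = copier(template)
        entry["name"] = name
        # This mutates nested data. With a shallow copy the inner list and
        # dict are shared by all entries, so earlier values get overwritten.
        entry["children"][0]["val"] = val
        entries.append(entry)
    return entries


pairs = [("a", "nom"), ("b", "acc")]
shallow = make_entries(pairs, copy.copy)
deep = make_entries(pairs, copy.deepcopy)

shallow_vals = [e["children"][0]["val"] for e in shallow]
deep_vals = [e["children"][0]["val"] for e in deep]
```

With `copy.copy`, both entries end up sharing the last value written, whereas `copy.deepcopy` keeps each entry's nested data separate.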

Caveats

ccg2xml ignores macro names when generating xml files but instead uses the entity names prefixed with an @ for macro names. Thus, an entry MACRO<NOMVAR:MODE>: NAME; results in
```
<macro name="@NAME">
<lf>
  <satop nomvar="NOMVAR">
    <diamond mode="MODE">
      <prop name="NAME" />
    </diamond>
  </satop>
</lf>
</macro>
```
instead of using macro name="@MACRO". Thus, in a hand-crafted xml where macro names differ from prop names, the information is converted "properly" to ccg but lost on the conversion back, leading to some strange errors. The only solution to this problem is to change the xml files beforehand, so that the prop names and macro names are already the same (and unique).
The grammar.xml's content is largely ignored: the script assumes all files to be in the same directory instead of following the paths inside grammar.xml.
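To find macros affected by the naming caveat before converting, something like the following check could help. It is a sketch only and not part of this PR; `mismatched_macros` is a hypothetical helper, and it assumes the macro/lf/satop/diamond/prop nesting shown above.

```python
import xml.etree.ElementTree as ET


def mismatched_macros(root):
    """List (macro name, prop name) pairs where name != "@" + prop name."""
    mismatches = []
    for macro in root.iter("macro"):
        # Assumes the nesting from the example above; macros without such
        # a prop (e.g. feature-structure macros) are skipped.
        prop = macro.find("./lf/satop/diamond/prop")
        if prop is None:
            continue
        if macro.get("name") != "@" + prop.get("name", ""):
            mismatches.append((macro.get("name"), prop.get("name")))
    return mismatches


sample = ET.fromstring(
    """
    <morph>
      <macro name="@NAME">
        <lf><satop nomvar="X"><diamond mode="m"><prop name="NAME"/></diamond></satop></lf>
      </macro>
      <macro name="@other">
        <lf><satop nomvar="X"><diamond mode="m"><prop name="different"/></diamond></satop></lf>
      </macro>
    </morph>
    """
)
problems = mismatched_macros(sample)
```

Any pair reported here would need to be renamed by hand in the xml before a round trip through xml2ccg and ccg2xml can preserve it.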

shoeffner commented 5 years ago

With commits 66aa180 / 2decc3b I identified a couple of problems with the test suite which allowed some wrongly translated xml files to slip through (below are only the important excerpts). I am currently working on a fix for these parses.

arabic morph:

Original:
<fs id="2">
    <feat attr="PERS" val="1st"/>
</fs>

Generated:
<fs attr="PERS" id="2" val="1st"></fs>

arabic lexicon:

Original:
<feat attr="lex" val="[*DEFAULT*]"/>

Generated:
<feat attr="lex" val="*"/>

diaspace lexicon:

Original:
<feat attr="num">
    <featvar name="NUM"/>
</feat>

Generated:
<feat attr="NUM">
    <featvar name="NUM"/>
</feat>

diaspace rules:

Original:
<typeraising dir="forward" useDollar="false">
    <arg>
        <atomcat type="pper"/>
    </arg>
</typeraising>

Generated:
<typeraising dir="forward" useDollar="false"/>

inherit lexicon:

Original:
<feat attr="index">
    <lf>
        <nomvar name="E"/>
    </lf>
</feat>

Generated:
// nothing (other feats are processed, but feats containing lf not)

tiny has multiple of the above issues but no new issues.

shoeffner commented 5 years ago

The only remaining problem with the diaspace grammar is now family entries which have features of the following type:

<feat attr="modality">
    <lf>
        <nomvar name="SM:gs-SpatialModality"/>
    </lf>
</feat>

These are currently parsed into [modality], so the information about the nomvar is lost. I am not sure whether modality should even be treated specially like the "index" features. If so, I guess it would also only work with a single uppercase letter as its name.

In either case, I don't know how to represent this in ccg so that it would generate the right output. Maybe the original xml grammar can be changed or this is something the ccg format does not support, while OpenCCG does.

shoeffner commented 5 years ago

Similarly to the above mentioned modality attributes, in the diaspace grammar there are a few index attributes with complex names:

<feat attr="index">
    <lf>
        <nomvar name="GL:gs-GeneralizedLocation"/>
    </lf>
</feat>

Since ccg2xml parses only (so it seems) single uppercase letters properly into index attributes, this index feature gets translated from its current ccg representation

[GL:gs-GeneralizedLocation]

into

<feat attr="GL">
    <featvar name="GL:gs-GeneralizedLocation"/>
</feat>

Is this a limitation of the ccg files? Or are those errors in the grammar which should not be possible in xml either?

These two issues seem ( :-) ) to be the remaining problems for xml2ccg. Do you have any ideas on how to progress with these?

mwhite14850 commented 5 years ago

This may be a limitation of what ccg2xml can parse. But in general the ability to support LF-valued features (beyond the special index feature) is an important part of the native XML grammar format (note that the .ccg format was designed for easier human authoring but was never exhaustively checked against what the native XML format supports). In the flights and comic grammars (under openccg/grammars), LF-valued features are used to propagate the info and owner features from the semantics to the syntax, in order to implement a version of Steedman's theory of communicative structure (theme/rheme and 'kontrast'), which is described in this article [http://aclweb.org/anthology/J10-2001.pdf]. One way to wrap up xml2ccg, of course, would be to emit warnings when a native XML grammar cannot be adequately translated to .ccg; another would be to try to make ccg2xml complete, but that option would not be for the faint of heart.
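A rough sketch of the warning option, under assumptions taken from this thread rather than from a confirmed ccg2xml specification: LF-valued features other than index are flagged, as are index features whose nomvar name is not a single uppercase letter. The function names are hypothetical.

```python
import re
import warnings
import xml.etree.ElementTree as ET

# Assumption from the discussion above: ccg2xml seems to handle only
# single uppercase letters as index nomvar names.
SIMPLE_INDEX = re.compile(r"^[A-Z]$")


def untranslatable_feats(root):
    """Collect messages for <feat> elements that may not survive a round trip."""
    problems = []
    for feat in root.iter("feat"):
        nomvar = feat.find("./lf/nomvar")
        if nomvar is None:
            continue  # not an LF-valued feature
        attr = feat.get("attr")
        name = nomvar.get("name", "")
        if attr != "index":
            problems.append(
                f"LF-valued feature '{attr}' (nomvar '{name}') may be lost"
            )
        elif not SIMPLE_INDEX.match(name):
            problems.append(
                f"index feature with complex nomvar '{name}' may be mistranslated"
            )
    return problems


def warn_untranslatable(root):
    """Emit one Python warning per problematic feature."""
    for message in untranslatable_feats(root):
        warnings.warn(message)


sample = ET.fromstring(
    """
    <family>
      <feat attr="index"><lf><nomvar name="E"/></lf></feat>
      <feat attr="modality"><lf><nomvar name="SM:gs-SpatialModality"/></lf></feat>
      <feat attr="index"><lf><nomvar name="GL:gs-GeneralizedLocation"/></lf></feat>
      <feat attr="num"><featvar name="NUM"/></feat>
    </family>
    """
)
found = untranslatable_feats(sample)
```

For the sample above, the modality feature and the index feature with the `GL:gs-GeneralizedLocation` nomvar are reported, while the plain `E` index and the featvar-valued feature are not.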

shoeffner commented 5 years ago

Thank you, I was already afraid that this would be the case. I will consider the options and see if I can find some time over the holidays to implement one or the other.

OpenCCG / openccg

xml2ccg #20
