Open shoeffner opened 5 years ago
With commit 66aa180 / 2decc3b I identified a couple of problems with the test suite which allowed some wrongly translated XML files to slip through (below are only the important excerpts). I am currently working on a fix for these parses.
arabic morph:
Original:
<fs id="2">
<feat attr="PERS" val="1st"/>
</fs>
Generated:
<fs attr="PERS" id="2" val="1st"></fs>
arabic lexicon:
Original:
<feat attr="lex" val="[*DEFAULT*]"/>
Generated:
<feat attr="lex" val="*"/>
diaspace lexicon:
Original:
<feat attr="num">
<featvar name="NUM"/>
</feat>
Generated:
<feat attr="NUM">
<featvar name="NUM"/>
</feat>
diaspace rules:
Original:
<typeraising dir="forward" useDollar="false">
<arg>
<atomcat type="pper"/>
</arg>
</typeraising>
Generated:
<typeraising dir="forward" useDollar="false"/>
inherit lexicon:
Original:
<feat attr="index">
<lf>
<nomvar name="E"/>
</lf>
</feat>
Generated:
// nothing (other feats are processed, but feats containing lf not)
tiny has multiple of the above issues but no new issues.
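A structural comparison is one way a test could catch mismatches like the ones listed above. The following is only a minimal sketch, not the check the test suite actually uses; the helper names `canonical` and `same_structure` are hypothetical. It compares tags, attributes, and child elements while ignoring attribute order and text content.

```python
import xml.etree.ElementTree as ET


def canonical(elem):
    """Return a nested tuple of tag, sorted attributes, and child elements."""
    return (
        elem.tag,
        tuple(sorted(elem.attrib.items())),
        tuple(canonical(child) for child in elem),
    )


def same_structure(original_xml, generated_xml):
    """Compare two XML snippets, ignoring attribute order and text content."""
    return canonical(ET.fromstring(original_xml)) == canonical(
        ET.fromstring(generated_xml)
    )


# The arabic morph excerpt: the feat child was folded into the fs element.
original = '<fs id="2"><feat attr="PERS" val="1st"/></fs>'
generated = '<fs attr="PERS" id="2" val="1st"></fs>'
print(same_structure(original, generated))  # False
```

Because child elements are part of the canonical form, the dropped `<arg>` in the diaspace rules excerpt would be flagged in the same way.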
The only remaining problem with the diaspace grammar is now family entries which have features of the following type:
<feat attr="modality">
<lf>
<nomvar name="SM:gs-SpatialModality"/>
</lf>
</feat>
These are currently parsed into [modality], so the information about the nomvar is lost. I am not sure whether modality should even be treated specially like the "index" features. If so, it could probably only work with a single uppercase letter as its name, I guess.
In either case, I don't know how to represent this in ccg so that it would generate the right output. Maybe the original xml grammar can be changed or this is something the ccg format does not support, while OpenCCG does.
Similar to the modality attributes mentioned above, the diaspace grammar has a few index attributes with complex names:
<feat attr="index">
<lf>
<nomvar name="GL:gs-GeneralizedLocation"/>
</lf>
</feat>
Since ccg2xml parses only (so it seems) single uppercase letters properly into index attributes, this index feature gets translated from its current ccg representation
[GL:gs-GeneralizedLocation]
into
<feat attr="GL">
<featvar name="GL:gs-GeneralizedLocation"/>
</feat>
Is this a limitation of the ccg files? Or are those errors in the grammar which should not be possible in xml either?
These two issues seem ( :-) ) to be the remaining problems for xml2ccg. Do you have any ideas on how to progress with these?
This may be a limitation of what ccg2xml can parse. But in general the ability to support LF-valued features (beyond the special index feature) is an important part of the native XML grammar format (note that the .ccg format was designed for easier human authoring but was never exhaustively checked against what the native XML format supports). In the flights and comic grammars (under openccg/grammars), LF-valued features are used to propagate the info and owner features from the semantics to the syntax, in order to implement a version of Steedman's theory of communicative structure (theme/rheme and 'kontrast'), which is described in this article [http://aclweb.org/anthology/J10-2001.pdf]. One way to wrap up xml2ccg, of course, would be to emit warnings when a native XML grammar cannot be adequately translated to .ccg; another would be to try to make ccg2xml complete, but that option would not be for the faint of heart.
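The warning option could look roughly like the sketch below. It is not part of xml2ccg; the function name `untranslatable_feats` is hypothetical, and the rule that only single uppercase letters work as index nomvar names is an assumption taken from the observations earlier in this thread.

```python
import re
import warnings
import xml.etree.ElementTree as ET

# Assumption from the thread: only single uppercase letters survive the
# round trip through .ccg as index nomvar names.
SIMPLE_NOMVAR = re.compile(r"[A-Z]")


def untranslatable_feats(root):
    """Yield (attr, nomvar name, reason) for feats that may not survive .ccg."""
    for feat in root.iter("feat"):
        lf = feat.find("lf")
        if lf is None:
            continue  # plain val or featvar features are not checked here
        attr = feat.get("attr")
        for nomvar in lf.iter("nomvar"):
            name = nomvar.get("name", "")
            if attr != "index":
                yield attr, name, "LF-valued feature other than index"
            elif not SIMPLE_NOMVAR.fullmatch(name):
                yield attr, name, "index nomvar is not a single uppercase letter"


sample = ET.fromstring(
    "<family>"
    '<feat attr="modality"><lf><nomvar name="SM:gs-SpatialModality"/></lf></feat>'
    '<feat attr="index"><lf><nomvar name="GL:gs-GeneralizedLocation"/></lf></feat>'
    '<feat attr="index"><lf><nomvar name="E"/></lf></feat>'
    "</family>"
)
for attr, name, reason in untranslatable_feats(sample):
    warnings.warn(f"feat attr={attr!r} nomvar={name!r}: {reason}")
```

With the sample above, the modality feature and the `GL:gs-GeneralizedLocation` index feature are reported, while the plain `E` index feature passes silently.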
Thank you, I was already afraid that this would be the case. I will consider the options and see if I can find some time over the holidays to implement one or the other.
Note: This PR relies on #18 and #19 and thus contains the same commits as well. Once those are merged, it will be slightly smaller. I can also rebase/squash etc. for a shorter history.
xml2ccg
This PR introduces a script xml2ccg, which is roughly the inverse of ccg2xml. Since the recommended way to edit grammars is not fiddling around with xml files but with a ccg file, the tool should only be seen as a one-off generator of a lost ccg file.
I am looking forward to your review and feedback!
Changelog
Features
xml2ccg.py: a new script to create a ccg file from a directory containing the appropriate grammar xml files. It comes with the same xml2ccg and xml2ccg.bat convenience scripts as ccg2xml. Just like ccg2xml.py, it is copied to the bin directory using the ccg-build process. However, it is not auto-generated.
xml2ccg is tested as follows:
case<0>: acc0:p-case;
is converted. According to tiny.ccg it should become: but instead becomes
These two variants, however, represent the same content in some way. So for the final comparison in the test, macro/fs/feat with a val != None is treated in the same way as macro/fs.
Fixes & smaller changes
Caveats
MACRO<NOMVAR:MODE>: NAME;
results in instead of using macro name="@MACRO". Thus, in a handcrafted xml where macro names are different from prop names, the information is converted "properly" to ccg, but lost on the conversion back, leading to some strange errors. The only solution to this problem is to change the xml files beforehand, so that the prop names and macro names are the same (and unique) already.
grammar.xml's content is largely ignored; the script assumes all files to be in the same directory instead of following the paths inside grammar.xml.
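For the last caveat, following the paths could be sketched as below. This is not what xml2ccg.py does; the helper `referenced_files` is hypothetical, and the sketch assumes that grammar.xml names its component files through a `file` attribute on the direct children of the root element.

```python
import os
import xml.etree.ElementTree as ET


def referenced_files(grammar_xml_text, base_dir):
    """Map each child tag of grammar.xml to the file path it points at.

    Assumption: component files are given as file="..." attributes,
    relative to the directory that contains grammar.xml.
    """
    root = ET.fromstring(grammar_xml_text)
    return {
        child.tag: os.path.normpath(os.path.join(base_dir, child.get("file")))
        for child in root
        if child.get("file")
    }


# Hypothetical grammar.xml content, used only for illustration.
example = (
    '<grammar name="example">'
    '<lexicon file="lexicon.xml"/>'
    '<rules file="sub/rules.xml"/>'
    "</grammar>"
)
for tag, path in referenced_files(example, "grammars/example").items():
    print(tag, path)
```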