delph-in / pydelphin

Python libraries for DELPH-IN
https://pydelphin.readthedocs.io/
MIT License

Upper/lower case not normalized when encoding/decoding DMRX #333

Closed goodmami closed 2 years ago

goodmami commented 3 years ago

Some things in *MRS are considered case-insensitive, such as predicates, morphosemantic property names and values, and variables. XML, however, is case-sensitive, and the dmrx codec currently outputs property names upper-cased. Python is also case-sensitive, so PyDelphin normalizes case following the SimpleMRS conventions (variables, predicates, and property values down-cased; property names up-cased).

>>> from delphin.codecs import simplemrs
>>> m = simplemrs.decode('[ TOP: h0 RELS: < [ _RAIN_v_1 LBL: h1 ARG0: E2 [ e tense: PAST ] ] > HCONS: < h0 qeq h1 > ]')
>>> m.rels[0].predicate
'_rain_v_1'
>>> m.properties('e2')
{'TENSE': 'past'}

These conventions persist in the internal DMRS representation upon conversion, which is fine:

>>> from delphin import dmrs
>>> d = dmrs.from_mrs(m)
>>> d.properties(10000)
{'TENSE': 'past'}

But they should not persist in the serialization to XML, where the upper-cased property names do not follow the DTD:

>>> from delphin.codecs import dmrx
>>> dmrx.encode(d)
'<dmrs cfrom="-1" cto="-1" top="10000"><node nodeid="10000" cfrom="-1" cto="-1"><realpred lemma="rain" pos="v" sense="1" /><sortinfo TENSE="past" cvarsort="e" /></node></dmrs>'

Similarly, case is not normalized when decoding DMRX, unlike with SimpleMRS:

>>> d = dmrx.decode('<dmrs cfrom="-1" cto="-1" top="10000"><node nodeid="10000" cfrom="-1" cto="-1"><realpred lemma="RAIN" pos="v" sense="1" /><sortinfo tense="PAST" cvarsort="E" /></node></dmrs>')
>>> d.nodes[0].predicate
'_RAIN_v_1'
>>> d.nodes[0].type
'E'
>>> d.nodes[0].properties
{'tense': 'PAST'}
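
For comparison, decode-side normalization following the SimpleMRS conventions could look roughly like the sketch below. `normalize_sortinfo` is a hypothetical helper, not an existing PyDelphin function, and it only assumes that `cvarsort` carries the variable type while every other `<sortinfo>` attribute is a property:

```python
import xml.etree.ElementTree as ET


def normalize_sortinfo(attrib):
    """Split <sortinfo> attributes into (type, properties), normalizing case.

    Hypothetical helper for illustration only; not part of PyDelphin.
    """
    type_ = None
    props = {}
    for name, value in attrib.items():
        if name.lower() == 'cvarsort':
            # the variable type is down-cased, like SimpleMRS variables
            type_ = value.lower()
        else:
            # property names up-cased, property values down-cased
            props[name.upper()] = value.lower()
    return type_, props


elem = ET.fromstring('<sortinfo tense="PAST" cvarsort="E" />')
print(normalize_sortinfo(elem.attrib))  # ('e', {'TENSE': 'past'})
```

The predicate would need the same treatment (down-casing `lemma`, `pos`, and `sense` before assembling the predicate string), which is left out here.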

This issue is mainly about DMRX, since PyDelphin is outputting data that doesn't comply with the DTD, but it also affects other codecs.
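
On the encoding side, one possible shape of a fix is to down-case the property names just before they are written as XML attributes. This is a minimal sketch using the standard library, not the actual dmrx codec; `sortinfo_element` is a hypothetical name, and it assumes property values are already down-cased internally, so only the names are changed:

```python
import xml.etree.ElementTree as ET


def sortinfo_element(type_, properties):
    """Build a <sortinfo> element with lower-cased attribute names.

    Hypothetical helper for illustration only; not part of PyDelphin.
    """
    # internal property names are up-cased (e.g. 'TENSE'); down-case for XML
    attrib = {name.lower(): value for name, value in properties.items()}
    if type_ is not None:
        attrib['cvarsort'] = type_
    return ET.Element('sortinfo', attrib)


elem = sortinfo_element('e', {'TENSE': 'past'})
# the serialized element now has 'tense' rather than 'TENSE' as the attribute
print(ET.tostring(elem, encoding='unicode'))
```

Keeping the up-cased names in the internal representation and converting only at the XML boundary means conversion to and from MRS would not need to change.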