delph-in / pydelphin

Python libraries for DELPH-IN
https://pydelphin.readthedocs.io/
MIT License
79 stars 27 forks source link

SimpleMRS codec fails when predicates contain colons #302

Closed goodmami closed 4 years ago

goodmami commented 4 years ago

The form of predicates is very permissive. From the PredicateRfc wiki page:

The lemma field of a surface pred may be just about anything that does not contain underscores or spaces.

The "just about" leaves room for interpretation. What I settled on before was a sequence of anything that wasn't a space, newline ], or a < followed by something that looks like a lnk value.

https://github.com/delph-in/pydelphin/blob/b94007341dbd0fd35f82c543d45cdab5dc04cc7f/delphin/codecs/simplemrs.py#L148-L149

This is tripping up on some predicates (that aren't great, but it's hard to say they are invalid, either) that have ]:

_+-]\?[/NN_u_unknown_rel"<12:18>

Also, if a colon appears in the predicate, it can look like a feature (e.g., LBL:), which breaks on things like urls, etc.:

_xml:tm/NN_u_unknown<57:63>

While it would be great to have a more restrictive syntax for these things, for now PyDelphin will need to deal with them somehow, even if it's just throwing some error about an invalid predicate form.

oepen commented 4 years ago

hi mike,

in general, i agree: those examples are not great, but at least some may be valid syntax. looking at the PredicateRfc page, we arguably still need to make those rules more specific. in general, i believe we have two notations for MRS predicates, (a) as C-like quoted strings and (b) as 'plain' (unquoted) identifiers (traditionally called 'symbols' in lisp).

in variant (a), i would think anything goes (within any additional constraints we might impose on the fields of surface predicates) and standard escape conventions apply: \" for embedded double quotes and \\ for embedded backslashes. i am pretty sure the MRSs from the pre-2020 ERG that you are reading are generated by @sweaglesw (if that were his M$ GitHub user), and i am wondering whether we could make the overall pipeline more robust by suggesting that FFTB serialize predicates as quoted strings and, in doing so, make sure that embedded quote marks and backslashes are escaped.

for variant (b), the rules seem a bit murkier. reading the PredicateRfc page literally, the following should all be valid surface predicates _:_c_1, "_v_2, <_p_3, or _[_x_4. because the page likens the sense field to the lemma field, one could also argue for the following: <_p_<, "_p_", or even <_p_angle<. for abstract predicates, our current PredicateRfc does not seem to impose any constraints, strictly speaking?

seeing as ERG predicates for unknown words are generated on the fly, and further taking into account that we want the lemma field of a surface predicate to correspond to the actual surface string, i would be sceptical of imposing too many constraints on valid MRS predicates. but, nevertheless, we need a syntax that is well defined and parseable with adequate effort. i believe banning a wider range of characters in type-(b) predicates and generally pushing toward broader use of type-(a) quoting and escaping might allow us to attain these goals?

goodmami commented 4 years ago

Thanks, Stephan,

I mostly agree with what you said. In general I think we need to consider 2 things: the allowed shape of predicates, and issues with serialization in any particular format. As this GitHub issue is specifically about SimpleMRS, I'll take the more general discussion to the message on the developers list. If you think my solution to this particular issue is not correct, we can open a new issue (as this one is already merged into a release) to fix it further.

oepen commented 4 years ago

just fyi: i have in the meantime updated the LKB reader and writer for simple MRS serializations to accept all characters except ", [, and < in unquoted predicates and arbitrary strings when quoted using standard C conventions. upon serialization, the writer will output a quoted string iff one of the above characters is present in the predicate value.

goodmami commented 4 years ago

Thanks, Stephan, that sounds reasonable. Two questions:

oepen commented 4 years ago

internally in the LKB (when predicate normalization is enabled, which has been the case for the ERG since 1214 and is now becoming the default in the stock LKB), all predicates are represented as strings. so serialization defaults to the unquoted format, because that seems more (human) reader-friendly to me. thus, reading and writing an MRS in the LKB will normalize the extend to which quoting is used, i.e. in practise can remove some unnecessary quotes.

regarding the range of allowed characters, you are right: my inclusion of [ betrays my intention to use the same quoting conventions for (native) EDS serialization and possibly also other (non-XML, non-JSON) MRS serializations, e.g. the indexed format. i am inclined to immediately add { to my list of characters unwelcome in unquoted predicates, and maybe ( should go on that list too?

alternatively, one could take the position that simple MRS has the most generous rules for unquoted predicates (banning only ", <, and whitespace), and other serializations add their own constraints (e.g. [ and { for EDS). i realize i ended up putting what should be a follow-up to the email thread you started into M$ GitHub ...

goodmami commented 4 years ago

Re: normalization, that sounds good to me

Re: allowed characters, I’m afraid I find these dual-purpose expedients harmful in the long run, leading to confusion and inconsistency. Instead, as I advocated in the thread, I think we should have a universal set of escaping conventions for predicates, namely that a literal _ is escaped to \s (or similar) and \ to \\. Each format (simplemrs, eds, prolog-mrs, indexed mrs, etc.) may then impose additional constraints as fits their own requirements. In this view, the quotes surrounding some predicates in simplemrs should never persist in an internal representation of the predicate, they are merely artifacts of simplemrs encoding.

But yes, maybe we should return to the email thread :) On Aug 12, 2020, 6:29 PM +0800, Stephan Oepen , wrote:

internally in the LKB (when predicate normalization is enabled, which has been the case for the ERG since 1214 and is now becoming the default in the stock LKB), all predicates are represented as strings. so serialization defaults to the unquoted format, because that seems more (human) reader-friendly to me. thus, reading and writing an MRS in the LKB will normalize the extend to which quoting is used, i.e. in practise can remove some unnecessary quotes. regarding the range of allowed characters, you are right: my inclusion of [ betrays my intention to use the same quoting conventions for (native) EDS serialization and possibly also other (non-XML, non-JSON) MRS serializations, e.g. the indexed format. i am inclined to immediately add { to my list of characters unwelcome in unquoted predicates, and maybe ( should go on that list too? alternatively, one could take the position that simple MRS has the most generous rules for unquoted predicates (banning only ", <, and whitespace), and other serializations add their own constraints (e.g. [ and { for EDS). i realize i ended up putting what should be a follow-up to the email thread you started into M$ GitHub ... — You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub, or unsubscribe.