parsing MRS - Githubissues

arademaker commented 9 months ago

Variable names are defined as: https://github.com/delph-in/pydelphin/blob/develop/delphin/variable.py#L22

They are post-processed after being parsed as strings. But https://github.com/delph-in/docs/wiki/MrsRFC#simple says

Variable := /[A-Za-z][-A-Za-z]*\d+/

That is, per the RFC, variables can't start with -. Also, without the ASCII flag in re.compile, the \w class accepts Unicode word characters, according to https://docs.python.org/3/library/re.html.
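
To make the two differences concrete, here is a minimal sketch. The first pattern is the one from the MrsRFC page; the `\w`-based pattern is a hypothetical stand-in I wrote for illustration, not a copy of what is in delphin/variable.py, and `check` is a made-up helper.

```python
import re

# Pattern as given on the MrsRFC wiki page.
rfc_var = re.compile(r"[A-Za-z][-A-Za-z]*\d+")

# Hypothetical \w-based pattern (an assumption for illustration only,
# not copied from PyDelphin), compiled with and without re.ASCII.
loose_unicode = re.compile(r"[-\w]+\d+")
loose_ascii = re.compile(r"[-\w]+\d+", re.ASCII)


def check(candidate: str) -> tuple:
    """Return (rfc, loose_unicode, loose_ascii) full-match results."""
    return (
        bool(rfc_var.fullmatch(candidate)),
        bool(loose_unicode.fullmatch(candidate)),
        bool(loose_ascii.fullmatch(candidate)),
    )
```

With these patterns, `check("-x1")` is accepted only by the two loose patterns, and a name starting with a non-ASCII letter is accepted only by the pattern compiled without `re.ASCII`.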

goodmami commented 9 months ago

I see the difference, and I don't recall if that was intentional. However, SimpleMRS is just one serialization format of MRS (e.g., the XML format of MRS has no such constraints, and the JSON format has the pattern ^([-a-zA-Z0-9_]*[-a-zA-Z_])([0-9]+)$). The LKB's basemrs.lisp has a read-mrs-var function that calls read-mrs-atom which calls the Lisp read function, whatever that does. Basically, there's not a lot of consistency, and PyDelphin tries to be a bit flexible to handle some of that inconsistency from various processors. PyDelphin's SimpleMRS codec, for instance, accepts any sequence of characters that is not a space, newline, ] or <.

So my question is: does this discrepancy cause a problem in your processing? If not, I'm not particularly inclined to change things. However, the first priority of PyDelphin's development philosophy is:

Implementations are correct and grammar-agnostic

So if we have a notion of what is correct for the syntax of MRS variables encompassing all serialization formats, processors, and grammars, then I would accept a pull request to make that change.

arademaker commented 9 months ago

From https://delphinqa.ling.washington.edu/t/mrs-dmrs-not-displayed-in-ltdb/986/6

The quotedstring in the MRS grammar is defined in https://github.com/delph-in/docs/wiki/MrsRFC#simple as:

QuotedString := /"[^"\\]*(?:\\.[^"\\]*)*"/

Is this the expected behavior? See "named_rel" and "named rel" below: the double quotes are removed in both cases, and with the _ the _rel part of the string is ignored.

>>> s = """[ LTOP: h0 INDEX: e2 [ e SF: prop TENSE: untensed MOOD: indicative ] RELS: < [ "named_rel"<-1:-1> LBL: h4 CARG: "pitágoras" ARG0: x3 ARG1: u9 ]  [ "_ladrar_v_rel"<-1:-1> LBL: h1 ARG0: e2 ARG1: x3 ] > HCONS: < h0 qeq h1 > ]"""
>>> m = simplemrs.loads(s); print(simplemrs.dumps(m))
[ TOP: h0 INDEX: e2 [ e SF: prop TENSE: untensed MOOD: indicative ] RELS: < [ named<-1:-1> LBL: h4 ARG0: x3 ARG1: u9 CARG: "pitágoras" ] [ _ladrar_v<-1:-1> LBL: h1 ARG0: e2 ARG1: x3 ] > HCONS: < h0 qeq h1 > ]

>>> s = """[ LTOP: h0 INDEX: e2 [ e SF: prop TENSE: untensed MOOD: indicative ] RELS: < [ "named rel"<-1:-1> LBL: h4 CARG: "pitágoras" ARG0: x3 ARG1: u9 ]  [ "_ladrar_v_rel"<-1:-1> LBL: h1 ARG0: e2 ARG1: x3 ] > HCONS: < h0 qeq h1 > ]"""
>>> m = simplemrs.loads(s); print(simplemrs.dumps(m))
[ TOP: h0 INDEX: e2 [ e SF: prop TENSE: untensed MOOD: indicative ] RELS: < [ named rel<-1:-1> LBL: h4 ARG0: x3 ARG1: u9 CARG: "pitágoras" ] [ _ladrar_v<-1:-1> LBL: h1 ARG0: e2 ARG1: x3 ] > HCONS: < h0 qeq h1 > ]

goodmami commented 9 months ago

Perhaps you'd be interested in this thread from the old mailing list about predicate names: http://lists.delph-in.net/archives/developers/2015/002154.html

Note that some replies spill over into the new year, so they aren't properly threaded with the above: http://lists.delph-in.net/archives/developers/2016/thread.html

The use of QuotedString for predicates is intentional. Predicates have their own syntax (see https://github.com/delph-in/docs/wiki/PredicateRfc), so all SimpleMRS tries to do is capture the symbol or string in its entirety and leave the predicate interpretation for later. Even MrsRFC's SimpleMRS grammar's TypePred production allows for non-conforming predicates, such as those with fewer or more than 3 fields and with the namespace suffix _rel. To be honest, though, I think the MrsRFC could use further revision as we did not go far enough in separating predicate syntax from SimpleMRS syntax.

Back to the quoted string predicates: we don't enforce any real restrictions on the quoted string, and escaped quotes are fine. Note that "named_rel" and named are considered equivalent (the _rel suffix is normalized away on parsing):

>>> m = simplemrs.decode('[RELS: < [ "named_rel" LBL: h1 ARG0: e2 ] [ named LBL: h3 ARG0: e4 ] >]')
>>> m.rels[0].predicate == m.rels[1].predicate
True
>>> m.rels[0].predicate
'named'

"named rel", however, is not equivalent to the above because the suffix is not _rel with an underscore, so it is not identified as the conventional namespace suffix:

>>> m = simplemrs.decode('[RELS: < [ "named rel" LBL: h1 ARG0: e2 ] [ named LBL: h3 ARG0: e4 ] >]')
>>> m.rels[0].predicate == m.rels[1].predicate
False
>>> m.rels[0].predicate
'named rel'
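
The behavior in the two sessions above can be approximated with a few lines of plain Python. This is only a sketch of the idea: `normalize_pred` is a hypothetical helper, not PyDelphin's implementation, and the real code may do more than this.

```python
def normalize_pred(raw: str) -> str:
    """Strip surrounding double quotes, then a trailing _rel suffix.

    Hypothetical helper for illustration; not PyDelphin's own code.
    """
    pred = raw
    # Quotes only mark the predicate as a string; they are not part of it.
    if len(pred) >= 2 and pred.startswith('"') and pred.endswith('"'):
        pred = pred[1:-1]
    # Only the exact suffix "_rel" (with the underscore) is dropped.
    if pred.endswith("_rel"):
        pred = pred[: -len("_rel")]
    return pred
```

Under this sketch, `"named_rel"` and `named` both come out as `named`, while `"named rel"` keeps its space and its `rel`, because ` rel` is not the `_rel` suffix.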

PyDelphin should only be outputting forms that it can parse again, so your second example indicates a bug in serialization. If there's a space in the predicate name, it should probably be quoted. See how the following cannot be parsed again:

>>> m = simplemrs.decode('[RELS: < [ named rel LBL: h1 ARG0: e2 ] >]')
Traceback (most recent call last):
  [...]
delphin.mrs._exceptions.MRSSyntaxError: 
  line 1, character 17
    [RELS: < [ named rel LBL: h1 ARG0: e2 ] >]
                     ^
MRSSyntaxError: expected: a feature
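
One way to keep the output re-parseable is to quote a predicate on output whenever it contains a character that would otherwise end the token. The sketch below assumes that whitespace and double quotes are the only such characters, which is a simplification made for illustration; `format_pred` is a hypothetical name, not a PyDelphin function.

```python
import re

# Assumption for this sketch: whitespace or a double quote forces quoting.
_NEEDS_QUOTES = re.compile(r'[\s"]')


def format_pred(pred: str) -> str:
    """Return pred as-is, or as a double-quoted string if it needs quoting."""
    if _NEEDS_QUOTES.search(pred):
        # Escape backslashes first, then double quotes.
        escaped = pred.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'
    return pred
```

The quoted output is meant to satisfy the QuotedString pattern from the MrsRFC, so a reader following that grammar should accept it again.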

arademaker commented 9 months ago

Thanks for the references.

So from https://github.com/delph-in/docs/wiki/PredicateRfc#type-vs-string and https://github.com/delph-in/docs/wiki/PredicateRfc#limitations-and-conventions

string vs. type means how the predicate was specified in the grammar
the double quotes are not related to escaping spaces (or other space-like characters) in the name

But I also found

Spaces may not occur in a predicate, except for possibly escaped or non-breaking spaces, but these usages should be discouraged.

A TypePred could never have a space in the name, right? If so, how do you escape the spaces? The StringPred, on the other hand, can have spaces, but grammar writers should avoid it.

Actually, the current WSJ profiles from ERG do not seem to contain StringPred

% for f in ws*; do gzcat $f/result.gz | awk -F "@" '{print $14}' | rg "\"<"; done

goodmami commented 9 months ago

string vs. type means how the predicate was specified in the grammar

Essentially, yes. In MRS-land there is no distinction between "_dog_n_1" and _dog_n_1. In the grammar, using a string predicate means you don't have to place the predicate in a hierarchy; it just inherits from *string*. In practice, you could use a string predicate in MRS to use characters that wouldn't be allowed in a regular symbol, such as spaces, but it's probably best to not use it for that purpose.

the double quotes are not related to escaping spaces (or other space-like characters) in the name

Not sure what you mean by this. There are more limits as to what characters a non-string (type) predicate can contain because of parsing concerns. The current grammar on MrsRFC does not allow, for instance, non-breaking spaces, although I think I recall Stephan wanting to allow them...

A TypePred could never have a space in the name, right? If so, how do you escape the spaces? The StringPred, on the other hand, can have spaces, but grammar writers should avoid it.

We don't currently allow escaped characters outside of strings, and spaces delimit tokens, so string preds are the only way to put spaces (non-breaking or not) inside a predicate symbol. I would like to think that most grammar engineers would choose to not represent spaces in their semantics. I can imagine it happening with generic (unknown word handling) predicates, in some cases, or maybe in Thai, but there are other options for representing words with spaces (e.g., "_ad+hoc_a_1_rel").

arademaker commented 9 months ago

In (2), I just agree with what you said:

In practice, you could use a string predicate in MRS to use characters that wouldn't be allowed in a regular symbol, such as spaces, but it's probably best to not use it for that purpose.

By the way, MrsRFC has another mistake, TypePred should be

TypePred     := /_?(_[^_\s]+)*(_rel)?/

This is what you have in simplemrs.py but simplified; this means that _ needs to be a prefix of the optional sense, not a suffix.

I know that @oepen cares about these things... hope to have his attention on this issue. I'm not sure if @john-a-carroll can also make comments. The basemrs.lisp has

(defun read-mrs-predicate (stream)
  (loop
      for c = (peek-char nil stream nil nil)
      then (peek-char nil stream nil nil)
      while (and c (whitespacep c)) do (read-char stream nil nil))
  (let* ((c (peek-char nil stream nil nil))
         (string
          (if (char= c #\")
            (read stream nil nil)
            (coerce
             (loop
                 for c = (read-char stream nil nil)
                 while (and c (not (whitespacep c))
                            (not (member c '(#\< #\[ #\") :test #'char=)))
                 collect c
                 finally (when (and c (not (whitespacep c)))
                           (unread-char c stream)))
             'string))))
    (cond
     ((zerop (length string))
      (error
       "unexpected end of file in read-mrs-predicate() at position ~a"
       (file-position stream)))
     (*normalize-predicates-p* (normalize-predicate string))
     ((char= c #\") string)
     (t (vsym string)))))

If I got it right, it will not read _<cat>/NN_u_unknown below:

% ace -g ~/hpsg/erg.dat -Tf1
The <cat> is white
SENT: The <cat> is white
[ LTOP: h0
INDEX: e2 [ e SF: prop TENSE: pres MOOD: indicative PROG: - PERF: - ]
RELS: < [ _the_q<0:3> LBL: h4 ARG0: x3 [ x PERS: 3 NUM: sg ] RSTR: h5 BODY: h6 ]
 [ _<cat>/NN_u_unknown<4:9> LBL: h7 ARG0: x3 ]
 [ _white_a_1<13:18> LBL: h1 ARG0: e2 ARG1: x3 ] >
HCONS: < h0 qeq h1 h5 qeq h7 >
ICONS: < > ]
NOTE: 1 readings, added 1313 / 347 edges to chart (123 fully instantiated, 102 actives used, 69 passives used)  RAM: 4684k

I prefer not to mix the conventions about the names (e.g., the surface predicate names convention of _lemma_pos_sense) with the semantics of MRS. In other words, I would prefer to follow the MrsRFC rather than PyDelphin.

What I didn't get is the SQSYMBOL! The REGEX matches only a single quote followed by one char '([^ \n:<>\[\]])

In the erg/tsdb/gold/ws* I found no occurrence of that:

% for f in ws*; do gzcat $f/result.gz | awk -F "@" '{print $14}' | rg -o "[^ ]+'[^ ]+"; done
_world's+fair_n_1<95:107>
"O'Connor"
"O'Connor"
"O'Connell"
"O'Connor"
"O'Connell"
"O'Connell"
"D'Arcy"
"D'Arcy"
"D'Arcy"
"O'Connor"
"D'Arcy"
"D'Arcy"
"O'Connell"
"O'Connell"
"O'Connell"

goodmami commented 9 months ago

I think you're right that there's a bug in the TypePred production in MrsRFC. E.g., it doesn't fully capture the predicate below:

>>> import re
>>> r = re.compile(r"_?([^_\s]+_)*(_rel)?")
>>> r.match('_foo_n_unmatched')
<re.Match object; span=(0, 7), match='_foo_n_'>

But your proposal also has the same bug on the other side:

>>> r2 = re.compile(r"_?(_[^_\s]+)*(_rel)?")
>>> r2.match('_foo_n_unmatched')
<re.Match object; span=(0, 1), match='_'>

Something like this would work better:

>>> r3 = re.compile(r"_?([^_\s]+_)*[^_\s]+(_rel)?")
>>> r3.match('_foo_n_matched')
<re.Match object; span=(0, 14), match='_foo_n_matched'>

But the next problem is to exclude the lnk characterization:

>>> r3.match('_foo_n_matched<0:4> ')
<re.Match object; span=(0, 19), match='_foo_n_matched<0:4>'>

If we change those character classes to [^_\s<], then we won't match things like _<cat>/NN_u_unknown. This is partly why PyDelphin's surface-predicate regex is so complicated:

>>> r4 = re.compile(r"_[^\s_]+_[nvajrscpqxud](?:_(?:[^\s_<]|<(?![-0-9:#@ ]*>\s))+)?(?:_rel)?")
>>> r4.match('_foo_n_matched')
<re.Match object; span=(0, 14), match='_foo_n_matched'>
>>> r4.match('_foo_n_matched<0:4> ')
<re.Match object; span=(0, 14), match='_foo_n_matched'>
>>> r4.match('_<cat>/NN_u_unknown')
<re.Match object; span=(0, 19), match='_<cat>/NN_u_unknown'>

Two notes about this:

This requires a space after the lnk characterization, otherwise the <0:4> would be part of the match. This isn't ideal, but it works in the context of SimpleMRS, where you'd never have an EP with nothing but a predicate.
This regex is only for unquoted surface predicates; not abstract or quoted predicates.
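
The first note can be checked directly. The sketch below reuses the r4 pattern from above unchanged; `pred_only` is just a hypothetical convenience wrapper.

```python
import re

# Same pattern as r4 above.
r4 = re.compile(
    r"_[^\s_]+_[nvajrscpqxud](?:_(?:[^\s_<]|<(?![-0-9:#@ ]*>\s))+)?(?:_rel)?"
)


def pred_only(text: str) -> str:
    """Return the part of text that r4 matches from the start, or ''."""
    m = r4.match(text)
    return m.group(0) if m else ""


print(pred_only("_foo_n_matched<0:4> "))  # lnk followed by a space
print(pred_only("_foo_n_matched<0:4>"))   # lnk at the very end of the input
```

The first call yields `_foo_n_matched`; the second yields `_foo_n_matched<0:4>`, because the lookahead insists on a whitespace character after the closing `>`.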

The issues above are why I said that the MrsRFC page could use some revision. For instance, it might be good to say that < is disallowed anywhere in an unquoted predicate, but then it would not work for the _<cat>/NN_u_unknown example you gave.

I prefer not to mix the conventions about the names (e.g., the surface predicate names convention of _lemma_pos_sense) with the semantics of MRS.

I agree. The reason the regex spells out the convention for unquoted surface predicates is that I rely on the structure to allow < characters in particular places (the lemma field, or the sense field as long as it's not followed by lnk strings). In fact, all that complexity in the sense field is only to allow < characters.

goodmami commented 9 months ago

What I didn't get is the SQSYMBOL!

This is a deprecated form that persisted in the grammar matrix (and all customized grammars) until mid-2014: https://github.com/delph-in/matrix/blob/d2edc981aec5b347bf82542380a540d6695688bf/matrix-core/matrix.tdl#L3654-L3674

Also see the last two lines from the Type vs String paragraph of the PredicateRfc wiki:

Quoted string preds may only use surrounding double quotes (e.g., "_quote_n_1_rel"). An open-single-quoted variant (e.g., 'null_coord_rel) used to be available, but it has been deprecated.

The single-quoted variant was from Lisp's quoted symbols: https://www.gnu.org/software/emacs/manual/html_node/elisp/Quoting.html

The only reason PyDelphin allows it is to be able to parse the output from old grammars and profiles. However, it seems like there is a bug in the current pattern as it only accepts single-character single-quoted symbols:

>>> from delphin.codecs import simplemrs
>>> simplemrs.decode("[RELS: < [ 'a LBL: h0 ] > ]").rels
[<EP object (h0:a()) at 139694271815616>]
>>> simplemrs.decode("[RELS: < [ 'abc LBL: h0 ] > ]").rels
Traceback (most recent call last):
  [...]
delphin.mrs._exceptions.MRSSyntaxError: 
  line 1, character 13
    [RELS: < [ 'abc LBL: h0 ] > ]
                 ^
MRSSyntaxError: expected: a feature

I haven't heard any complaints about this, so maybe there isn't much demand for support for this legacy syntax.
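
For reference, the single-character behavior follows directly from the pattern quoted earlier in this thread. The second pattern below is one possible repair (simply adding `+`); it is an assumption for illustration and not necessarily what the eventual fix looks like.

```python
import re

# Single-quoted symbol pattern as quoted earlier in this thread:
# a quote followed by exactly one permitted character.
sq_one = re.compile(r"'([^ \n:<>\[\]])")

# Possible repair (an assumption, not necessarily the actual fix):
# allow one or more permitted characters after the quote.
sq_many = re.compile(r"'([^ \n:<>\[\]]+)")

text = "'null_coord_rel LBL: h0"
print(sq_one.match(text).group(1))   # only the first character
print(sq_many.match(text).group(1))  # the whole symbol
```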

arademaker commented 9 months ago

Hmm, your regex is capturing the Lnk characterization too, with or without following spaces:

>>> import re
>>> r4 = re.compile(r"_[^\s_]+_[nvajrscpqxud](?:_(?:[^\s_<]|<(?![-0-9:#@ ]*>\s))+)?(?:_rel)?")
>>> r4.match('_<cat>/NN_u_unknown<1,2>')
<re.Match object; span=(0, 24), match='_<cat>/NN_u_unknown<1,2>'>
>>> r4.match('_<cat>/NN_u_unknown<1,2> ')
<re.Match object; span=(0, 24), match='_<cat>/NN_u_unknown<1,2>'>

arademaker commented 9 months ago

The issues above are why I said that the MrsRFC page could use some revision. For instance, it might be good to say that < is disallowed anywhere in an unquoted predicate, but then it would not work for the _<cat>/NN_u_unknown example you gave.

I also opened an issue at https://github.com/delph-in/erg/issues/45

arademaker commented 9 months ago

The negative lookahead would make my life hard! I'm not sure how to implement it using parser combinators.
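
For what it's worth, many combinator libraries offer a "not followed by" primitive that succeeds without consuming input when another parser fails. A minimal hand-rolled sketch in Python follows; all names here (`regex`, `not_followed_by`, `seq`, `alt`, `many1`) are hypothetical and not taken from any particular library, and the predicate parser ignores the lemma/pos/sense structure that r4 relies on.

```python
import re
from typing import Callable, Optional, Tuple

# A parser maps (text, position) to (matched text, new position), or None.
Result = Optional[Tuple[str, int]]
Parser = Callable[[str, int], Result]


def regex(pattern: str) -> Parser:
    compiled = re.compile(pattern)

    def parse(text: str, pos: int) -> Result:
        m = compiled.match(text, pos)
        return (m.group(0), m.end()) if m else None
    return parse


def not_followed_by(p: Parser) -> Parser:
    """Succeed without consuming input only if p fails at this position."""
    def parse(text: str, pos: int) -> Result:
        return ("", pos) if p(text, pos) is None else None
    return parse


def seq(*parsers: Parser) -> Parser:
    def parse(text: str, pos: int) -> Result:
        parts = []
        for p in parsers:
            r = p(text, pos)
            if r is None:
                return None
            parts.append(r[0])
            pos = r[1]
        return ("".join(parts), pos)
    return parse


def alt(*parsers: Parser) -> Parser:
    def parse(text: str, pos: int) -> Result:
        for p in parsers:
            r = p(text, pos)
            if r is not None:
                return r
        return None
    return parse


def many1(p: Parser) -> Parser:
    """Apply p one or more times; p must consume input when it succeeds."""
    def parse(text: str, pos: int) -> Result:
        parts = []
        while True:
            r = p(text, pos)
            if r is None:
                break
            parts.append(r[0])
            pos = r[1]
        return ("".join(parts), pos) if parts else None
    return parse


# A "<" is part of the predicate unless it starts a lnk followed by whitespace.
lnk = regex(r"<[-0-9:#@ ]*>\s")
pred_char = alt(regex(r"[^\s<]"), seq(not_followed_by(lnk), regex(r"<")))
predicate = many1(pred_char)

print(predicate("_<cat>/NN_u_unknown<0:4> ", 0))
```

The design point is that the lookahead lives in one small combinator, so the rest of the grammar stays free of it.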

arademaker commented 9 months ago

In the WSJ profiles I found

1._<number>2</number>/NN_u_unknown

_</ref>/JJ_u_unknown
_escapement</NN_u_unknown

goodmami commented 9 months ago

Hmm, your regex is capturing the Lnk characterization too, with or without following spaces:

You have an invalid lnk characterization string: <1,2>. Commas are not used as delimiters. Possible forms are described in the delphin.lnk docs.
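
As a rough guide, the character class in the lookahead (`[-0-9:#@ ]`) hints at the delimiters involved. The sketch below encodes my reading of the four shapes (colon-separated span, hash-separated span, space-separated token list, `@`-prefixed identifier); the labels and the exact patterns are assumptions made for this sketch, and the delphin.lnk docs remain the authority.

```python
import re
from typing import Optional

# Labels and patterns are illustrative assumptions, not PyDelphin's code.
LNK_FORMS = [
    ("charspan", re.compile(r"<(-?\d+):(-?\d+)>")),   # e.g. <0:4>, <-1:-1>
    ("chartspan", re.compile(r"<(\d+)#(\d+)>")),      # e.g. <0#4>
    ("tokens", re.compile(r"<(\d+(?: \d+)*)>")),      # e.g. <0 1 2>
    ("edge", re.compile(r"<@(\d+)>")),                # e.g. <@7>
]


def classify_lnk(s: str) -> Optional[str]:
    """Return the label of the first form that matches s entirely, else None."""
    for name, pattern in LNK_FORMS:
        if pattern.fullmatch(s):
            return name
    return None
```

Under this reading, `<1,2>` matches none of the forms, which is why the lookahead in r4 does not treat it as a lnk.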

>>> import re
>>> r4 = re.compile(r"_[^\s_]+_[nvajrscpqxud](?:_(?:[^\s_<]|<(?![-0-9:#@ ]*>\s))+)?(?:_rel)?")
>>> r4.match('_<cat>/NN_u_unknown<0:4> ')
<re.Match object; span=(0, 19), match='_<cat>/NN_u_unknown'>

The issues above are why I said that the MrsRFC page could use some revision. For instance, it might be good to say that < is disallowed anywhere in an unquoted predicate, but then it would not work for the _<cat>/NN_u_unknown example you gave.

I also opened an issue at https://github.com/delph-in/erg/issues/45

For ease of parsing, even better is if we could simply say:

[, ], <, >, :, and whitespace are control characters
"[^"\\]*(?:\\.[^"\\]*)*" is a string
[^[\]<>:"\s]+ is a symbol

Anything else is a lexing error. Control characters and quotes may not be in symbols. If you want control characters in things like predicates, they must be quoted. This would be completely oblivious to predicate conventions.
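
A lexer for that proposal fits in a few lines. This is a sketch of the idea only, with hypothetical names (`TOKEN`, `lex`); it is not how PyDelphin currently tokenizes SimpleMRS.

```python
import re

# The three token classes from the proposal, plus whitespace to skip.
TOKEN = re.compile(
    r'(?P<string>"[^"\\]*(?:\\.[^"\\]*)*")'
    r'|(?P<control>[\[\]<>:])'
    r'|(?P<symbol>[^\[\]<>:"\s]+)'
    r'|(?P<space>\s+)'
)


def lex(text: str) -> list:
    """Split text into (kind, value) pairs; raise ValueError on anything else."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"lexing error at character {pos}")
        if m.lastgroup != "space":
            tokens.append((m.lastgroup, m.group(0)))
        pos = m.end()
    return tokens


print(lex('[ named<4:7> CARG: "Bla" ]'))
print([value for _, value in lex("_<cat>/NN_u_unknown")])
```

The second call shows the cost: an unquoted `_<cat>/NN_u_unknown` is split into five tokens instead of being read as one predicate.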

Unfortunately this would mean that we could no longer parse some predicates that our tools currently output, so it would need to be a backwards incompatible change involving coordination with tool developers and a deprecation period. If you feel strongly about it, propose the changes to MrsRFC, invite discussion, and get consensus.

arademaker commented 9 months ago

You have an invalid lnk characterization string: <1,2>.

Oh... sorry. My mistake. I didn't know about the token indices type of Lnk. Does Ace support it?

I agree with your suggestion. The read-mrs-predicate from http://svn.delph-in.net/trunk/lingo/lkb/src/mrs/basemrs.lisp is even more restricted:

(coerce
             (loop
                 for c = (read-char stream nil nil)
                 while (and c (not (whitespacep c))
                            (not (member c '(#\< #\[ #\") :test #'char=)))
                 collect c
                 finally (when (and c (not (whitespacep c)))
                           (unread-char c stream)))
             'string)

it collects chars that are:

not whitespace
not <, [, "

Surely all the problems with < and > are related to how the ERG was possibly prepared to read XML and HTML. Maybe @danflick can help us with some suggestions to simplify the MRS grammar.

goodmami commented 9 months ago

@arademaker that's not more restricted, it's more permissive. That means that > and ] can appear in unquoted symbols (I think the part that reads quoted strings is the (read stream nil nil) just above the part you copied, meaning it reads strings according to Lisp syntax).
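
The unquoted branch of that Lisp loop can be approximated in Python to see what it accepts. This is a sketch under the assumption that Python's `\s` is close enough to the LKB's `whitespacep`; `read_unquoted_pred` is a hypothetical name.

```python
import re

# Collect characters until whitespace or one of <, [, " is seen.
UNQUOTED = re.compile(r'[^\s<\["]+')


def read_unquoted_pred(text: str) -> str:
    """Skip leading whitespace, then read an unquoted predicate (or '')."""
    m = UNQUOTED.match(text.lstrip())
    return m.group(0) if m else ""


print(read_unquoted_pred("foo>bar] LBL: h1"))
print(read_unquoted_pred("_<cat>/NN_u_unknown<4:9>"))
```

The first call returns `foo>bar]`, since `>` and `]` are not in the stop set. The second returns just `_`, since reading stops at the first `<`.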

arademaker commented 9 months ago

Yet another error in the MRS grammar

EP := "[" Pred Lnk? Label Rarg* Carg? "]"

but CARG appears first in the Ace output:

SENT: The Bla is white.
[ LTOP: h0
INDEX: e2 [ e SF: prop TENSE: pres MOOD: indicative PROG: - PERF: - ]
RELS: < [ _the_q<0:3> LBL: h4 ARG0: x3 [ x PERS: 3 NUM: sg IND: + ] RSTR: h5 BODY: h6 ]
 [ named<4:7> LBL: h7 CARG: "Bla" ARG0: x3 ]
 [ _white_a_1<11:16> LBL: h1 ARG0: e2 ARG1: x3 ] >
HCONS: < h0 qeq h1 h5 qeq h7 >
ICONS: < > ]

I don't know how to serialize an MRS from the LKB (in the SimpleMRS format, as you call it), but the UI shows CARG as the last argument! So the order is not defined by the grammar, but by the tool.

arademaker commented 9 months ago

That means that > and ] can appear in unquoted symbols

Yes, you are right, they can. But the real problem is the <, right?

(defun read-mrs-predicate (stream)
  (loop
      for c = (peek-char nil stream nil nil)
      then (peek-char nil stream nil nil)
      while (and c (whitespacep c)) do (read-char stream nil nil))
  (let* ((c (peek-char nil stream nil nil))
         (string
          (if (char= c #\")
            ;; if true, read the whole string using the Lisp read function
            (read stream nil nil)
            ;; else read while NOT whitespace and NOT one of < or [ or "
            (coerce
             (loop
                 for c = (read-char stream nil nil)
                 while (and c (not (whitespacep c))
                            (not (member c '(#\< #\[ #\") :test #'char=)))
                 collect c
                 finally (when (and c (not (whitespacep c)))
                           (unread-char c stream)))
             'string))))
    (cond
     ((zerop (length string))
      (error
       "unexpected end of file in read-mrs-predicate() at position ~a"
       (file-position stream)))
     (*normalize-predicates-p* (normalize-predicate string))
     ((char= c #\") string)
     (t (vsym string)))))

goodmami commented 9 months ago

Yet another error in the MRS grammar

EP := "[" Pred Lnk? Label Rarg* Carg? "]"

but CARG appears first in the Ace output:

Hmm, I think you're right that this is overly-prescriptive, and I hadn't noticed that ACE puts it before the role-arguments. There should only be one CARG, but maybe the syntax is the wrong place to enforce that constraint. A minimal change would be something like:

EP     := "[" Pred Lnk? Label FVPair* "]"
FVPair := Rarg | Carg
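
With that change, the "at most one CARG" rule would move from the grammar to a check after parsing. A minimal sketch of the idea follows; the role pattern and the `read_fvpairs` helper are assumptions for illustration, and the sketch ignores variable property lists such as `[ x PERS: 3 ]`.

```python
import re

# One feature-value pair: a role name, a colon, then a quoted string or symbol.
# The role pattern is an assumption for this sketch, not taken from the MrsRFC.
FVPAIR = re.compile(
    r'(?P<role>[A-Za-z][-A-Za-z0-9]*):\s*'
    r'(?P<value>"[^"\\]*(?:\\.[^"\\]*)*"|[^\s\]]+)'
)


def read_fvpairs(body: str):
    """Read pairs in any order; return (role arguments, CARG or None)."""
    rargs = {}
    carg = None
    for m in FVPAIR.finditer(body):
        role, value = m.group("role"), m.group("value")
        if role == "CARG":
            # The uniqueness constraint is enforced here, not in the syntax.
            if carg is not None:
                raise ValueError("more than one CARG")
            carg = value
        else:
            rargs[role] = value
    return rargs, carg


print(read_fvpairs('CARG: "Bla" ARG0: x3'))
print(read_fvpairs('ARG0: x3 CARG: "Bla"'))
```

Both calls give the same result, so the ACE ordering and the CARG-last ordering are treated alike.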

But the real problem is the < right?

Yes, it's not very ambiguous if you see > in a predicate without having seen < previously. But being overly permissive makes parsing harder, at least for some parsing paradigms. With scannerless parsing, it's not a big deal, but if you're lexing first, it helps to have well-defined control characters and token types.

arademaker commented 8 months ago

Ace does not impose any particular order on the arguments of a predication. It is perfectly happy with LBL not being the first argument, for example.

[ LTOP: h0
INDEX: e2 [ e SF: prop TENSE: pres MOOD: indicative PROG: + PERF: - ]
RELS: < [ _the_q<0:3> LBL: h4 RSTR: h5 ARG0: x3 [ x PERS: 3 NUM: sg IND: + ] BODY: h6 ]
 [ _white_a_1<4:9> LBL: h7 ARG0: e8 [ e SF: prop TENSE: untensed MOOD: indicative PROG: bool PERF: - ] ARG1: x3 ]
 [ _cat_n_1<10:13> ARG0: x3 LBL: h7 ]
 [ _run_v_1<17:24> LBL: h1 ARG0: e2 ARG1: x3 ] >
HCONS: < h0 qeq h1 h5 qeq h7 >
ICONS: < > ]

goodmami commented 8 months ago

@arademaker did that MRS come out of ACE? Because it looks like ACE has the LBL position fixed:

                l += safe_snprintf(str+l, len-l, "%c[ %s<%d:%d> LBL: ", mrs_tab, ep->pred, ep->cfrom, ep->cto);
                l += snprint_mrs_var_marked(str+l, len-l, ep->lbl, marked);
                for(j=0;j<ep->nargs;j++)
                {
                        l += safe_snprintf(str+l, len-l, " %s: ", ep->args[j].name);
                        if(ep->args[j].value)
                                l += snprint_mrs_var_marked(str+l, len-l, ep->args[j].value, marked);
                        else l += safe_snprintf(str+l, len-l, " (null)");
                }
                l += safe_snprintf(str+l, len-l, " ]");

(permalink: https://github.com/delph-in/ace/blob/19576aff0f7c74e6ff904405e2ca21f2c9afe8ff/mrs.c#L644-L653)

LBL is not an argument.

arademaker commented 8 months ago

@goodmami ,

The MRS was edited manually by me, but parsed by Ace

% cat lixo.mrs | ace -g ~/hpsg/erg.dat -e
The yellow white Alfred is running.
The white yellow Alfred is running.
NOTE: 295 passive, 374 active edges in final generation chart; built 376 passives total. [2 results]

NOTE: generated 1 / 1 sentences, avg 5618k, time 0.16564s
NOTE: transfer did 477 successful unifies and 533 failed ones

% cat lixo.mrs
[ LTOP: h0
INDEX: e2 [ e SF: prop TENSE: pres MOOD: indicative PROG: + PERF: - ]
RELS: <
 [ _the_q<0:3> LBL: h4
   RSTR: h5 ARG0: x3 [ x PERS: 3 NUM: sg IND: + ] BODY: h6 ]
 [ _white_a_1<4:9> LBL: h7 ARG0:
   e8 [ e SF: prop TENSE: untensed MOOD: indicative PROG: bool PERF: - ] ARG1: x3 ]
 [ _yellow_a_1<10:34> LBL: h7 ARG0: e9 ARG1: x3 ]
 [ named<10:13> ARG0: x3 LBL: h7 CARG: "Alfred" ]
 [ _run_v_1<17:24> LBL: h1 ARG0: e2 ARG1: x3 ] >
HCONS: < h0 qeq h1 h5 qeq h7 >
ICONS: < > ]

Actually, this holds even if I add properties to handles; see the nonsensical MRS below (properties on h6):

% cat lixo.mrs
[ LTOP: h0
INDEX: e2 [ e SF: prop TENSE: pres MOOD: indicative PROG: + PERF: - ]
RELS: <
 [ _the_q<0:3> LBL: h4
   RSTR: h5 ARG0: x3 [ x PERS: 3 NUM: sg IND: + ] BODY: h6 [ h PERF: - ] ]
 [ _white_a_1<4:9> LBL: h7 ARG0:
   e8 [ e SF: prop TENSE: untensed MOOD: indicative PROG: bool PERF: - ] ARG1: x3 ]
 [ _yellow_a_1<10:34> LBL: h7 ARG0: e9 ARG1: x3 ]
 [ named<10:13> ARG0: x3 LBL: h7 CARG: "Alfred" ]
 [ _run_v_1<17:24> LBL: h1 ARG0: e2 ARG1: x3 ] >
HCONS: < h0 qeq h1 h5 qeq h7 >
ICONS: < > ]

Ace parsed it; the error was raised during generation:

% cat lixo.mrs | ace -g ~/hpsg/erg.dat -e -v
NOTE: loading frozen grammar ERG (2023)
NOTE: semantic index hash contains 32886 entries in 65536 slots
NOTE: max-ent model hash contains 449735 entries in 1048576 slots
NOTE: 12063 types, 44117 lexemes, 387 rules, 49 orules, 108 instances, 54553 strings, 264 features
permanent RAM: 3k

external MRS:
[ LTOP: h0 INDEX: e1 [ e SF: prop TENSE: pres MOOD: indicative PROG: + PERF: - ] RELS: < [ _the_q<0:3> LBL: h2 RSTR: h3 ARG0: x4 [ x PERS: 3 NUM: sg IND: + ] BODY: h5 [ h PERF: - ] ]  [ _white_a_1<4:9> LBL: h6 ARG0: e7 [ e SF: prop TENSE: untensed MOOD: indicative PROG: bool PERF: - ] ARG1: x4 ]  [ _yellow_a_1<10:34> LBL: h6 ARG0: e8 ARG1: x4 ]  [ named<10:13> LBL: h6 ARG0: x4 CARG: "Alfred" ]  [ _run_v_1<17:24> LBL: h10 ARG0: e1 ARG1: x4 ] > HCONS: < h0 qeq h10 h3 qeq h6 > ICONS: < > ]
...

You pointed to the pretty-printer, not to the parser. During parsing, Ace is very flexible: https://github.com/delph-in/ace/blob/19576aff0f7c74e6ff904405e2ca21f2c9afe8ff/mrs.c#L383-L402

goodmami commented 8 months ago

You pointed to the pretty-printer, not to the parser. During parsing, Ace is very flexible:

Yes, I intentionally pointed to the simple-mrs printing code because I was showing that the position of LBL is fixed by ACE on output. You are right that it can flexibly handle non-conventional positions for LBL on input.

My point of view is that PyDelphin should be able to read unconventional MRS serializations that come from tools like ACE or the LKB, but I don't feel a particular need to accommodate variations that come from hand-editing the MRSs. If you feel strongly about such flexibility and make a PR, I'll consider it since it probably doesn't cause any harm, but I'm not interested in making the change myself.

Since this GitHub issue is for PyDelphin and not the DELPH-IN wiki, let me try to review the suggested changes to PyDelphin:

:x: revise the variable pattern in delphin.variable (no change needed)
:heavy_check_mark: fix simplemrs serialization of predicates so those with reserved characters are quoted (#372)
:heavy_check_mark: fix reading of the legacy single-quoted symbols (#373)
:x: allow flexible positioning of LBL (I won't make the change, but I'll review a PR for this)

I think everything else is just questions about the DELPH-IN wiki, ACE, or the LKB, and such questions are probably better on the Discourse site as they'd reach a wider audience.

goodmami commented 8 months ago

Closing this. See the links above for further developments on those issues.

delph-in / pydelphin

parsing MRS #371