delph-in / erg

English Resource Grammar
MIT License

inconsistent analysis of symbols < and > #45

Open arademaker opened 9 months ago

arademaker commented 9 months ago

Try parsing:

The symbol > is cool
The symbol < is cool

The first sentence has 2 readings; the second has 7. The < is always interpreted as _less+than_a_1. The > can be _greater+than_a_1 or quoted.

arademaker commented 9 months ago

Also, the ERG keeps the symbols only for <word>. Try:

  1. I have a [cat].
  2. I have a (cat).
  3. I have a <cat>.

@danflick see also https://github.com/delph-in/pydelphin/issues/371

danflick commented 9 months ago

The lexicon already includes a separate NP entry for the use of ">" as the name of the symbol, but lacked an analogous entry for "<". I have added the missing entry, and will check it in with the next update.
As for the brackets surrounding a word as in "I have a [cat]" it does not seem desirable to try to insert them into the name of the predicate, or into the value of the ARG attribute when the token is a named entity. I agree that it would be good to find some way to record the presence of these bracketing punctuation marks in the resulting MRS, but we'll need to figure out how best to do so.

arademaker commented 9 months ago

Sorry, GitHub interpreted the greater-than and less-than symbols as markup; I edited my previous comment.

@danflick, the crucial problem is the presence of < in the name of the predicate without any escaping or double quotes. To parse the text representation of the MRS, we need an easy way to distinguish it from the start of the Link (character positions). See here
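To illustrate the ambiguity, here is a minimal Python sketch of one way a reader of the text format could separate a predicate from its Link when the predicate itself may contain an unescaped <. It assumes the Link is always a trailing `<from:to>` pair of integers. The function name `split_pred_lnk` and the sample predicate strings are hypothetical, made up for this illustration; they are not taken from actual ERG output or from any library's API.

```python
import re

# Assumption: the Link is a trailing <from:to> character span with integer
# offsets. Anchoring at the end of the string means an earlier "<" inside
# the predicate name is not mistaken for the start of the Link.
LNK_RE = re.compile(r"<(\d+):(\d+)>$")


def split_pred_lnk(token):
    """Split 'pred<from:to>' into (pred, (from, to)); hypothetical helper."""
    match = LNK_RE.search(token)
    if match is None:
        # No trailing Link found: return the whole token as the predicate.
        return token, None
    span = (int(match.group(1)), int(match.group(2)))
    return token[:match.start()], span


# Hypothetical inputs, for illustration only.
plain = split_pred_lnk("_less+than_a_1<11:12>")
tricky = split_pred_lnk("_<cat>_n_1<4:9>")
no_lnk = split_pred_lnk("_cat_n_1")
```

This only works when the Link immediately follows the predicate with nothing after it. A reader that scans left to right and treats the first < as the start of the Link would still fail, which is why escaping or quoting in the serialized MRS would be the more robust fix.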

arademaker commented 8 months ago

> As for the brackets surrounding a word as in "I have a [cat]" it does not seem desirable to try to insert them into the name of the predicate, or into the value of the ARG attribute when the token is a named entity. I agree that it would be good to find some way to record the presence of these bracketing punctuation marks in the resulting MRS, but we'll need to figure out how best to do so.

@danflick, my problem is the opposite, if I understood your comment above correctly. Why preserve the < and > in the token?! I was expecting the same behavior for all of them, that is, separate tokens for <, >, [, ], ( and ).

% ace -g ../erg.dat -E   
The <cat> is write
The <cat> is write

The [cat] is white 
The [ cat ] is white

The (cat) is white
The ( cat ) is white
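The uniform behavior expected above can be sketched in a few lines of Python. This is only an illustration of the desired output, in which all six bracketing characters become separate tokens. It is not the ERG's actual tokenization rules, and the name `separate_brackets` is hypothetical.

```python
import re

# The six bracketing characters discussed above: < > [ ] ( )
BRACKETS_RE = re.compile(r"([<>\[\]()])")


def separate_brackets(text):
    """Pad every bracketing character with spaces, then normalize whitespace."""
    padded = BRACKETS_RE.sub(r" \1 ", text)
    return " ".join(padded.split())


angle = separate_brackets("The <cat> is white")
square = separate_brackets("The [cat] is white")
paren = separate_brackets("The (cat) is white")
```

With this treatment the angle-bracket case comes out as `The < cat > is white`, matching what ace already produces for the square-bracket and parenthesis cases.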