delph-in / erg

English Resource Grammar
MIT License

wsj corpus: unexpected predicate names #46

Open arademaker opened 8 months ago

arademaker commented 8 months ago

wsj201

Item

1000000400370@unknown@formal@none@1@S@⌊•⌊#1965, ⌊>H. A. Simon>⌋: "[M]achines will be capable, within twenty years, of doing any work a man can do"#⌋@@@@1@19@@oe@26-8-2013

% ace -g ../erg.dat -E
⌊•⌊#1965, ⌊>H. A. Simon>⌋: "[M]achines will be capable, within twenty years, of doing any work a man can do"#⌋
1965 , H. A. Simon : “ [ M]achines will be capable , within twenty years , of doing any work a man can do ”

The token M]achines generates the predicate _m]achines/NNS_u_unknown. Does that make sense?

arademaker commented 8 months ago
% ace -g ../erg.dat -E 
⌊∗The clock∗⌋: Bolter credits the invention of the weight-driven ⌊>clock>⌋ as “The key invention [of Europe in the Middle Ages]", in particular the ⌊>verge escapement>⌋< (Bolter 1984:24) that provides us with the tick and tock of a mechanical clock.
The clock : Bolter credits the invention of the weight - driven clock as “ The key invention [ of Europe in the Middle Ages ] ” , in particular the verge escapement< ( Bolter 1984:24 ) that provides us with the tick and tock of a mechanical clock .

See the < in the word escapement<. Maybe a bug introduced when the markup was added?

Hi @danflick, see https://github.com/delph-in/pydelphin/issues/371#issuecomment-1818265817. A complicated regex is needed to allow < in predicate names. Can we avoid that? I would prefer not to treat the ERG's predicate naming conventions as part of the grammar of the MRS text representation.

I could not confirm the original content: neither https://catalog.ldc.upenn.edu/LDC2013T19 nor https://catalog.ldc.upenn.edu/LDC99T42 contains the 201 set.
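
To find other items with the same problem, a rough scan over predicate names might help. The sketch below uses only the standard `re` module; the "plain lemma" character class and the helper names `lemma_of` and `is_suspicious` are assumptions made for illustration, not a convention defined by the ERG or pydelphin.

```python
import re

# Assumption: a "plain" lemma uses only letters, digits, and a few joiners.
# This character class is illustrative, not an ERG or pydelphin definition.
PLAIN_LEMMA = re.compile(r"[A-Za-z0-9'+.\-]+")


def lemma_of(predicate: str) -> str:
    """Return the lemma field of a predicate such as _m]achines/NNS_u_unknown."""
    body = predicate.lstrip("_")       # drop the leading underscore, if any
    lemma = body.split("_", 1)[0]      # text before the next underscore
    return lemma.split("/", 1)[0]      # drop a POS tag such as /NNS


def is_suspicious(predicate: str) -> bool:
    """True when the lemma contains characters outside the assumed plain set."""
    return PLAIN_LEMMA.fullmatch(lemma_of(predicate)) is None


for pred in ["_m]achines/NNS_u_unknown", "_word_n_of", "_the_q"]:
    print(pred, is_suspicious(pred))
```

Only the first predicate is flagged, because its lemma contains `]`. A lemma ending in `<` would be flagged the same way.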

fcbond commented 8 months ago

I guess we would need a special pattern for brackets around the first letter or letters: [Mm]achine. Do we have something for the optional plural as in word(s)?
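
As a rough illustration of what such patterns could look like outside the grammar, here is a minimal Python sketch. It is not the ERG's token-mapping or REPP machinery; both regular expressions and both function names are assumptions for illustration only.

```python
import re

# Editorial bracket around 1-3 leading letters, directly followed by more
# letters, as in "[M]achines". A bracketed phrase such as
# "[of Europe in the Middle Ages]" is left alone.
EDITORIAL_BRACKET = re.compile(r"\[([A-Za-z]{1,3})\](?=[A-Za-z])")

# Optional plural written as a parenthesized "s" after a word, as in "word(s)".
OPTIONAL_PLURAL = re.compile(r"\b([A-Za-z]+)\(s\)")


def strip_editorial_brackets(text: str) -> str:
    """Remove brackets that wrap only the first letter(s) of a word."""
    return EDITORIAL_BRACKET.sub(r"\1", text)


def optional_plurals(text: str) -> list:
    """Return the stems of words written with an optional plural marker."""
    return OPTIONAL_PLURAL.findall(text)


print(strip_editorial_brackets("[M]achines will be capable"))
print(optional_plurals("The word(s)"))
```

Limiting the bracketed part to a few letters with a letter lookahead is a design choice to avoid touching bracketed insertions like the one in the second example sentence above.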

arademaker commented 8 months ago
% ace -g ../erg.dat -E
The word(s) 
The word(s)

% ace -g ../erg.dat -Tf
The word(s)
SENT: The word(s)
[ LTOP: h0
INDEX: e2 [ e SF: prop ]
RELS: < [ unknown<0:11> LBL: h1 ARG0: e2 ARG: x4 [ x PERS: 3 NUM: pl IND: + ] ]
 [ _the_q<0:3> LBL: h5 ARG0: x4 RSTR: h6 BODY: h7 ]
 [ _word_n_of<4:11> LBL: h8 ARG0: x4 ARG1: i9 ] >
HCONS: < h0 qeq h1 h6 qeq h8 >
ICONS: < > ]
NOTE: 1 readings, added 428 / 50 edges to chart (17 fully instantiated, 22 actives used, 11 passives used)  RAM: 1337k

There is no marking for the 'optional' plural: word(s) is always treated as plural and as a single token.
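
The single-token reading can be checked against the character spans in the output above. The following sketch pulls the `<from:to>` spans out of the MRS text with a plain regular expression and slices the input string. The pattern and the helper `predicate_spans` are assumptions for illustration, not pydelphin's parser.

```python
import re

# Matches the start of an elementary predication such as "[ _word_n_of<4:11>".
# Assumption: the predicate name itself contains no "<", which is exactly the
# case this issue shows cannot be taken for granted.
EP_SPAN = re.compile(r"\[\s*([^\s<\[]+)<(\d+):(\d+)>")


def predicate_spans(mrs_text: str, sentence: str) -> list:
    """Pair each predicate with the substring its character span covers."""
    return [
        (pred, sentence[int(start):int(end)])
        for pred, start, end in EP_SPAN.findall(mrs_text)
    ]


sentence = "The word(s)"
mrs_text = """[ LTOP: h0
INDEX: e2 [ e SF: prop ]
RELS: < [ unknown<0:11> LBL: h1 ARG0: e2 ARG: x4 [ x PERS: 3 NUM: pl IND: + ] ]
 [ _the_q<0:3> LBL: h5 ARG0: x4 RSTR: h6 BODY: h7 ]
 [ _word_n_of<4:11> LBL: h8 ARG0: x4 ARG1: i9 ] >
HCONS: < h0 qeq h1 h6 qeq h8 >
ICONS: < > ]"""

for pred, surface in predicate_spans(mrs_text, sentence):
    print(pred, repr(surface))
```

The span 4:11 of `_word_n_of` covers the whole string `word(s)`, parentheses included.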