delph-in / erg

English Resource Grammar
MIT License

Unknown predicate symbols with spaces #2

Closed by goodmami 5 years ago

goodmami commented 7 years ago

JHU profile item # 3035631

edit: example redacted because it came from a test corpus...

The 50 000 is (correctly) tokenized as a single number, but the predicate symbol includes the space directly:

... [ _50 000/nn_u_unknown<25:31> LBL: h33 ARG0: x28 ] ...

In general I think the correct thing to do is escape the space: _50\ 000/nn_u_unknown. But the character is actually a no-break space (U+00A0) and not the regular space (U+0020), so maybe my processor (pyDelphin, in this case) should simply not break at the no-break space?
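To make the failure mode concrete, here is a minimal sketch of how a whitespace-based reader can end up cutting the predicate in two. It is not PyDelphin's actual parsing code. The fragment is rebuilt from the one quoted above on the assumption that the character inside the predicate is U+00A0, and the names `naive_tokens`, `safe_tokens` and `ASCII_WS` are made up for illustration.

```python
import re

NBSP = "\u00a0"  # no-break space

# Fragment shaped like the one in the report, with U+00A0 inside the predicate
# (assumed here; the quoted text above may have lost the original character).
fragment = f"[ _50{NBSP}000/nn_u_unknown<25:31> LBL: h33 ARG0: x28 ]"

# str.split() with no arguments splits on any Unicode whitespace,
# and U+00A0 counts as whitespace, so the predicate is cut in two.
naive_tokens = fragment.split()

# Splitting on ASCII whitespace only leaves the no-break space alone,
# so the predicate stays in one piece.
ASCII_WS = re.compile(r"[ \t\r\n]+")
safe_tokens = [tok for tok in ASCII_WS.split(fragment) if tok]

print(naive_tokens[1:3])  # the two halves of the broken predicate
print(safe_tokens[1])     # the intact predicate
```

Restricting the delimiter set is only one option. Escaping the space on the producing side, as suggested above, would avoid the problem for readers that split on any whitespace.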

goodmami commented 6 years ago

I think the outcome of the discussion on this was that having the non-breaking space there is not ideal, since spaces in predicates would usually become a + separating the two sides. In pyDelphin I will stop breaking on non-breaking spaces anyway.

danflick commented 5 years ago

The preprocessor now inserts an underscore rather than a non-breaking space for 50 000, so it shouldn't cause trouble for other tools that may not be as robust as pydelphin is now.
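For tools that still receive input with no-break spaces inside numbers, a similar normalization could be applied on the consuming side. The sketch below is not the ERG preprocessor rule, which is not shown in this thread. `join_digit_groups` is a hypothetical helper, and the choice to rewrite only a U+00A0 that sits between two digits is an assumption.

```python
import re

NBSP = "\u00a0"  # no-break space

# Match a no-break space only when it sits between two digits.
# The lookarounds do not consume the digits, so runs such as
# 1<NBSP>000<NBSP>000 are handled in one pass.
DIGIT_GROUP_GAP = re.compile(rf"(?<=\d){NBSP}(?=\d)")


def join_digit_groups(text: str) -> str:
    """Hypothetical helper: replace NBSP between digit groups with '_'."""
    return DIGIT_GROUP_GAP.sub("_", text)


print(join_digit_groups(f"The 50{NBSP}000 is tokenized as one number"))
```

A no-break space that is not between digits is left untouched, so other uses of U+00A0 in the text are not affected.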