delph-in / pydelphin

Python libraries for DELPH-IN
https://pydelphin.readthedocs.io/
MIT License

Pydelphin cannot parse used_to_prp_v_rbst|use_to_aux_sp_rbst := v_p-vp_prp_le & #292

Closed fcbond closed 4 years ago

fcbond commented 4 years ago

PyDelphin cannot parse this entry (from the latest ERG makeover), but ACE can:

used_to_prp_v_rbst|use_to_aux_sp_rbst := v_p-vp_prp_le &
 [ ORTH < "use" >,
   SYNSEM [ LKEYS [ --COMPKEY _to_p_sel_rel,
                    KEYREL.PRED "_used+to_v_qmodal_rel" ],
            PHON.ONSET con ],
   GENRE robust ].

If I replace '|' with '___' it parses fine.

goodmami commented 4 years ago

The | character is specifically excluded from TDL identifiers. See this part of the TdlRfc wiki.

The reason is that #| is the sequence that starts a block comment; if | were allowed in identifiers, #| would be ambiguous with a coindexation (e.g., [ FOO #|, BAR.BAZ #| ]). See this message on the developer's list. This is therefore a problem with the ERG and not PyDelphin.

I don't think ACE or the old LKB has been revised to match the new specification, but LKB-FOS has. Can you try it with LKB-FOS to confirm?
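For anyone who wants a quick check for this particular problem without a full parse, here is a minimal sketch. It is only a heuristic: it treats the first run of non-space characters before a := or :+ operator as the identifier and ignores comments entirely, so it is no substitute for a real TDL parser. The function name pipe_identifiers is made up for illustration and is not part of PyDelphin.

```python
import re

# Heuristic only: take the run of non-space characters that precedes a
# definition operator (:= or :+) as the identifier. Comments, strings,
# and multi-line layouts are not handled.
DEFINITION = re.compile(r"^\s*(\S+)\s*:[=+]", re.MULTILINE)


def pipe_identifiers(tdl_text):
    """Return (line_number, identifier) pairs for identifiers containing '|'."""
    hits = []
    for match in DEFINITION.finditer(tdl_text):
        identifier = match.group(1)
        if "|" in identifier:
            # Count newlines before the identifier to get a 1-based line number.
            line_number = tdl_text.count("\n", 0, match.start(1)) + 1
            hits.append((line_number, identifier))
    return hits


entry = (
    "used_to_prp_v_rbst|use_to_aux_sp_rbst := v_p-vp_prp_le &\n"
    ' [ ORTH < "use" >,\n'
    "   GENRE robust ].\n"
)
for line_number, identifier in pipe_identifiers(entry):
    print(f"line {line_number}: {identifier}")
```

Replacing '|' with '___' in the entry, as Francis did, makes the list come back empty.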

goodmami commented 4 years ago

@danflick @oepen I don't have access to the latest ERG so I cannot test this, but Francis uncovered a problem with an identifier using an invalid character. I found that this is actually in the current trunk already, but I wasn't testing PyDelphin on files in subdirectories like educ/. Until I can see the ERG 2020 files, you can run something like this to look for other TDL errors:

find . \
     -name \*.tdl \
     ! -name \*config\*.tdl \
     -exec echo {} \; \
     -exec python -c 'import sys, delphin.tdl; list(delphin.tdl.iterparse(sys.argv[1]))' {} \;

(I ignore *config*.tdl because ACE configs are not really TDL)

Running this on the current trunk branch uncovered two PyDelphin bugs related to TDL parsing (#293 and #294), so be aware of those as you test.
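The same walk can be sketched in Python, which collects the failures instead of stopping to print a traceback per file. This is a rough equivalent of the find command above, not a tested tool: delphin.tdl.iterparse is the call used in that command, while the helper names tdl_files and check_tdl_files and the parse parameter are made up for illustration. Catching every Exception is deliberate here so that parser bugs are reported alongside genuine TDL errors.

```python
import fnmatch
from pathlib import Path


def tdl_files(root):
    """Yield *.tdl files under root, skipping *config*.tdl (ACE configs)."""
    for path in sorted(Path(root).rglob("*.tdl")):
        if fnmatch.fnmatch(path.name, "*config*.tdl"):
            continue
        yield path


def check_tdl_files(root, parse=None):
    """Parse each TDL file and return a list of (path, exception) failures.

    `parse` is any callable that takes a file path and returns an iterable;
    by default it is delphin.tdl.iterparse (requires PyDelphin).
    """
    if parse is None:
        from delphin import tdl  # imported lazily so the helper can be reused
        parse = tdl.iterparse
    failures = []
    for path in tdl_files(root):
        try:
            for _ in parse(str(path)):  # exhaust the iterator to force parsing
                pass
        except Exception as exc:  # broad on purpose: report anything that fails
            failures.append((path, exc))
    return failures


if __name__ == "__main__":
    for path, exc in check_tdl_files("."):
        print(f"{path}: {exc!r}")
```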

fcbond commented 4 years ago

It would actually be helpful if I could also parse the config.tdl for the ltsb.

It mostly parses; the only problem is lines like this:

preprocessor-modules := ../rpp/xml.rpp ../rpp/ascii.rpp ../rpp/quotes.rpp.

I fear these are not legal TDL --- can they be made so in some way?
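As a stopgap while the format question is open, lines of this shape can be read without a TDL parser. The sketch below assumes each setting sits on a single line of the form name := value value ... followed by a final period, and it does not handle comments, quoted strings containing spaces, or real TDL definitions in the same file. The name read_settings is hypothetical and is not part of PyDelphin or ACE.

```python
import re

# Assumed shape: NAME := VALUE [VALUE ...].   (one setting per line)
SETTING = re.compile(r"^\s*([\w-]+)\s*:=\s*(.*?)\s*\.\s*$")


def read_settings(text):
    """Map each single-line setting name to its whitespace-separated values."""
    settings = {}
    for line in text.splitlines():
        match = SETTING.match(line)
        if match:
            settings[match.group(1)] = match.group(2).split()
    return settings


config = "preprocessor-modules := ../rpp/xml.rpp ../rpp/ascii.rpp ../rpp/quotes.rpp.\n"
print(read_settings(config)["preprocessor-modules"])
```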


goodmami commented 4 years ago

@fcbond I have some more radical thoughts on this which I may bring up at the virtual summit.

Less radically, the file could include some way to indicate it's not regular TDL:

The last option introduces a change to TDL, but a backward compatible one I think.

danflick commented 4 years ago

Hi Mike,

Yes, that illegal character should have been in a commented-out definition, as it's part of a little experiment to see if I could have a single lexical mal-entry trigger two distinct error messages. I'll figure out a TDL-legal way to do that, and in the meantime, I've commented out that problematic definition in 'mo', soon to be 'trunk'. I'll run your script, thanks, and see what else needs attending to.

Dan

