delph-in / pydelphin

Python libraries for DELPH-IN
https://pydelphin.readthedocs.io/
MIT License

Possible FFTB or iTSDB problem #298

Closed lmorgadodacosta closed 4 years ago

lmorgadodacosta commented 4 years ago

While trying to process the output of FFTB, I noticed that the "tree" file it produced contains things that pydelphin's itsdb doesn't like. I have a hunch that this is not really pydelphin's problem, but I wanted to double-check. PyDelphin complains about characters that are not properly escaped. It seems FFTB might be writing some important information as comments, although this only happened with one of our 5 treebankers.

Example of FFTB output of file 'tree':

20004000150@1@0@-1@lmc@29-06-2020 15:55:04@29-06-2020 16:04:27@system rejects

20005000050@1@1@-1@lmc@29-06-2020 16:05:03@29-06-2020 16:09:33@

20005000060@1@1@-1@lmc@29-06-2020 16:09:34@29-06-2020 16:15:33@

20005000070@1@0@-1@lmc@29-06-2020 16:16:46@29-06-2020 16:26:12@17 new manual\nhd-pct_c\t=\t0 to 3\t[x]\nhdn_bnp_c \sn-hdn_cpd_c\t=\t12 to 14\t[x]\nn_pl_olr \sn_-_mc_le\t=\t16 to 17\t[x]\nn-hdn_cpd_c\t=\t18 to 20\t[x]\nn-hdn_cpd_c\t=\t22 to 25\t[x]\np_np_i_le\t=\t14 to 16\t[x]\nn_-_c_le\t=\t19 to 20\t[x]\nw_hasinitcap_dlr \sav_-_s-cp-mc-pr_le\t=\t0 to 2\t[x]\nn_-_mc_le\t=\t4 to 5\t[x]\nn_sg_ilr \sn_-_pn-gen_le\t=\t5 to 6\t[x]\nn_sg_ilr \sn_-_pn-gen_le\t=\t6 to 7\t[x]\nhdn_bnp-pn_c \snp-hdn_nme-cpd_c\t=\t3 to 9\t[x]\npt_-_comma_le\t=\t8 to 9\t[x]\nv_pst_olr \sv_np_le\t=\t9 to 10\t[x]\nhd_optcmp_c \sv_n3s-bse_ilr \sv_np*_le\t=\t11 to 12\t[x]\npt_-_comma_le\t=\t17 to 18\t[x]\nhd-cmp_u_c\t=\t14 to 25\t[x]\nSystem rejected

20005000140@1@0@-1@lmc@29-06-2020 16:27:45@29-06-2020 16:31:44@I don't think 'send' is a noun but system only provides noun options. \n\nv_n3s-bse_ilr \sv_np*_le\t=\t13 to 14\t[x]\nv_n3s-bse_ilr \sv_np*_le\t=\t11 to 12\t[x]\nhdn_bnp_c \snum-n_mnp_c\t=\t7 to 10\t[x]\ndet_prt-of-agr_dlr \sw_hasinitcap_dlr \sd_-_prt-plm_le\t=\t0 to 1\t[x]\nv_vp_mdl-p-pst_le\t=\t3 to 4\t[x]\nsb-hd_mc_c\t=\t0 to 10\t[x]\nhd_imp_c \sv-v_crd-fin-ncj_c\t=\t10 to 15\t[x]\ncl-cl_runon-cma_c\t=\t0 to 18\t[x]\n

20005000150@1@0@-1@lmc@29-06-2020 16:31:57@29-06-2020 16:32:03@Nothing to parse

20021000070@1@1@-1@lmc@29-06-2020 16:34:51@29-06-2020 16:39:04@

20007000180@1@0@-1@lmc@29-06-2020 16:49:54@29-06-2020 16:53:38@'Slip' I think is an adjective, as in non-slippery meaning of non-slip. They only have noun options.\naj-hdn_norm_c\t=\t0 to 3\t[x]\nn_sg_ilr \sn_-_pn-gen_le\t=\t5 to 6\t[x]\nv_n3s-bse_ilr \sv_np*_le\t=\t12 to 13\t[x]\nv_vp_will-p_le\t=\t8 to 9\t[x]\nsb-hd_nmc_c\t=\t0 to 16\t[x]\npt_-_comma_le\t=\t7 to 8\t[x]\nhd-aj_scp_c\t=\t8 to 16\t[x]\n 

20008000040@1@1@-1@lmc@29-06-2020 16:53:50@29-06-2020 16:58:51@0 - 'accidents', not 'accident' since talking about road accidents in general. 

20008000030@1@1@-1@lmc@29-06-2020 16:59:37@29-06-2020 17:02:55@

20007000020@1@1@-1@lmc@29-06-2020 17:06:29@29-06-2020 17:08:11@

20007000050@1@0@-1@lmc@29-06-2020 17:08:52@29-06-2020 17:14:21@system rejects\nhdn_bnp-pn_c \snp-hdn_nme-cpd_c\t=\t9 to 14\t[x]\nn_sg_ilr \sn_-_pn-gen_le\t=\t10 to 11\t[x]\nn_sg_ilr \sn_-_pn-gen_le\t=\t11 to 12\t[x]\nn_sg_ilr \sn_-_pn-gen_le\t=\t14 to 15\t[x]\nhdn_bnp-pn_c \snp-hdn_nme-cpd_c\t=\t19 to 22\t[x]\nn_sg_ilr \sn_-_pn-gen_le\t=\t20 to 21\t[x]\nsp-hd_n_c\t=\t6 to 22\t[x]\nnp-hdn_nme-cpd_c\t=\t16 to 18\t[x]\npp-pp_mod_c\t=\t0 to 6\t[x]\nv_n3s-bse_ilr \sv_prd_seq-va_le\t=\t22 to 23\t[x]\nav_-_i-vp_le\t=\t23 to 24\t[x]\nhdn_bnp-pn_c \snp-hdn_nme-cpd_c\t=\t9 to 12\t[x]\nhdn_bnp-pn_c \snp-hdn_nme-cpd_c\t=\t19 to 21\t[x]\naj-hd_int_c\t=\t23 to 26\t[x]\nhdn_bnp-pn_c \snp-hdn_nme-cpd_c\t=\t9 to 11\t[x]\n

20007000060@1@0@-1@lmc@29-06-2020 17:14:21@29-06-2020 17:17:57@0 - 'there are..construction barriers' - the sentence is still grammatical. Adding the 'has worsen the situation'makes it ungrammatical. 'has worsen' should also be have worsened'.system rejects \n\nw_hasinitcap_dlr \sn_-_pr-there-x_le\t=\t0 to 1\t[x]\nnon_third_sg_fin_v_rbst \sv_vp_ssr-have_le_rbst\t=\t16 to 17\t[x]\nhd_xcmp_c \sthird_sg_fin_v_rbst \sv_np_noger_le\t=\t17 to 18\t[x]

20007000220@1@1@-1@lmc@29-06-2020 17:18:22@29-06-2020 17:20:40@

20007000230@1@1@-1@lmc@29-06-2020 17:20:40@29-06-2020 17:24:43@0 - "the anti-slip treatment is a cost and time saving solution to the problem'. Anti-slip is an adjective but system doesn't recognise that. 'saving' in this context is also not a verb but more of an adjective. 

20007000090@1@1@-1@lmc@29-06-2020 17:24:57@29-06-2020 17:26:13@

goodmami commented 4 years ago

One thing that might be an issue is the escaped tabs (\t), as TSDB only has 3 escapes: \s for the field delimiter (@), \n for a newline, and \\ for a backslash.

So that escaped tab should be \\t or just an actual tab character, depending on what was in the source.
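To illustrate why `\t` trips up a TSDB reader, here is a minimal sketch of field decoding that accepts only three escapes (`\s` for `@`, `\n` for a newline, `\\` for a backslash, which is my understanding of the TSDB format). The names `split_record` and `unescape_field` are made up for this example; this is not PyDelphin's actual implementation.

```python
import re

# Assumed TSDB escapes: \s -> @, \n -> newline, \\ -> backslash.
_ESCAPES = {"\\s": "@", "\\n": "\n", "\\\\": "\\"}


def unescape_field(text: str) -> str:
    """Decode one field, rejecting any escape other than the three above."""

    def replace(match: re.Match) -> str:
        sequence = match.group(0)
        if sequence not in _ESCAPES:
            raise ValueError(f"invalid escape sequence: {sequence!r}")
        return _ESCAPES[sequence]

    # A backslash plus the following character (or a dangling backslash at
    # the end) is treated as one unit, so "\\t" is read as an escaped
    # backslash followed by a plain "t", not as a bad "\t" escape.
    return re.sub(r"\\(.|$)", replace, text, flags=re.DOTALL)


def split_record(line: str) -> list[str]:
    """Split a raw record on unescaped '@' and decode every field."""
    return [unescape_field(field) for field in line.rstrip("\n").split("@")]
```

With this sketch, a record whose comment field contains `hd-pct_c\t=` raises `ValueError`, while the same field written with `\\t` or a real tab character decodes without complaint.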

I'm also assuming that all the newlines between these records were inserted by you (e.g., by reading and printing each line in Python without doing line.rstrip("\n")) and aren't actual blank lines in the file.

Does any of that resonate with what you see?

lmorgadodacosta commented 4 years ago

So, in Emacs each item appears on its own line. But when pasted here they collapsed, so I added an extra line between them... This is the true copy-paste:

20021000050@1@1@-1@lmc@30-06-2020 03:31:42@30-06-2020 03:41:28@
20021000100@1@0@-1@lmc@30-06-2020 03:42:00@30-06-2020 03:49:16@
20021000120@1@0@-1@lmc@30-06-2020 03:49:28@30-06-2020 03:53:06@0 - 'view on' - 'on' is redundant; 'login in' should be 'logging in'. hdn_bnp_c \shdn_optcmp_c \sw_hasinitcap_dlr \sn_pl_olr \sn_pp_c-of_le\t=\t0 to 1\t[x]\nsp-hd_n_c\t=\t7 to 10\t[x]\nsp-hd_n_c\t=\t11 to 15\t[x]\naj-hdn_norm_c\t=\t13 to 15\t[x]\nv_j-nb-prp-tr_dlr \sv_prp_olr \sv_np*_le\t=\t13 to 14\t[x]\nnp-np_crd-t_c\t=\t7 to 15\t[x]\nthird_sg_fin_v_rbst \sv_-_le\t=\t16 to 17\t[x]\nn-hdn_cpd_c\t=\t24 to 27\t[x]\nn-hdn_cpd_c\t=\t20 to 22\t[x]\nd_-_poss-their_le\t=\t23 to 24\t[x]\nn_sg_ilr \sn_-_c-gr_le\t=\t25 to 26\t[x]\np_np_i_le\t=\t22 to 23\t[x]\nhd_xcmp_c \sv_n3s-bse_ilr \sv_np_le\t=\t5 to 6\t[x]\ncm_vp_to_le\t=\t4 to 5\t[x]\nv_psp_olr \sv_np_le\t=\t2 to 3\t[x]\nhd_xcmp_c \sp_np_i-reg_le\t=\t17 to 18\t[x]\np_np_i-nm-no-tm_le\t=\t18 to 19\t[x]\nn-hdn_cpd-pl_c\t=\t8 to 10\t[x]\n system rejects
20021000020@1@1@-1@lmc@30-06-2020 03:53:57@30-06-2020 04:03:34@not quite sure how to parse. 'for consumers... particular restaurant'. 
20021000010@1@1@-1@lmc@30-06-2020 05:35:47@30-06-2020 05:39:34@

In addition to this weirdness, in some specific comments I can also see that the system attempted a newline ("\n") between the "real comment" and the rubbish that comes with it (e.g. "adjective. \n" below):

20018000170@1@1@-1@lmc@30-06-2020 16:48:58@30-06-2020 16:49:37@
20015000100@1@0@-1@lmc@01-07-2020 01:15:39@01-07-2020 01:19:07@system rejected - doesn't allow 'serial' to be adjective. \nd_-_sg-nmd_le\t=\t3 to 4\t[x]\ndet_prt-nocmp_dlr \sw_hasinitcap_dlr \sd_-_prt-sg_le\t=\t0 to 1\t[x]\nhd_xcmp_c \sv_3s-fin_olr \sv_np_poss_le\t=\t2 to 3\t[x]\nn_sg_ilr \sn_-_c-nocnh_le\t=\t6 to 7\t[x]\naj_-_i_le\t=\t5 to 6\t[x]\n
20016000080@1@1@-1@lmc@01-07-2020 01:19:30@01-07-2020 01:20:29@

It feels more and more clear that this is an FFTB problem. I've asked the treebanker for details on the browser/machine they are using.

(edit: I changed these to use triple grave characters (```) so newlines appear as they should)

goodmami commented 4 years ago

The \n escapes are not an issue from the TSDB side, although I wonder why the "rubbish" is being included at all (a question for the annotation setup).

I think you're correct that these are an issue with FFTB (or rather, libtsdb). If you have a suggestion for a more helpful error message I can consider that for PyDelphin. Perhaps it is possible to print out exactly which escaped character was unexpected, for instance.
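As a rough idea of what such a message could look like, the sketch below scans a raw record and reports the field index, offset, and text of every escape outside the three assumed valid ones (`\s`, `\n`, `\\`). The function names and the message wording are hypothetical and only meant as a starting point, not a description of what PyDelphin does.

```python
import re

# Assumed set of valid TSDB escapes.
VALID_ESCAPES = {"\\s", "\\n", "\\\\"}


def find_invalid_escapes(record: str) -> list[tuple[int, int, str]]:
    """Return (field_index, offset_in_field, sequence) for each bad escape."""
    problems = []
    for field_index, field in enumerate(record.rstrip("\n").split("@")):
        # finditer consumes a backslash together with the character after
        # it, so an escaped backslash never pairs with the next letter.
        for match in re.finditer(r"\\(.|$)", field, flags=re.DOTALL):
            if match.group(0) not in VALID_ESCAPES:
                problems.append((field_index, match.start(), match.group(0)))
    return problems


def describe_problems(path: str, line_number: int, record: str) -> list[str]:
    """Format one human-readable message per invalid escape."""
    return [
        f"{path}:{line_number}: field {field_index}, offset {offset}: "
        f"unsupported escape {sequence!r}"
        for field_index, offset, sequence in find_invalid_escapes(record)
    ]
```

Calling `describe_problems("tree", 4, line)` on one of the long records above would list every `\t` with its position, which makes it easier to see that a single tool wrote all of them.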

goodmami commented 4 years ago

Woodley has accepted a patch for FFTB so that it does not escape \t or \r. So I'm closing this as it is not an issue with PyDelphin.
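For "tree" files that were already written before the patch, one possible repair is to turn the stray `\t` and `\r` escapes back into real tab and carriage-return characters while leaving the valid escapes alone. This is only a sketch, and it assumes the old output used `\t` and `\r` for actual tab and carriage-return characters in the comment; `repair_field_escapes` is a name invented for this example.

```python
import re

# Escapes that the unpatched FFTB is assumed to have written for real
# tab and carriage-return characters.
_REPAIRS = {"\\t": "\t", "\\r": "\r"}


def repair_field_escapes(text: str) -> str:
    """Replace \\t and \\r escapes with literal characters; keep the rest."""

    def fix(match: re.Match) -> str:
        sequence = match.group(0)
        # Valid escapes (\s, \n, \\) and anything unknown pass through as-is.
        return _REPAIRS.get(sequence, sequence)

    return re.sub(r"\\(.|$)", fix, text, flags=re.DOTALL)
```

Because each backslash is consumed together with the character after it, an escaped backslash followed by `t` is left untouched. It is still worth keeping a backup of the profile before rewriting any file this way.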