clarinsi / jos2ud


Convert 'biti' to AUX or VERB #5

Closed kajad closed 5 years ago

kajad commented 5 years ago

The JOS auxiliary verb 'biti' (to be) converts to either AUX or VERB in UD, depending on its syntactic role. This rule should be moved from the convert_dependencies.py script to the morphology conversion script and extended to cover tokens without any syntactic analysis as well (i.e. the entire ssj500k corpus). The rule can be summarized as follows:

1. For biti-Va tokens with a dependency annotation (ssj250k):

'biti-Va' as a copula:

(Note that for the rather complex situations in b. and c., a list of concrete tokens meeting this rule in ssj250k can also be provided.)

2. For tokens without any dependency annotation:

When @TomazErjavec implements this rule in his morphology script, @kajad removes it from the dependency script.

TomazErjavec commented 5 years ago

This rule is too complicated to be incorporated into the current conversion, as the mapping rules don't allow for such dependencies, and I can hardly imagine extending them so much, especially as the extension would cover only one word. So, I will write a special script just for "biti". However, to do this it would be very helpful if @kajad changed the script that inserts UD dependencies into ssj500k so that it first produces a single file, as suggested in #1. I could then run that script first and use its output as a reference against which to compare the output of my script, which would help me catch errors that are otherwise hard to spot. So, waiting until #1 is closed.
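The comparison described above could be sketched roughly as follows. This is only an illustration, not the script used in the repository: the function names biti_tags and compare_biti are hypothetical, and the sketch assumes standard CoNLL-U input (tab-separated columns ID, FORM, LEMMA, UPOS, ... and "# sent_id = ..." comment lines).

```python
def biti_tags(conllu_text, lemma="biti"):
    """Map (sent_id, token_id) -> UPOS for every token with the given lemma.

    Assumes standard CoNLL-U: tab-separated columns ID, FORM, LEMMA, UPOS, ...
    and a '# sent_id = ...' comment before each sentence.
    """
    tags = {}
    sent_id = ""  # stays empty if the file has no sent_id comments
    for line in conllu_text.splitlines():
        if line.startswith("# sent_id"):
            sent_id = line.split("=", 1)[1].strip()
        elif line and not line.startswith("#"):
            cols = line.split("\t")
            if len(cols) >= 4 and cols[2] == lemma:
                tags[(sent_id, cols[0])] = cols[3]
    return tags


def compare_biti(reference_text, candidate_text):
    """List (key, reference_upos, candidate_upos) wherever the two files disagree.

    A tag of None means the token is missing from that file.
    """
    ref = biti_tags(reference_text)
    cand = biti_tags(candidate_text)
    return [
        (key, ref.get(key), cand.get(key))
        for key in sorted(set(ref) | set(cand))
        if ref.get(key) != cand.get(key)
    ]
```

An empty result from compare_biti would mean that the two files assign the same AUX/VERB tag to every 'biti' token they share, and that neither file contains 'biti' tokens the other lacks.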

kajad commented 5 years ago

A single treebank file is already produced by the current conversion script. It is named "release-all_{}_{}.conllu".format(treebank_name, version_name). See the latest output in /home/kaja/ud/syntax/UDv2.2, i.e.

release-all_ssj500kv1_6.ud_33.conllu = sl_ssj-ud-train + sl_ssj-ud-dev + sl_ssj-ud-test
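As a small illustration of the naming pattern and of what "train + dev + test" amounts to, here is a minimal sketch. The helper names release_name and concatenate are hypothetical, and the argument values "ssj500kv1_6.ud" and "33" are inferred from the file name above rather than taken from the script itself.

```python
def release_name(treebank_name, version_name):
    """Build the single-file name using the format string quoted in the thread."""
    return "release-all_{}_{}.conllu".format(treebank_name, version_name)


def concatenate(parts):
    """Join CoNLL-U chunks so that each one ends with exactly one blank line.

    Empty chunks are skipped. How the actual script orders or joins the
    splits is an assumption here.
    """
    return "".join(part.strip("\n") + "\n\n" for part in parts if part.strip())


print(release_name("ssj500kv1_6.ud", "33"))
```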

kajad commented 5 years ago

For your convenience, I have also uploaded the convert_dependencies_v2_v33_no-data-split.py script that:

(i) does not do the data split; and (ii) has the single released treebank file named differently, i.e. sl_ssj-ud_v{}.conllu

You run it in the same way (py script morpho_output version_name), e.g. py convert_dependencies_v2_v33_no-data-split.py ssj500kv1_6.ud.tbl 1

to get the sl_ssj-ud_v1.conllu treebank.

TomazErjavec commented 5 years ago

What I did:

These changes are implemented in 35783fd.

TomazErjavec commented 5 years ago

Currently the output of add-biti-syn.pl is compared to UD/output_ssj500k-en.ud.syn_2.2.conllu. However, the final output is probably sl_ssj-ud_v2.2.conllu, which has fewer sentences. If this is so, we need a script (which could already be bin/convert_dependencies.py) that removes the sentences for which the UD parse cannot be made automatically. Once this is settled and @kajad removes her "biti" UD category rules, we can close this issue.
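One way to restrict a larger file to the sentences that survive into the final treebank is to filter by sentence ID. The sketch below is an assumption about how this could be done, not a description of what bin/convert_dependencies.py does: the helper names are hypothetical, and it assumes that both files are CoNLL-U with sentences separated by blank lines and identified by "# sent_id = ..." comments.

```python
def split_sentences(conllu_text):
    """Split CoNLL-U text into sentence blocks (separated by blank lines)."""
    return [block for block in conllu_text.strip().split("\n\n") if block.strip()]


def sent_id(block):
    """Return the value of the '# sent_id = ...' comment, or None if absent."""
    for line in block.splitlines():
        if line.startswith("# sent_id"):
            return line.split("=", 1)[1].strip()
    return None


def keep_released(full_text, released_text):
    """Keep only those sentences of full_text whose sent_id occurs in released_text."""
    released = {sent_id(block) for block in split_sentences(released_text)}
    released.discard(None)  # never match sentences that lack an ID
    kept = [b for b in split_sentences(full_text) if sent_id(b) in released]
    return "".join(block + "\n\n" for block in kept)
```

Filtering the larger file this way before the comparison would leave both sides with the same set of sentences, so any remaining differences concern the tags rather than missing sentences.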

kajad commented 5 years ago

The necessary changes have now been implemented in convert_dependencies.py. For the list of changes, see issue https://github.com/clarinsi/jos2ud/issues/8#issue-401665521. Closing.