UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

Old Irish separated nasal particles #927

Closed AdeDoyle closed 1 year ago

AdeDoyle commented 1 year ago

I'm having trouble selecting an appropriate deprel for a feature of Old Irish, common in manuscripts, whereby a nasal ("m" or "n") stands alone as a particle. These particles have no semantic meaning of their own; however, they occur in a number of grammatical situations (to mark case, in relative clauses, etc.).

Usually these are attached to the anlaut of the following word, as in isdered betho inso "this is the end of the world". This is fine, because the nasal forms part of a single token, as it does with the word betho in the example here. The problem is that these nasals can also be found separated from the following word by spacing and even punctuation, as in laa brátha / lae .m. bratho "day of doom". In these cases they are tagged as PART, because they still lack any semantic meaning. However, they must now have a syntactic relation to the following word.

Initial consonant mutations like these are a common feature of the Insular Celtic languages. The modern Celtic languages have settled orthographies, in which this kind of mutation never stands alone as it can in Old Irish. I've defined amod:mutation in the language-specific documentation to deal with these instances; however, it does not really modify the following word adjectivally. As such, I think a new dependency relation should be considered for Old Irish: mutation or icm (initial consonant mutation).

dan-zeman commented 1 year ago

If the mutation nasal always occurs at this position and it is only the unsettled orthography that sometimes inserts a space or a punctuation symbol, then this might be the situation where a word with space (or multitoken word, see also here) could exceptionally be permitted.

AdeDoyle commented 1 year ago

You mean a token that includes a space, "m betha". That might be the best solution.

dan-zeman commented 1 year ago

> You mean a token that includes a space, "m betha". That might be the best solution.

Yes, that's what I mean.

AdeDoyle commented 1 year ago

Ok, I've made the changes to the treebank, created a words_w_spaces.sga file, and the treebank is passing validation. I'm happy with this solution if you are.

Incidentally, the validator still seems to report that no .../words_w_spaces.sga file exists if a word with spacing throws an error for any reason, even though a file is present. My regex couldn't match a special character in one of my examples, and it told me there was no file. The file was there, it just needed to be altered to account for the special character.
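
One way a special character can trip up such a pattern, as a minimal sketch: it assumes the exceptions file holds regular expressions that must match the whole form (an assumption, not something confirmed in this thread). The form and patterns below are hypothetical and not taken from the treebank.

```python
import re

# Hypothetical form containing punctuation and a space (illustration only).
form = ".m. bratho"

# Written naively, each "." is a wildcard, so the pattern is looser than intended.
naive = re.compile(".m. bratho")

# re.escape turns every special character into a literal.
strict = re.compile(re.escape(form))

naive_matches_other = naive.fullmatch("xmy bratho") is not None
strict_matches_other = strict.fullmatch("xmy bratho") is not None
strict_matches_form = strict.fullmatch(form) is not None

print(naive_matches_other, strict_matches_other, strict_matches_form)
```

Escaping only the literal parts of a pattern keeps the rest of the expression free to use character classes or alternation where they are actually wanted.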

dan-zeman commented 1 year ago

> Ok, I've made the changes to the treebank, created a words_w_spaces.sga file, and the treebank is passing validation. I'm happy with this solution if you are.

Not sure I'd describe myself as happy when it comes to words with spaces :-) but here we are, the mechanism exists in UD, and this seems like a situation where it is actually justified.

> Incidentally, the validator still seems to report that no .../words_w_spaces.sga file exists if a word with spacing throws an error for any reason, even though a file is present. My regex couldn't match a special character in one of my examples, and it told me there was no file. The file was there, it just needed to be altered to account for the special character.

The name of the file is data/tokens_w_space.sga. The validator definitely should not complain about words_w_spaces.sga. And with the correct name, it should not complain when the file exists, as it tests for its existence right before complaining. I inserted a space at a random position in your data and ran the validator. It reported the wrong space but not a missing file, so I couldn't reproduce the behavior you describe:

```
python tools\validate.py --lang sga UD_Old_Irish-DipSGG\sga_dipsgg-ud-test.conllu
[Line 24 Sent 2]: [L4 Format invalid-word-with-space] 'seich etar' in column FORM is not on the list of exceptions allowed to contain whitespace (data/tokens_w_space.LANG files).
[Line 24 Sent 2]: [L2 Metadata text-form-mismatch] Mismatch between the text attribute and the FORM field. Form[11] is 'seich etar' but text is 'seichetar cid acomroicniu...'
[Line 29 Sent 2]: [L2 Metadata text-extra-chars] Extra characters at the end of the text attribute, not accounted for in the FORM fields: 'seichetar cid acomroicniu'
Format errors: 1
Metadata errors: 2
*** FAILED *** with 3 errors
```
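
For illustration, the kind of check behind the first error can be sketched roughly as follows. This is not the code of validate.py: the helper names, the one-regex-per-line layout, and the example pattern are all assumptions made for the sketch.

```python
import re

def compile_exceptions(lines):
    """Compile one regex per non-empty line (assumed layout of the exceptions file)."""
    stripped = (line.strip() for line in lines)
    return [re.compile(s) for s in stripped if s]

def space_allowed(form, patterns):
    """Return True if FORM has no whitespace or fully matches a listed exception."""
    if not re.search(r"\s", form):
        return True
    return any(p.fullmatch(form) for p in patterns)

# Hypothetical exception: a lone nasal, one space, then the host word.
patterns = compile_exceptions(["", r"[mn] \w+"])

ok_particle = space_allowed("m betha", patterns)   # separated nasal particle
ok_plain = space_allowed("betha", patterns)        # no whitespace at all
bad_split = space_allowed("seich etar", patterns)  # the space inserted at random

print(ok_particle, ok_plain, bad_split)
```

Under these assumptions a form like 'seich etar' is rejected simply because no listed pattern matches it in full, which would be a separate condition from the exceptions file being absent.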