UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

On the treatment of some sentences from poems #810

Closed by mehmetoguzderin 2 years ago

mehmetoguzderin commented 3 years ago

In suffixing agglutinative languages, at least in Turkic languages, poets frequently add sounds at the end of some lines to make the poem more memorable or harmonious. Although I think interjections are an acceptable (?) way to tag the "letters" of these sounds, including the preceding lines would be a better way to capture these phrases. I would like to ask: would it fit the UD guidelines to combine such sections into one sentence using the parataxis relation? Given the brevity of individual lines, the combination often amounts to the length of a single sentence. The preceding lines might be repeated throughout the poem, so a hashing reducer might discard them as duplicates if they are treated as distinct sentences, which eliminates the possibility of capturing the connection through document sequence.
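To make the deduplication concern concrete, here is a minimal sketch of a hashing reducer. The function name `dedupe_by_hash`, the choice of SHA-1, and the placeholder lines are all assumptions for illustration; no actual pipeline is described in this thread.

```python
import hashlib


def dedupe_by_hash(sentences):
    """Keep the first occurrence of each sentence, keyed by a hash of its text."""
    seen = set()
    kept = []
    for text in sentences:
        key = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(text)
    return kept


# Placeholder poem lines (hypothetical): "refrain" stands in for a repeated line.
lines = ["refrain", "line A", "refrain", "line B"]

# One sentence per line: the second "refrain" hashes identically and is dropped.
per_line = dedupe_by_hash(lines)

# Each repeated line joined with the line that follows it: the pairs differ,
# so both copies of the repeated line survive.
paired = [f"{first}; {second}" for first, second in zip(lines[0::2], lines[1::2])]
per_pair = dedupe_by_hash(paired)
```

With one sentence per line, the repeated line is kept only once; with the lines paired, each combined unit is distinct and nothing is lost. This only illustrates the data-processing motivation, not whether parataxis is the right syntactic analysis.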

Note: This post first appeared on the mailing list at https://cl.lingfil.uu.se/pipermail/ud/2021-August/000695.html.

Stormur commented 3 years ago

Do you have a practical example for this phenomenon? It might help to better frame it.

mehmetoguzderin commented 3 years ago

@Stormur an excerpt from the famous poem of Yunus Emre:

Işkun aldı benden beni
Bana seni gerek seni

These two lines constitute two separate sentences. The critical catch here is that the first "-i" attached to "sen" in the second sentence is not an accusative marker but rather an intensifier, as the line has recently been shown to be related to Ahmet Yesevi's phrase "meŋe sen ok kereksen" (notice the lack of the last "-i" in the original). So here, I would suggest combining the two sentences into one, like "Işkın aldı benden beni; bana seni gerek seni", if that fits the guidelines, since it captures a piece of extra information. There are other such instances, but I cannot recall them off the top of my head; I will write them down if I come across any.

Stormur commented 3 years ago

I admit that the original problem is not entirely clear to me. If the two pieces are separate sentences, I think it is better to keep them separate instead of artificially conflating them with a "non-relation" like parataxis. How do you intend to mark this "extra information"? It seems to go beyond morphological or syntactic analysis; it is more a kind of textual cohesion with external references. What I mean is that a syntactic measure like taking both sentences together does not help: the analysis you propose would have to be expressed in some other way anyway.

With regard to the specific annotation of such "metrical fillers", the part of speech INTJ is in my opinion a very bad choice. Since these small elements are linguistically null, I think the best UPOS would be just X. Then, I don't know whether something could be done with a morphological feature, like Metric=Yes.
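As a rough illustration of that suggestion, the sketch below builds one hypothetical CoNLL-U token line with UPOS X and Metric=Yes in the FEATS column, then reads it back. Metric=Yes is only the idea floated above, not a feature taken from the UD guidelines; the form FILLER, the token ID, and the helper `parse_feats` are placeholders invented for this example.

```python
def parse_feats(feats):
    """Parse a CoNLL-U FEATS string such as 'A=1|B=2' into a dict ('_' means empty)."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


# Hypothetical token line with the 10 tab-separated CoNLL-U columns:
# ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC.
# HEAD and DEPREL are left unspecified ("_") because the thread does not
# settle how such an element would be attached.
token_line = "\t".join(["3", "FILLER", "_", "X", "_", "Metric=Yes", "_", "_", "_", "_"])

columns = token_line.split("\t")
upos = columns[3]
feats = parse_feats(columns[5])

# A consumer could then pick out metrical fillers without relying on INTJ.
is_metrical_filler = upos == "X" and feats.get("Metric") == "Yes"
```

Keeping the information in FEATS means the part-of-speech tag makes no claim about an emotional or discourse function, which is the objection raised against INTJ.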

mehmetoguzderin commented 3 years ago

These sentences have no explicit sentence segmentation in the original (and thus no analog to a semicolon), so combining them stands as a possibility; that is the essence of the issue. Though I would like to ask: why would INTJ be very bad? Sometimes the added sound can be a whole syllable on its own, deriving from older exclamatory expressions (see Yenisei texts), so by the time of authoring it might have been more than just a stretch of the word, and AFAIK, other languages consistently tag that kind of token as INTJ.

mehmetoguzderin commented 3 years ago

(BTW there is a very loose causality between these two sentences; if that goes to zero in any instance, I would avoid parataxis too)

Stormur commented 3 years ago

OK, I probably got a little carried away in my previous comment. From your description, these pieces looked like mere metrical additions, devoid of any morphological, syntactic, semantic, or pragmatic role whatsoever. Such a thing does not belong to any part of speech, because that would mean it has some properties at one of the aforementioned levels, and the UPOS for such "residual" elements is X, notwithstanding some specification in some other field.

The problem with INTJ is that, on one side, it implies a "true" syntactic relation, which is usually discourse, but that does not really seem to be the case here. A secondary consideration is that an interjection is almost by definition disconnected from the rest of the sentence: it is usually more of a loose association, and if it becomes more than that (e.g. through grammaticalisation), then I think it would need a different syntactic (or morphological) representation. In fact, you use the term intensifier and speak of derivation: this goes more in the direction of a phenomenon like those discussed about emphasis in #741. On the other side, in my opinion the INTJ label should be limited to expressions of a "non-lexical" nature, like haha for laughter or bang for a loud noise. Other interjectional expressions (like damn) have instead a clear interpretation, and what can vary is their role (e.g. discourse instead of verbal head).

To sum up, the elements we are considering here look too integrated into their context (even if not from a truly linguistic point of view) and too little independent to be treated as INTJ, even if they could somehow be shown to derive from INTJ-like particles. One point is, for example, that the addition of an i to sen is not so arbitrary: it follows from vowel harmony. Could we admit a ... Bana sen, ah!, gerek seni as equivalent? The fact that, as you too point out, this -i appears as an "extension" of another word rather than as an independent element would make me annotate it by means of morpholexical features only, with no need for a dedicated syntactic representation.

OK, of course these are general considerations from a very partial knowledge of the phenomenon. But there is the possibility that inside this broader context ("sound additions") there might be the need to differentiate different constructions!

mehmetoguzderin commented 3 years ago

Oops, my bad! The first "-i" would rather be tagged as a particle, if it were treated consistently with the particle "Uk" of its ancestor sentence (much like the question particle "Mu", which also always follows tongue root harmony and is always tagged as a particle token), and attached to the preceding element as advmod. The question is rather about the final "-i", where semantically I think it is not too far off to read it a bit like "Bana seni gerek sen, ah!" (or, mixing centuries to highlight the research finding about the meaning of the sentence, "Bana sen uk gerek sen, ah!").

mehmetoguzderin commented 3 years ago

(so the question is about tagging the final -i as INTJ, not the first -i)

Stormur commented 3 years ago

OK, my arguments stay the same!

By the way, as already stated, I don't think semantics is the right level of analysis here. Be it an INTJ or not, there is no meaning attached to this element. Also, I would not consider INTJs as having "meanings".

Apart from this, the problem remains whether an -i extension and an adjoined ah are, and behave, the same syntactically.

mehmetoguzderin commented 3 years ago

There, my preference would remain INTJ for better cross-lingual analysis: these are the most productive "emotional" expressions in older texts (going by the definitions, where UD says "typically expresses an emotional reaction, is not syntactically related to other accompanying expressions, and may include a combination of sounds not otherwise found in the language" and Wiki says "a word or expression that occurs as an utterance on its own and expresses a spontaneous feeling or reaction"), and older texts would treat these as "Oh! Alas! O!" (see the translations of Yenisei texts, as mentioned before).

The sentence under inspection is not the easiest one to read in terms of its intentions (I mean, it took a few centuries before the proper function of the first -i was highlighted), so more examples will help frame the issue better and show how this annotation leads to a construction that is true to the meaning and allows better cross-lingual study (besides the original question on the sentence segmentation of texts whose layout has no sentence segmentation markers). I will bring related texts, but at a later time, to avoid clashes with other processes. Meanwhile, thanks for your feedback!

Stormur commented 3 years ago

You're welcome!

Just a last note: beware that the Wiki page is clearly not defining a part-of-speech class in the sense that we use here, but rather a broader, less defined category of linguistic elements used in given circumstances.

coltekin commented 3 years ago

[Joining the party rather late]

If I understand the above proposal correctly, the -i suffixes in the second verse of the above example are to be treated as syntactic words (segmented) and assigned the INTJ POS tag.

I agree that -i here has a completely different meaning/function compared to the usual Case=Acc (or Number[psor]=Sing,Person[psor]=3). However, I think this is still a morphological process, and it would be better specified by a (language-specific) morphological feature.

Although I am familiar with the poem above, I do not know the (historical) function of the suffix well enough to suggest a morphological feature-value pair. However, from the description above, Emphatic=Yes may be an option. It seems to be used in some languages for a similar function.
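The difference between the two approaches can be sketched by comparing token lists for the verse "Bana seni gerek seni". Everything below is an illustrative assumption rather than an agreed analysis: the `Token` class is invented for this example, the PART/INTJ split follows the clarification given earlier in the thread, the UPOS of gerek is deliberately left unspecified, and which tokens would carry Emphatic=Yes is a guess.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    form: str
    upos: str = "_"  # "_" means left unspecified in this sketch
    feats: dict = field(default_factory=dict)


# Segmenting analysis: each -i becomes its own syntactic word
# (first -i as a particle, final -i as an interjection).
segmented = [
    Token("Bana", "PRON"),
    Token("sen", "PRON"),
    Token("i", "PART"),
    Token("gerek"),
    Token("sen", "PRON"),
    Token("i", "INTJ"),
]

# Feature-based analysis: surface words stay whole and the special -i
# is recorded as a (language-specific) morphological feature instead.
unsegmented = [
    Token("Bana", "PRON"),
    Token("seni", "PRON", {"Emphatic": "Yes"}),
    Token("gerek"),
    Token("seni", "PRON", {"Emphatic": "Yes"}),
]


def emphatic_forms(tokens):
    """Return the forms of tokens marked Emphatic=Yes."""
    return [t.form for t in tokens if t.feats.get("Emphatic") == "Yes"]
```

In the feature-based version the token count matches the surface words, and no dependency relation has to be chosen for the suffix; the cost is that the distinction from Case=Acc lives entirely in FEATS.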

On a related note, the more modern usage of -i as in

Sen-i gidi fındık kıran

may also be related to this usage of the suffix.

Finally, I'd analyze the example verses above as separate sentences in UD. They are (naturally) related, but I do not think there is a syntactic relation between the two verses.

mehmetoguzderin commented 3 years ago

@coltekin Thanks a lot for the input! I hadn't considered the second example before, but it also seems to me to follow in the same evolution, as you pointed out. (As for bringing more cases on my end, the process I mentioned in https://github.com/UniversalDependencies/docs/issues/810#issuecomment-916279246 continues, sorry about that...)

mehmetoguzderin commented 2 years ago

An upcoming publication settles the choice here in favor of avoiding such parataxis; thanks a lot for the input, @Stormur and @coltekin and everybody! Therefore, I am closing this issue; other aspects might be better discussed in their own space after the publication has been presented and its content released.