UniversalDependencies / UD_Swedish-Talbanken

Swedish data

Misannotations in Swedish-Talbanken #8

Closed by AleksandrsBerdicevskis 9 months ago

AleksandrsBerdicevskis commented 3 years ago

There are four cases where "då" 'then' has UPOS = SCONJ and XPOS = AB ('adverb'). In all of them, UPOS should be "ADV".
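For anyone who wants to reproduce the search, here is a minimal sketch. It assumes the standard ten-column, tab-separated CoNLL-U layout and that the XPOS column may carry extra features after a `|`, so only the part before the first `|` is compared. The function name `find_tokens` and the sample sentence are made up for illustration and are not taken from Talbanken.

```python
def find_tokens(conllu_text, form, upos, xpos):
    """Yield (sent_id, token_id) for tokens matching form, UPOS and XPOS."""
    sent_id = None
    for line in conllu_text.splitlines():
        if line.startswith("# sent_id"):
            sent_id = line.split("=", 1)[1].strip()
            continue
        cols = line.split("\t")
        # CoNLL-U columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        if len(cols) != 10:
            continue  # comment or blank line
        # Assumption: XPOS may look like "AB|POS", so compare the first part only
        if (cols[1].lower() == form and cols[3] == upos
                and cols[4].split("|")[0] == xpos):
            yield sent_id, cols[0]


# Hypothetical two-token sample, not a real Talbanken sentence
sample = "\n".join([
    "# sent_id = example-1",
    "1\tDå\tdå\tSCONJ\tAB\t_\t2\tmark\t_\t_",
    "2\tkom\tkomma\tVERB\tVB|PRT|AKT\t_\t0\troot\t_\t_",
])
hits = list(find_tokens(sample, "då", "SCONJ", "AB"))
print(hits)
```

Running the same function over the train, dev and test files should list the cases in question, provided the assumption about the XPOS column holds.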

jnivre commented 3 years ago

Thanks, Sasha. Will fix.

AleksandrsBerdicevskis commented 3 years ago

"reda" (as in "ta/få/ha/hålla reda på") is sometimes tagged as NOUN and sometimes as ADV; I cannot see any principle behind it.

(Is it OK that I am posting different bugs in the same issue? There are probably more to come.)

jnivre commented 3 years ago

It is fine to keep all of this in one issue (especially as long as it is concerned with part-of-speech tags). Are you only looking at Talbanken, or will you also consider LinES and PUD?

AleksandrsBerdicevskis commented 3 years ago

Talbanken only. Basically, we at Språkbanken want to be able to convert our POS(+MSD) annotation (SUC-style) to UD. I know you already did the conversion once, but looking at your mamba2ud.py I see that: a) it relies on mamba annotation to some extent; b) it is tailored to Talbanken, i.e. uses many ad hoc solutions; c) some solutions are clearly outdated and have been fixed later. So I decided it would be more efficient to create a new converter and started doing that, using the UD guidelines and Talbanken for reference. If I notice any more inconsistencies, I can report them here. Does that make sense?

jnivre commented 3 years ago

It makes perfect sense.

AleksandrsBerdicevskis commented 3 years ago

Two things about adverbs: 1) There are 81 cases where an ADV has an incoming amod relation, which should not be the case, if I understand the guidelines correctly. In most of these cases, amod should be changed to advmod, but in some of them, the POS tag should be changed from ADV to ADJ (and the XPOS has to be changed, too, but I don't know if you care about that). Example of the latter: sentence sv-ud-test-1098, token 34 (strängt).

2) Adverbs that are homonymous with neuter adjectives (samtidigt) are typically (but not always) lemmatized in the same way as adjectives, i.e. as samtidig. But does that make sense? If they are treated as adverbs (pace SAG), they should be lemmatized with -t (samtidig cannot be used as an adverb). An alternative solution would be to treat them as adjectives in the adverbial function; then the lemmatization is reasonable.
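The check behind point 1 can be sketched as follows. This is only an illustration, assuming standard ten-column CoNLL-U input; the function name `adv_with_amod` and the sample rows are invented, and deciding between "change amod to advmod" and "change ADV to ADJ" still needs manual inspection of each hit.

```python
def adv_with_amod(conllu_text):
    """Return (sent_id, token_id, form) for every ADV attached via amod."""
    hits = []
    sent_id = None
    for line in conllu_text.splitlines():
        if line.startswith("# sent_id"):
            sent_id = line.split("=", 1)[1].strip()
            continue
        cols = line.split("\t")
        # Skip comments, blank lines, multiword-token ranges and empty nodes
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        # Column 4 (index 3) is UPOS, column 8 (index 7) is DEPREL
        if cols[3] == "ADV" and cols[7] == "amod":
            hits.append((sent_id, cols[0], cols[1]))
    return hits


# Hypothetical sample, not a real Talbanken sentence
sample = "\n".join([
    "# sent_id = example-2",
    "1\tmycket\tmycket\tADV\tAB\t_\t2\tamod\t_\t_",
    "2\tvatten\tvatten\tNOUN\tNN\t_\t0\troot\t_\t_",
])
suspicious = adv_with_amod(sample)
print(suspicious)
```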

jnivre commented 3 years ago

Re 2: The lemmatization has been done automatically (using an early version of the Språkbanken pipeline) and has only been partially validated manually. So my guess is that these are all (?) cases where the disambiguation has gone wrong. Do you at Språkbanken generally assign the lemma "samtidigt" (rather than "samtidig") to the adverb "samtidigt"? If yes, then we should do the same in the treebank. Presumably most of this could be fixed automatically by searching for words tagged ADV that end in "-t" and have a lemma without the "-t". Or could anything go wrong here?

AleksandrsBerdicevskis commented 3 years ago

No, we are actually also lemmatizing it as "samtidig"! This is not very consistent, since in SUC, which is supposed to be gold standard, the lemmatization is "samtidigt". I suspect this is an error rather than conscious policy, but I can check. I think your rule should generally work fine, but a) it should not just add -t, but rather use the word as lemma (because of words that end in -d etc.); b) it will not work with comparatives and superlatives (which are also inconsistent, cf. tidigare-tidigare, vidare-vitt, djupare-djupt, but usually have lemmas with -t).
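To make the proposed rule concrete, here is a rough sketch with both caveats built in. It assumes that comparatives and superlatives are marked with `Degree=Cmp` or `Degree=Sup` in the FEATS column (I have not checked how consistently that holds in Talbanken), and the function name `propose_lemma` is hypothetical.

```python
def propose_lemma(form, lemma, upos, feats):
    """Return the proposed lemma, or the existing one if the rule does not apply."""
    if upos != "ADV":
        return lemma
    # Caveat (b): comparatives and superlatives need separate handling
    if "Degree=Cmp" in feats or "Degree=Sup" in feats:
        return lemma
    word = form.lower()
    # Caveat (a): use the word form itself instead of appending "t" to the
    # lemma, so that e.g. a lemma ending in -d is not turned into "...dt"
    if word.endswith("t") and not lemma.endswith("t"):
        return word
    return lemma


# Illustrative inputs, not taken from the treebank
print(propose_lemma("samtidigt", "samtidig", "ADV", "_"))
print(propose_lemma("hårt", "hård", "ADV", "_"))
print(propose_lemma("snabbast", "snabb", "ADV", "Degree=Sup"))
```

Since the rule is a heuristic, it would probably be safer to write the proposed changes to a list for manual review than to apply them directly.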

LarsAhrenberg commented 3 years ago

I follow this discussion with interest as Swedish_Lines is not consistent either with respect to lemmas for adverbs. My current preference is for t-forms as lemmas as there are some words where the meaning of the adverb is somewhat different from that of the corresponding adjective, for example 'riktig(t), väldig(t)'. I have noticed a difference between Sparv and Efselab wrt lemmatization of these words: Efselab uses t-forms, Sparv does not.

A related question concerns the comparatives and superlatives of adverbs, as in 'Hon sprang längre än i går'. I guess the lemma should then be långt.

AleksandrsBerdicevskis commented 3 years ago

I think that's because Sparv's choice of lemma is constrained by Saldo, and Saldo does not list t-adverbs as separate entries (following SAG).

In my opinion, if samtidigt is considered an adverb (as in SUC), the lemma should be samtidigt. If it is considered a neuter adjective (as in Saldo and SAG), the lemma should be samtidig. Both decisions make sense, but POS=adverb, lemma=samtidig seems inconsistent. Comparatives and superlatives should have the same lemma as positives, I think.

Thanks for your replies! It's useful to become aware of these discrepancies across (and within) the Swedish resources.

AleksandrsBerdicevskis commented 3 years ago

som, när och with XPOS=HA (relative adverb) sometimes have UPOS=SCONJ (which seems correct to me) and sometimes ADV (which does not). I cannot see any principle in the distribution; they all seem to be subjunctions to me (or "relative adverbs" in SUC's terminology). It also affects syntactic relations (mark vs advmod).
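The distribution can be tabulated per word form with something like the sketch below. It assumes standard ten-column CoNLL-U input and compares only the first `|`-separated part of the XPOS column; the function name `upos_deprel_by_form` is made up, and the sample rows are invented to exercise the function (they do not form a well-formed tree).

```python
from collections import Counter


def upos_deprel_by_form(conllu_text, xpos):
    """Count (form, UPOS, DEPREL) combinations for tokens with the given XPOS."""
    counts = Counter()
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        # Skip comments, blank lines, multiword-token ranges and empty nodes
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        if cols[4].split("|")[0] == xpos:
            counts[(cols[1].lower(), cols[3], cols[7])] += 1
    return counts


# Invented rows for illustration only
sample = "\n".join([
    "1\tnär\tnär\tSCONJ\tHA\t_\t3\tmark\t_\t_",
    "2\tnär\tnär\tADV\tHA\t_\t3\tadvmod\t_\t_",
    "3\tNär\tnär\tSCONJ\tHA\t_\t0\tmark\t_\t_",
])
table = upos_deprel_by_form(sample, "HA")
for key, n in table.most_common():
    print(key, n)
```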

jnivre commented 3 years ago

I suspect that ADV + advmod is used when it is analyzed as introducing a relative clause, rather than an adverbial clause, for example, in "dagen då hon kom" (as opposed to a case like "det regnade, då hon kom"). It is debated (not only for Swedish) whether these are really relative clauses, and there may still be inconsistencies, but I think this is the intended distinction.

jnivre commented 3 years ago

This is analogous to the use of locative adverbs introducing relative clauses, for example, in "huset där hon bodde", where there is no corresponding adverbial clause use.

AleksandrsBerdicevskis commented 3 years ago

Ah, thanks, I see now. I think there are some inconsistencies (especially wrt när), but it's difficult to tell without a thorough comparison.

gregarshinov commented 3 years ago

I have a question concerning sent_id = sv-ud-test-188 in Talbanken. The 26th token (sammansatt) is annotated weirdly. Its lemma is "sätta_samman" instead of "sammansatt". Is there any explanation for this annotation?

Also, I do not understand why in sent_id = sv-ud-test-262, token 3 (första en ADJ) is lemmatized as an article and tagged as an adjective, and not as an ordinal numeral. Could somebody please clarify this?

dan-zeman commented 10 months ago

I follow this discussion with interest as Swedish_Lines is not consistent either with respect to lemmas for adverbs.

Since this issue is prevailingly about (possible) bugs in Talbanken, I am going to move it from the docs issue tracker to the UD_Swedish-Talbanken issue tracker. If there are any open questions about Swedish-specific annotation guidelines, feel free to open a new issue here (in docs) for each such question.

jnivre commented 9 months ago

@gregarshinov First of all, sorry for the extremely slow response!

Regarding sent_id = sv-ud-test-188: The lemmatization in Swedish Talbanken has been performed automatically, and we have not been able to manually validate everything yet. When it comes to "sammansatt", it is the past participle of a particle verb that in the infinitive may be realized either as "sammansätta" (as a so-called fixed compound) or as "sätta samman" (as a so-called loose compound). The standard in the dictionary underlying the lemmatiser is to use the loose compound as the lemma for all forms of such particle verbs. Hence, "sätta_samman" is the correct lemma for "sammansatt". However, the lemmatiser is not perfect, and I have noticed that the form "sammansatta" (the plural form corresponding to "sammansatt") is lemmatised to "sammansatt", which is inconsistent. When we have time to do a full manual validation, such inconsistencies should of course be fixed.

Regarding sent_id = sv-ud-test-262: The word "första" (first) is not lemmatised as the article "en" ("a(n)"), but as the numeral "en" ("one"). Since these are homographs, there should probably be some way of distinguishing the lemmas, but the lemmatiser does not do this. The tag ADJ and the deprel amod are correct for ordinal numbers according to the guidelines. (Only cardinal numbers get NUM and nummod.)

I hope this clarifies the situation. Apologies again for the lateness of the reply.

jnivre commented 9 months ago

All of the problems reported on this page have now been fixed, except the lemmatisation of adverbs ending in -t, where there is no consensus yet (and a change would require extensive manual work).

rueter commented 9 months ago

@jnivre @LarsAhrenberg

Regarding sent_id = sv-ud-test-262: The word "första" (first) is not lemmatised as the article "en" ("a(n)"), but as the numeral "en" ("one"). Since these are homographs, there should probably be some way of distinguishing the lemmas, but the lemmatiser does not do this. The tag ADJ and the deprel amod is correct for ordinal numbers according to the guidelines. (Only cardinal numbers get NUM and nummod.)

This caught my eye as an instance where the neuter cardinal form "ett" might provide the desired distinction, since counting without heads goes: "ett, två, tre..." Just a suggestion; I understand the amount of manual work involved.

jnivre commented 9 months ago

@rueter Thanks! The only problem is that we are using an external lemmatiser based on the SALDO lexicon, and we want to stay compatible with that if possible.