aryamanarora / carmls-hi

Hindi SNACS (Semantic Network of Adposition and Case Supersenses; Schneider et al., 2018) annotation scheme and guidelines.

Validator issues #31

Closed aryamanarora closed 1 year ago

aryamanarora commented 2 years ago

1_9_5: "छह साल का" Characteristic~Characteristic

aryamanarora commented 2 years ago

15_35 redo sentence

nitinvwaran commented 2 years ago

73,727 errors. Summary below. Attached detailed errors in text file. Working to resolve.

| Error | Count |
| --- | --- |
| the full tag at the end of the line is inconsistent with the rest of the line | 16882 |
| MWE lemma is incorrect | 14874 |
| Single word expression lemma doesn't match token lemma | 14602 |
| invalid lexcat | 13585 |
| SWE token must have lexcat. | 13585 |
| Invalid supersense(s) in lexical entry | 107 |
| single-word expression .. has lexcat .. which is incompatible with its upos | 68 |
| lexlemma appears incorrect for smwe | 16 |
| unexpected construal: p.PartPortion ~> p.Characteristic | 4 |
| Token is the beginning of a SMWE, but lexlemma doesn't appear to have multiple tokens in it. | 2 |
| unexpected construal: p.Characteristic ~> p.QuantityValue | 1 |
| unexpected construal: p.Gestalt ~> p.Whole | 1 |

Total 73727
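As a quick arithmetic check, the per-category counts in the summary can be summed to confirm the reported total. A minimal Python sketch, with the counts copied from the summary (the dictionary keys are shortened labels chosen here, not the validator's exact messages):

```python
# Per-category error counts copied from the summary in this comment.
# Keys are abbreviated labels chosen for this sketch.
error_counts = {
    "full tag inconsistent with rest of line": 16882,
    "MWE lemma incorrect": 14874,
    "SWE lemma doesn't match token lemma": 14602,
    "invalid lexcat": 13585,
    "SWE token must have lexcat": 13585,
    "invalid supersense(s) in lexical entry": 107,
    "SWE lexcat incompatible with upos": 68,
    "lexlemma appears incorrect for smwe": 16,
    "unexpected construal PartPortion ~> Characteristic": 4,
    "SMWE start but lexlemma has one token": 2,
    "unexpected construal Characteristic ~> QuantityValue": 1,
    "unexpected construal Gestalt ~> Whole": 1,
}

# Sum all categories and compare against the reported total.
total = sum(error_counts.values())
print(total)  # 73727
```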

error.txt

nitinvwaran commented 2 years ago

All errors of type 'Single word expression lemma doesn't match token lemma' are cleared. 45,767 errors remain to be cleaned.

Decisions taken:

  1. Irregular pronouns (for all adpositions) were put in the LEXLEMMA column and receive the SNACS labels.
  2. Regular pronouns where the adposition is obvious were split into two tokens, the second being the adposition. The adposition gets the SNACS label.

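The two decisions above can be illustrated with a rough Python sketch. The postposition list, the set of irregular forms, and the function name `split_pronoun` are all assumptions made for illustration; they are incomplete and are not taken from the actual preprocessing code.

```python
# Illustrative, incomplete lists (assumptions for this sketch only).
POSTPOSITIONS = ("को", "से", "का", "की", "के", "में", "पर", "ने")
IRREGULAR_PRONOUNS = {"मेरा", "मुझे", "उसे", "इसे", "उन्हें", "हमारा", "तुम्हारा"}


def split_pronoun(token):
    """Return [token] for irregular forms, else [oblique stem, postposition]."""
    # Irregular forms stay whole; the pronoun itself carries the SNACS label.
    # This check must come first: e.g. 'उसे' happens to end in the string 'से'.
    if token in IRREGULAR_PRONOUNS:
        return [token]
    # Regular forms: peel off a known postposition, longest candidate first.
    for pp in sorted(POSTPOSITIONS, key=len, reverse=True):
        if token.endswith(pp) and len(token) > len(pp):
            return [token[: -len(pp)], pp]
    return [token]


print(split_pronoun("उसको"))
print(split_pronoun("उसे"))
```

In the first call the postposition is split off and would receive the SNACS label; in the second the irregular form is kept as a single token.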
nitinvwaran commented 2 years ago

Slowly resolving 'Invalid supersense(s) in lexical entry', but there are some general issues to discuss. Roughly 17.5K errors remain to be resolved.

  1. Some pronouns are irregular, and it is not straightforward to extract the adposition from the irregular pronoun token. This includes irregular genitive, dative, and accusative forms. I could find no theory on whether the irregular genitive pronoun does indeed separate into the oblique pronoun form and the genitive postposition, though there is some theory supporting this split for irregular accusatives and datives. What has been done is to endow the irregular pronoun (and therefore the PRON lexcat) with preposition supersenses for Hindi, so that irregular pronouns across multiple cases can be annotated with SNACS labels.

  2. Some PARTicles (to, bhi, hi, saa) are annotated with the FOCUS label. These may be revisited in the v2.7 guidelines, so for now the PART lexcat (newly created for Hindi) has also been endowed with preposition supersenses. We annotated the particle 'saa' with non-focus supersenses; I'm now wondering if this was an error, as this particle is not typically in a governor-object construction (it loosely translates to the suffix -like and attaches to nouns to make them adjectives).

  3. Some tokens marked SCONJ are given preposition supersenses: exceptions were created for these by marking lexcat = P. The tokens are 'तो','जैसे','ताकि'. तो will be revisited in v2.7 as it is a FOCUS marker; other two should be revisited and supersense label potentially removed.

  4. Some tokens marked ADV are given preposition supersenses; exceptions were created by marking lexcat = P. These tokens are: जैसे,सबसे,जैसे ही,फिर से,पहले,आगे,बाद में

  5. Some tokens marked ADJ are given preposition supersenses as exceptions (lexcat = P). These are: जैसे-जैसे. This is the same जैसे that is marked ADV above, and the ADJ token loosely translates in its context as 'as' (as the time passes, my happiness will increase). [lp_hi_21-81]

Some specific issues to discuss with the error: Invalid supersenses in lexical entry:

  1. sabse pehle [lp_hi_13-85]: Previously extracted and annotated 'se pehle' as the lexlemma, but I'm not sure now that the 'se' can be extracted from 'sabse' as a separate token, since 'sabse' has a specific superlative meaning. I would rethink the adposition as just plain 'pehle' and annotate that, leaving 'sabse' unannotated. The alternative is to endow a superlative adjective ('sabse') with preposition supersenses in the validator, which is weird.

  2. ek saath [lp_hi_14-59]: The adposition has been annotated as the MWE 'ek saath', which I don't think is correct. I would annotate just the 'saath'.

  3. aisa [lp_hi_14-76]: Has been tagged DET by the UD tagger, which I think is legal; it may also be a pre-determiner 'such'. We may have tagged this with preposition supersenses by mistake. EDIT: An exception just for 'aisa' has been created; its lexcat is assigned P. This should be a temporary measure until a decision to remove all tags for 'aisa' is reached.

  4. 'chaaya-sii aakriti' [lp_hi_2-20]: UD has tagged the 'sii' as ADJ; I think it may be a PARTicle following the 'chaaya'. We may also have got the SNACS label wrong (shouldn't be ComparisonRef, maybe Extent?). In general, also unsure whether we should be annotating particles with SNACS labels (except maybe the Focus-related ones).

  5. [lp_hi_20-14] - ताकि is SCONJ but annotated with p.`d. Removed the annotation.

nitinvwaran commented 2 years ago

MWE lemma is incorrect: some changes made to resolve these are listed:

  1. For MWEs, the lexlemma is checked against the word form instead of the lemma. For single-word expressions, it is still checked against the lemma.
  2. MWEs where the irregular pronoun is part of the expression that receives a supersense: it's weird to have the irregular pronoun as part of the MWE, but there's no concrete theory for splitting the pronoun into oblique form and postposition for irregular genitives (accusatives and datives don't form MWEs here). Keeping the irregular genitive in the MWE is in line with the earlier decision to endow these irregular genitives with supersenses (for single tokens).
  3. Sometimes the MWE tokens are flipped in order, e.g. [lp_hi_8-53], which has बावजूद के instead of के बावजूद. The LEXLEMMA follows this flipped order.
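The lexlemma check described in this comment might look roughly like the following Python sketch. The token representation (dicts with `form` and `lemma` keys) and the function names are assumptions for illustration, not the validator's actual code.

```python
def expected_lexlemma(tokens):
    """tokens: list of {'form': ..., 'lemma': ...} dicts in surface order.

    MWEs are compared against the space-joined word forms;
    single-word expressions are compared against the token lemma.
    """
    if len(tokens) > 1:
        return " ".join(t["form"] for t in tokens)
    return tokens[0]["lemma"]


def lexlemma_ok(lexlemma, tokens):
    # True when the annotated lexlemma matches the expected value.
    return lexlemma == expected_lexlemma(tokens)


# Flipped-order MWE: the lexlemma follows the surface order.
flipped = [{"form": "बावजूद", "lemma": "बावजूद"}, {"form": "के", "lemma": "का"}]
print(lexlemma_ok("बावजूद के", flipped))

# Single-word expression: the inflected form 'की' is checked via its lemma 'का'.
single = [{"form": "की", "lemma": "का"}]
print(lexlemma_ok("का", single))
```

Because the MWE branch joins forms in surface order, a flipped expression validates only when its LEXLEMMA is flipped too.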

nitinvwaran commented 2 years ago

Suggest removing the supersense labels on these tokens altogether.

lp_hi_14_75: single-word expression 'ऐसा' has lexcat P, which is incompatible with its upos DET
lp_hi_15_36: single-word expression 'ऐसा' has lexcat P, which is incompatible with its upos DET
lp_hi_26_72: single-word expression 'जैसे' has lexcat P, which is incompatible with its upos ADJ
lp_hi_21_80: single-word expression 'जैसे-जैसे' has lexcat P, which is incompatible with its upos ADJ

lp_hi_13_84: pehle marked with Time but is ADV.
lp_hi_14_33: same as previous (pehle)

nitinvwaran commented 2 years ago

Missing supersense annotation in lexical entry: 30 entries to discuss, attached. missing_supersense.txt

Sentence ids were updated. E.g., lp_hi_13-98 in the new version is lp_hi_13_97 in the old one.
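Going by the single example given here, the id change can be sketched as below. It is an assumption that every id follows the same pattern (last underscore becomes a hyphen, final number increases by one); the function name `old_to_new_id` is hypothetical.

```python
import re


def old_to_new_id(old_id):
    """Map an old-style id such as 'lp_hi_13_97' to the new style 'lp_hi_13-98'.

    Assumes the final number shifts by one and the last separator
    changes from '_' to '-', as in the example in this comment.
    """
    match = re.fullmatch(r"(.+)_(\d+)", old_id)
    if match is None:
        raise ValueError(f"unrecognised id format: {old_id!r}")
    prefix, number = match.groups()
    return f"{prefix}-{int(number) + 1}"


print(old_to_new_id("lp_hi_13_97"))  # lp_hi_13-98
```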

To be resolved here in this sheet, 'missing' tab.

aryamanarora commented 2 years ago

lp_hi_10_74: interesting case because there is an implied argument in a relative clause here which is being marked by the explicit postposition 'ke'.

This is not an adposition, it's an alternative spelling of the complementiser कि (borrowed from Persian ke).

nitinvwaran commented 2 years ago

This is not an adposition, it's an alternative spelling of the complementiser कि (borrowed from Persian ke).

Hmm, I don't think so. This can be thought of as:

हर किसी से उसी बात की अपेक्षा रखनी चाहिए जिस (बात) के वह लायक हो

Which is in line with Koul's examples in his grammar book, where the head noun is elided from the relative clause when it follows the main clause. The adposition is marking the implied argument here.

nitinvwaran commented 1 year ago

Conllulex file passes validation. Some decisions on irregular pronouns need to be taken (one of the options below):

New Causer label needs a review of these cases

nschneid commented 1 year ago

Decisions:

nschneid commented 1 year ago

2. Some PARTicles (to, bhi, hi, saa) are annotated with the FOCUS label. These may be revisited in v2.7 guidelines, so for now the PART lexcat (newly created for Hindi) has also been endowed with preposition supersenses. The particle 'saa' was annotated by us with non-focus supersenses, i'm now wondering if this was an error as this particle is not typically in a governor-object construction (it loosely translates to the suffix -like and attaches to nouns to make them adjectives).

Distribution is not quite the same as postpositions, hence the PART tag. Focus is deterministic given lemma, so it's not like we're adding a lot of disambiguation here.

"us-i ki" meaning 'his (emphatic)'. Genitive "ki" gets a supersense. PRON.OBL for "usi", if we don't annotate Focus.

Pros of including Focus:

Cons of including Focus:

Decision:

nschneid commented 1 year ago

3. Some tokens marked SCONJ are given preposition supersenses: exceptions were created for these by marking lexcat = P. The tokens are 'तो','जैसे','ताकि'. तो will be revisited in v2.7 as it is a FOCUS marker; other two should be revisited and supersense label potentially removed.

4. Some tokens marked ADV are given preposition supersenses; exceptions were created by marking lexcat = P. These tokens are: जैसे,सबसे,जैसे ही,फिर से,पहले,आगे,बाद में

5. Some tokens marked ADJ are given preposition supersenses as exceptions (lexcat = P). These are: जैसे-जैसे. This is the same जैसे that is marked ADV above, and the ADJ token loosely translates in its context as 'as' (as the time passes, my happiness will increase). [lp_hi_21-81]

For these, add an exception to the validator allowing the lexcat to not match the UPOS
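One way such an exception could be expressed is sketched below in Python. The default compatibility table, the exception set keyed by (UPOS, lexlemma), and the function name `lexcat_allowed` are assumptions for illustration; the real validator's compatibility rules are richer than this.

```python
# Default UPOS tags accepted for each lexcat (heavily simplified assumption).
DEFAULT_UPOS_FOR_LEXCAT = {"P": {"ADP"}}

# Hindi-specific exceptions from items 3-5 above: (UPOS, lexlemma) pairs
# that may carry lexcat P even though the UPOS is not ADP.
HINDI_P_EXCEPTIONS = {
    ("SCONJ", "तो"), ("SCONJ", "जैसे"), ("SCONJ", "ताकि"),
    ("ADV", "जैसे"), ("ADV", "सबसे"), ("ADV", "जैसे ही"), ("ADV", "फिर से"),
    ("ADV", "पहले"), ("ADV", "आगे"), ("ADV", "बाद में"),
    ("ADJ", "जैसे-जैसे"),
}


def lexcat_allowed(lexlemma, upos, lexcat):
    """Return True if this lexcat/UPOS pairing should pass validation."""
    # Normal case: the UPOS is in the default set for this lexcat.
    if upos in DEFAULT_UPOS_FOR_LEXCAT.get(lexcat, set()):
        return True
    # Otherwise only the listed Hindi exceptions may take lexcat P.
    return lexcat == "P" and (upos, lexlemma) in HINDI_P_EXCEPTIONS


print(lexcat_allowed("ताकि", "SCONJ", "P"))
print(lexcat_allowed("ऐसा", "DET", "P"))
```

Keying the exceptions on both UPOS and lexlemma keeps the relaxation narrow, so other SCONJ, ADV, or ADJ tokens with lexcat P would still be flagged.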

nschneid commented 1 year ago

"sab-se" 'than-all/everyone', used to convey superlative meaning: treat as weak MWE

nitinvwaran commented 1 year ago

Decisions incorporated in https://github.com/aryamanarora/carmls-hi/pull/34 and merged with the master. Pending treatment of sab-se as weak MWE. Pending updating guidelines with vala discussion from #29

aryamanarora commented 1 year ago

All Force/Causer fixes incorporated. The data passes the validator (it already did before the fixes too), so this seems to be the finalised version of the corpus, and it can be ingested into Xposition.