Closed aryamanarora closed 1 year ago
15_35 redo sentence
73,727 errors. Summary below. Attached detailed errors in text file. Working to resolve.
Error, Count
the full tag at the end of the line is inconsistent with the rest of the line, 16882
MWE lemma is incorrect, 14874
Single word expression lemma doesn't match token lemma, 14602
invalid lexcat, 13585
SWE token must have lexcat., 13585
Invalid supersense(s) in lexical entry, 107
single-word expression .. has lexcat .. which is incompatible with its upos, 68
lexlemma appears incorrect for smwe, 16
unexpected construal: p.PartPortion ~> p.Characteristic, 4
Token is the beginning of a SMWE, but lexlemma doesn't appear to have multiple tokens in it., 2
unexpected construal: p.Characteristic ~> p.QuantityValue, 1
unexpected construal: p.Gestalt ~> p.Whole, 1
Total 73727
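As a sanity check on the summary, here is a minimal Python sketch that re-adds the per-type counts. The rows are copied from the summary above; the helper name `parse_summary` is hypothetical and not part of the validator. It splits on the last comma because one of the error messages itself contains a comma.

```python
# Rows copied from the error summary; each is "<message>,<count>".
SUMMARY_LINES = [
    "the full tag at the end of the line is inconsistent with the rest of the line,16882",
    "MWE lemma is incorrect,14874",
    "Single word expression lemma doesn't match token lemma,14602",
    "invalid lexcat,13585",
    "SWE token must have lexcat.,13585",
    "Invalid supersense(s) in lexical entry,107",
    "single-word expression .. has lexcat .. which is incompatible with its upos,68",
    "lexlemma appears incorrect for smwe,16",
    "unexpected construal: p.PartPortion ~> p.Characteristic,4",
    "Token is the beginning of a SMWE, but lexlemma doesn't appear to have multiple tokens in it.,2",
    "unexpected construal: p.Characteristic ~> p.QuantityValue,1",
    "unexpected construal: p.Gestalt ~> p.Whole,1",
]


def parse_summary(lines):
    """Map each error message to its count (hypothetical helper)."""
    counts = {}
    for line in lines:
        # Split on the LAST comma only: messages may contain commas.
        message, count = line.rsplit(",", 1)
        counts[message.strip()] = int(count)
    return counts


counts = parse_summary(SUMMARY_LINES)
total = sum(counts.values())
print(total)
```

The per-type counts add up to the reported total of 73,727.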
All errors of type 'Single word expression lemma doesn't match token lemma' are cleared; 45,767 errors remain to be cleaned.
Decisions taken:
Slowly resolving 'Invalid supersense(s) in lexical entry', but there are some general issues to discuss. Roughly 17.5K errors remain to be resolved.
Some pronouns are irregular, and it's not straightforward to extract the adposition from the irregular pronoun token. This includes the irregular genitive, dative, and accusative forms. I could find no theory on whether the irregular genitive pronoun does indeed separate into the oblique pronoun form and the genitive postposition, though there is some theory supporting this split for irregular accusatives and datives. For now, the irregular pronoun (and therefore the PRON lexcat) has been endowed with preposition supersenses for Hindi, so that irregular pronouns across multiple cases can be annotated with SNACS labels.
Some PARTicles (to, bhi, hi, saa) are annotated with the FOCUS label. These may be revisited in the v2.7 guidelines, so for now the PART lexcat (newly created for Hindi) has also been endowed with preposition supersenses. We annotated the particle 'saa' with non-focus supersenses; I'm now wondering whether this was an error, as this particle is not typically in a governor-object construction (it loosely translates to the suffix '-like' and attaches to nouns to make them adjectives).
Some tokens marked SCONJ are given preposition supersenses: exceptions were created for these by marking lexcat = P. The tokens are 'तो','जैसे','ताकि'. तो will be revisited in v2.7 as it is a FOCUS marker; other two should be revisited and supersense label potentially removed.
Some tokens marked ADV are given preposition supersenses; exceptions were created by marking lexcat = P. These tokens are: जैसे,सबसे,जैसे ही,फिर से,पहले,आगे,बाद में
Some tokens marked ADJ are given preposition supersenses as exceptions (lexcat = P). These are: जैसे-जैसे. This is the same जैसे that is marked ADV above, and the ADJ token loosely translates in its context as 'as' (as the time passes, my happiness will increase). [lp_hi_21-81]
Some specific issues to discuss with the error: Invalid supersenses in lexical entry:
sabse pehle [lp_hi_13-85]: We previously extracted and annotated 'se pehle' as the lexlemma, but I'm no longer sure that 'se' can be extracted from 'sabse' as a separate token, since 'sabse' has a specific superlative meaning. I would rethink the adposition as plain 'pehle' and annotate that, leaving 'sabse' unannotated. The alternative is to endow a superlative adjective ('sabse') with preposition supersenses in the validator, which is weird.
ek saath [lp_hi_14-59] : The adposition has been annotated as an MWE expression 'ek saath' which I don't think is correct. I would annotate just the 'saath'.
aisa [lp_hi_14-76]: Has been tagged DET by the UD tagger, which I think is legal; it may also be a pre-determiner ('such'). We may have tagged this with preposition supersenses by mistake. EDIT: An exception just for 'aisa' has been created; its lexcat is assigned P. This should be a temporary measure until a decision to remove all tags for 'aisa' is reached.
'chaaya-sii aakriti' [lp_hi_2-20]: UD has tagged the 'sii' as ADJ; I think it may be a PARTicle following the 'chaaya'. We may also have got the SNACS label wrong (shouldn't be ComparisonRef, maybe Extent?). In general, I am also unsure whether we should be annotating particles with SNACS labels (except maybe the Focus-related ones).
[lp_hi_20-14] - ताकि is SCONJ but annotated with p.`d. Removed the annotation.
MWE lemma is incorrect: some changes made to resolve these are listed:
1) For MWEs, the lexlemma is checked against the word form instead of the lemma. For single-word expressions, it is still checked against the lemma.
2) MWEs where an irregular pronoun is part of the expression that receives a supersense: it's weird to have the irregular pronoun as part of the MWE, but there's no concrete theory for splitting the pronoun into oblique form and postposition for irregular genitives (accusatives and datives don't form MWEs here). Keeping the irregular genitive in the MWE is in line with the earlier decision to endow these irregular genitives with supersenses (for single tokens).
3) Sometimes the MWE tokens are flipped in order, e.g. [lp_hi_8-53], which has बावजूद के instead of के बावजूद. The lexlemma follows this flipped order.
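Points 1 and 3 can be illustrated with a rough Python sketch. This is not the validator's actual code: the `Token` class and the functions `expected_lexlemma` and `lexlemma_ok` are hypothetical names, and the lemma values in the example are assumed for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Token:
    form: str   # surface word form
    lemma: str  # UD lemma (values below are assumed for illustration)


def expected_lexlemma(tokens):
    """Hypothetical helper: the lexlemma the check would expect.

    MWEs (more than one token) are compared against the word forms,
    joined in the order the tokens occur in the sentence.
    Single-word expressions are compared against the token lemma.
    """
    if len(tokens) > 1:
        return " ".join(t.form for t in tokens)
    return tokens[0].lemma


def lexlemma_ok(lexlemma, tokens):
    return lexlemma == expected_lexlemma(tokens)


# Flipped order as in [lp_hi_8-53]: the tokens occur as "बावजूद के".
flipped = [Token(form="बावजूद", lemma="बावजूद"), Token(form="के", lemma="का")]
print(lexlemma_ok("बावजूद के", flipped))   # lexlemma follows the flipped order
print(lexlemma_ok("के बावजूद", flipped))   # canonical order would be rejected

# Single-word expression: still checked against the lemma, not the form.
single = [Token(form="के", lemma="का")]
print(lexlemma_ok("का", single))
```

Because the MWE branch joins forms in sentence order, a flipped MWE needs no special case: its lexlemma simply mirrors the order found in the text.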
Suggest removing the supersense labels on these tokens altogether.
lp_hi_14_75: single-word expression 'ऐसा' has lexcat P, which is incompatible with its upos DET
lp_hi_15_36: single-word expression 'ऐसा' has lexcat P, which is incompatible with its upos DET
lp_hi_26_72: single-word expression 'जैसे' has lexcat P, which is incompatible with its upos ADJ
lp_hi_21_80: single-word expression 'जैसे-जैसे' has lexcat P, which is incompatible with its upos ADJ
lp_hi_13_84: pehle marked with Time but is ADV.
lp_hi_14_33: same as previous (pehle).
Missing supersense annotation in lexical entry: 30 entries to discuss, attached as missing_supersense.txt.
Sentence ids were updated. E.g lp_hi_13-98 in the new version is lp_hi_13_97 in the old one.
To be resolved here in this sheet, 'missing' tab.
lp_hi_10_74: interesting case because there is an implied argument in a relative clause here which is being marked by the explicit postposition 'ke'.
This is not an adposition, it's an alternative spelling of the complementiser कि (borrowed from Persian ke).
Hmm, I don't think so. This can be thought of as:
हर किसी से उसी बात की अपेक्षा रखनी चाहिए जिस (बात) के वह लायक हो
Which is in line with Koul's examples in his grammar book, where the head noun is elided from the relative clause when it follows the main clause. The adposition is marking the implied argument here.
Conllulex file passes validation. Some decisions on irregular pronouns need to be taken (one of the options below):
New Causer label needs a review of these cases
Decisions:
2. Some PARTicles (to, bhi, hi, saa) are annotated with the FOCUS label. These may be revisited in the v2.7 guidelines, so for now the PART lexcat (newly created for Hindi) has also been endowed with preposition supersenses. We annotated the particle 'saa' with non-focus supersenses; I'm now wondering whether this was an error, as this particle is not typically in a governor-object construction (it loosely translates to the suffix '-like' and attaches to nouns to make them adjectives).
Distribution is not quite the same as postpositions, hence the PART tag. Focus is deterministic given lemma, so it's not like we're adding a lot of disambiguation here.
"us-i ki" meaning 'his (emphatic)'. Genitive "ki" gets a supersense. PRON.OBL for "usi", if we don't annotate Focus.
Pros of including Focus:
Cons of including Focus:
Decision:
3. Some tokens marked SCONJ are given preposition supersenses: exceptions were created for these by marking lexcat = P. The tokens are 'तो','जैसे','ताकि'. तो will be revisited in v2.7 as it is a FOCUS marker; other two should be revisited and supersense label potentially removed.
4. Some tokens marked ADV are given preposition supersenses; exceptions were created by marking lexcat = P. These tokens are: जैसे,सबसे,जैसे ही,फिर से,पहले,आगे,बाद में
5. Some tokens marked ADJ are given preposition supersenses as exceptions (lexcat = P). These are: जैसे-जैसे. This is the same जैसे that is marked ADV above, and the ADJ token loosely translates in its context as 'as' (as the time passes, my happiness will increase). [lp_hi_21-81]
For these, add an exception to the validator allowing the lexcat to not match the UPOS
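One way such an exception could look, as a minimal Python sketch. This is not the validator's actual code. The table name `LEXCAT_UPOS_EXCEPTIONS`, the function `lexcat_compatible`, and the default mapping of lexcat P to UPOS ADP are all assumptions made for illustration; the exception entries are the tokens listed in points 3 to 5.

```python
# Assumed default: lexcat P is normally only compatible with UPOS ADP.
DEFAULT_UPOS_FOR_LEXCAT = {"P": {"ADP"}}

# Hypothetical exception table keyed by (lexcat, upos); values are the
# lexlemmas from points 3-5 that may carry lexcat P despite their UPOS.
LEXCAT_UPOS_EXCEPTIONS = {
    ("P", "SCONJ"): {"तो", "जैसे", "ताकि"},
    ("P", "ADV"): {"जैसे", "सबसे", "जैसे ही", "फिर से", "पहले", "आगे", "बाद में"},
    ("P", "ADJ"): {"जैसे-जैसे"},
}


def lexcat_compatible(lexlemma, lexcat, upos):
    """Return True if the lexcat/UPOS pair is allowed for this lexlemma."""
    # Regular case: the UPOS is one of the defaults for this lexcat.
    if upos in DEFAULT_UPOS_FOR_LEXCAT.get(lexcat, set()):
        return True
    # Otherwise only explicitly whitelisted lexlemmas pass.
    return lexlemma in LEXCAT_UPOS_EXCEPTIONS.get((lexcat, upos), set())


print(lexcat_compatible("जैसे-जैसे", "P", "ADJ"))  # whitelisted exception
print(lexcat_compatible("पहले", "P", "ADV"))       # whitelisted exception
print(lexcat_compatible("पहले", "P", "ADJ"))       # not listed for ADJ
```

Keying the exceptions by lexlemma rather than relaxing the rule for a whole UPOS keeps the check strict for every other token, so new mismatches would still be reported.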
"sab-se" 'than-all/everyone', used to convey superlative meaning: treat as weak MWE
Decisions incorporated in https://github.com/aryamanarora/carmls-hi/pull/34 and merged with the master. Pending treatment of sab-se as weak MWE. Pending updating guidelines with vala discussion from #29
All Force/Causer fixes incorporated. Data passes the validator (it already was before fixes too), so seems like this is the finalised version of the corpus and can be ingested into Xposition.
1_9_5: "छह साल का" Characteristic~Characteristic