UniversalDependencies / UD_English-GUM

Inconsistent annotations for LS entries #74

Open rhdunn opened 11 months ago

rhdunn commented 11 months ago
  1. List item markers are annotated with the X UPOS. EWT favours NUM for these, and https://universaldependencies.org/u/pos/X.html states that X should be used restrictively.
  2. These should have the NumType=Ord feature (as they specify an ordered list of items).
  3. The 1. etc variants should have the NumForm=Digit feature.
  4. The a) etc variants should have a NumForm feature, but no suitable form currently exists for these; maybe NumForm=Alpha (alphabetic -- "Examples: a, b, c, α, β, γ").
  5. EWT tokenizes the ( and ) as separate tokens.
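
As a rough illustration of points 1 to 4, the sketch below maps a list-marker form to the UPOS and features proposed above. The helper name and the regular expressions are hypothetical, and `NumForm=Alpha` is only the value suggested in this issue, not an existing UD feature value. It assumes the token has already been identified as a list marker, since a bare `8` would otherwise match any number.

```python
import re

# Hypothetical helper illustrating points 1-4; not part of any UD tool.
# Matches forms such as 1, 1., 1), (1), 2.1 and 2.1.
DIGIT_MARKER = re.compile(r"^\(?\d+(\.\d+)*[.)]?$")
# Matches forms such as a), (a) and a.
ALPHA_MARKER = re.compile(r"^\(?[A-Za-z][.)]$")


def proposed_annotation(form):
    """Return (upos, feats) for a list-marker form, or None if it does not look like one."""
    if DIGIT_MARKER.match(form):
        return "NUM", "NumForm=Digit|NumType=Ord"
    if ALPHA_MARKER.match(form):
        # NumForm=Alpha is an assumption taken from point 4, not a documented value.
        return "NUM", "NumForm=Alpha|NumType=Ord"
    return None


for form in ["1.", "2.1.", "(a)", "b)", "basil"]:
    print(form, proposed_annotation(form))
```

Point 5 (splitting `(` and `)` into separate tokens, as EWT does) is not covered here; under that tokenization the patterns would only need to match the bare number or letter.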

Validation issues:

ERROR: Sentence GUM_textbook_governments-1 token 1 -- invalid X form '1.1'
ERROR: Sentence GUM_academic_eegimaa-1 token 1 -- invalid X form '2.'
ERROR: Sentence GUM_textbook_chemistry-1 token 1 -- invalid X form '2.1'
ERROR: Sentence GUM_textbook_chemistry-13 token 1 -- invalid X form '1.'
ERROR: Sentence GUM_textbook_chemistry-15 token 1 -- invalid X form '2.'
ERROR: Sentence GUM_textbook_chemistry-20 token 1 -- invalid X form '3.'
ERROR: Sentence GUM_textbook_chemistry-21 token 1 -- invalid X form '4.'
ERROR: Sentence GUM_textbook_chemistry-26 token 1 -- invalid X form '5.'
ERROR: Sentence GUM_academic_census-1 token 1 -- invalid X form '1'
ERROR: Sentence GUM_academic_economics-1 token 1 -- invalid X form '2.'
ERROR: Sentence GUM_academic_economics-2 token 1 -- invalid X form '2.1.'
ERROR: Sentence GUM_academic_economics-35 token 1 -- invalid X form '2.2.'
ERROR: Sentence GUM_academic_epistemic-23 token 30 -- invalid X form '8'
ERROR: Sentence GUM_academic_implicature-1 token 1 -- invalid X form '4.'
ERROR: Sentence GUM_academic_implicature-7 token 1 -- invalid X form '4.1.'
ERROR: Sentence GUM_academic_implicature-30 token 1 -- invalid X form '5.'
ERROR: Sentence GUM_academic_lighting-13 token 1 -- invalid X form '1.'
ERROR: Sentence GUM_academic_mutation-8 token 1 -- invalid X form '1.'
ERROR: Sentence GUM_academic_mutation-17 token 1 -- invalid X form '2.'
ERROR: Sentence GUM_academic_mutation-45 token 1 -- invalid X form '3.'
ERROR: Sentence GUM_academic_replication-12 token 14 -- invalid X form '(a)'
ERROR: Sentence GUM_academic_replication-12 token 25 -- invalid X form '(b)'
ERROR: Sentence GUM_academic_replication-12 token 36 -- invalid X form '(c)'
ERROR: Sentence GUM_academic_replication-20 token 11 -- invalid X form '(a)'
ERROR: Sentence GUM_academic_replication-20 token 18 -- invalid X form '(b)'
ERROR: Sentence GUM_academic_replication-20 token 25 -- invalid X form '(c)'
ERROR: Sentence GUM_academic_replication-20 token 35 -- invalid X form '(d)'
ERROR: Sentence GUM_academic_salinity-1 token 1 -- invalid X form '1.'
ERROR: Sentence GUM_bio_nida-33 token 1 -- invalid X form '1.'
ERROR: Sentence GUM_bio_nida-34 token 1 -- invalid X form '2.'
ERROR: Sentence GUM_bio_nida-35 token 1 -- invalid X form '3.'
ERROR: Sentence GUM_interview_herrick-48 token 3 -- invalid X form '1)'
ERROR: Sentence GUM_interview_herrick-48 token 20 -- invalid X form '2)'
ERROR: Sentence GUM_news_defector-35 token 19 -- invalid X form 'a)'
ERROR: Sentence GUM_news_defector-35 token 35 -- invalid X form 'b)'
ERROR: Sentence GUM_textbook_artwork-9 token 1 -- invalid X form '38.'
ERROR: Sentence GUM_textbook_artwork-27 token 1 -- invalid X form '39.'
ERROR: Sentence GUM_textbook_artwork-29 token 1 -- invalid X form '40.'
ERROR: Sentence GUM_textbook_artwork-31 token 1 -- invalid X form '41.'
ERROR: Sentence GUM_textbook_grit-1 token 1 -- invalid X form '2.2'
ERROR: Sentence GUM_textbook_history-1 token 1 -- invalid X form '1'
ERROR: Sentence GUM_textbook_history-2 token 1 -- invalid X form '1.1'
ERROR: Sentence GUM_textbook_history-73 token 1 -- invalid X form '1.'
ERROR: Sentence GUM_textbook_history-78 token 1 -- invalid X form '2.'
ERROR: Sentence GUM_textbook_spacetime-1 token 1 -- invalid X form '24.2'
ERROR: Sentence GUM_textbook_stats-1 token 1 -- invalid X form '2.3'
ERROR: Sentence GUM_voyage_isfahan-25 token 1 -- invalid X form '1'
ERROR: Sentence GUM_voyage_isfahan-31 token 1 -- invalid X form '2'
ERROR: Sentence GUM_voyage_isfahan-39 token 1 -- invalid X form '3'
ERROR: Sentence GUM_voyage_isfahan-43 token 1 -- invalid X form '4'
ERROR: Sentence GUM_voyage_isfahan-48 token 1 -- invalid X form '5'
ERROR: Sentence GUM_voyage_isfahan-50 token 1 -- invalid X form '6'
ERROR: Sentence GUM_voyage_isfahan-58 token 1 -- invalid X form '7'
ERROR: Sentence GUM_voyage_isfahan-64 token 1 -- invalid X form '8'
ERROR: Sentence GUM_voyage_isfahan-67 token 1 -- invalid X form '9'
ERROR: Sentence GUM_whow_basil-7 token 1 -- invalid X form '1'
ERROR: Sentence GUM_whow_basil-16 token 1 -- invalid X form '2'
ERROR: Sentence GUM_whow_basil-20 token 1 -- invalid X form '3'
ERROR: Sentence GUM_whow_basil-24 token 1 -- invalid X form '4'
ERROR: Sentence GUM_whow_basil-30 token 1 -- invalid X form '5'
ERROR: Sentence GUM_whow_basil-35 token 1 -- invalid X form '1'
ERROR: Sentence GUM_whow_basil-44 token 1 -- invalid X form '2'
ERROR: Sentence GUM_whow_basil-47 token 1 -- invalid X form '3'
ERROR: Sentence GUM_whow_basil-52 token 1 -- invalid X form '4'
ERROR: Sentence GUM_whow_basil-58 token 1 -- invalid X form '1'
ERROR: Sentence GUM_whow_basil-66 token 1 -- invalid X form '2'
ERROR: Sentence GUM_whow_basil-68 token 1 -- invalid X form '3'
ERROR: Sentence GUM_whow_basil-72 token 1 -- invalid X form '4'
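
To see which documents and marker shapes account for most of these errors, the lines above can be tallied with a short script. This is a minimal sketch that assumes the error lines have exactly the format quoted in this issue; the function name `summarize` and the sample list are illustrative only.

```python
import re
from collections import Counter

# Pattern for the error lines as they are quoted in this issue.
ERROR_LINE = re.compile(
    r"^ERROR: Sentence (?P<sent>\S+) token (?P<tok>\d+) -- invalid X form '(?P<form>[^']*)'$"
)


def summarize(lines):
    """Count errors per document and per flagged form."""
    per_doc = Counter()
    forms = Counter()
    for line in lines:
        m = ERROR_LINE.match(line.strip())
        if m is None:
            continue  # skip anything that is not an error line
        # The document name is the sentence ID without its trailing "-N".
        doc = m.group("sent").rsplit("-", 1)[0]
        per_doc[doc] += 1
        forms[m.group("form")] += 1
    return per_doc, forms


sample = [
    "ERROR: Sentence GUM_textbook_governments-1 token 1 -- invalid X form '1.1'",
    "ERROR: Sentence GUM_academic_replication-12 token 14 -- invalid X form '(a)'",
    "ERROR: Sentence GUM_academic_replication-12 token 25 -- invalid X form '(b)'",
]
per_doc, forms = summarize(sample)
print(per_doc.most_common())
print(forms.most_common())
```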

amir-zeldes commented 11 months ago

I could see using Ord for the numerical ones, but until we sort out what we're doing about LS I will leave this open. I anticipate this will stay as-is for v2.13.

AngledLuffa commented 7 months ago

Ping regarding this ( and @nschneid) ... one of the more frequent errors from the CoreNLP constituency -> dependency converter arises because it wants to make the dependency "num" but the UPOS "X". If we come up with a standard and apply it to the EWT & GUM treebanks, I can implement that in the converter pretty easily.

AngledLuffa commented 7 months ago

https://github.com/UniversalDependencies/docs/issues/717

nschneid commented 7 months ago

Yeah, we need a standard. It's under discussion in the core group.