UniversalDependencies / UD_English-EWT

English data
Creative Commons Attribution Share Alike 4.0 International

ADJ tokens that should be reconsidered #153

Open nschneid opened 3 years ago

nschneid commented 3 years ago

As a preface to trying to identify adjectives in proper names, I did a quick scan of lemmas appearing as both ADJ and PROPN in the corpus. The following ones look questionable:

Currently ADJ, should be NOUN

Typically in attributive position, so amod should be replaced with compound:

lab, meal (worm), metal, phoenix, salary

Borderline:

Difficulties with compounds and particles

post/X election/ADJ (see also #152 on separated affixes)

up/ADJ: these look suspicious

spot/ADJ on/ADJ (amod)

dead/ADJ on/ADJ (amod)

run/ADJ down/ADP (compound:prt)

good/ADJ looking/ADJ (amod: implied by https://universaldependencies.org/u/pos/ADJ.html, but not sure why it’s not compound)

leave it on/ADJ

Lemma issues

b, w in B&W: should be lemmatized as black, white with Abbr=Yes
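A scan like the one described at the top of this post can be sketched as follows. This is a minimal illustration, not the script actually used: it assumes plain CoNLL-U input (ten tab-separated columns, LEMMA in column 3, UPOS in column 4) and lowercases lemmas before comparing. The function name and the sample rows are made up for the example.

```python
from collections import defaultdict

def lemmas_with_both_tags(conllu_text, tag_a="ADJ", tag_b="PROPN"):
    """Return lowercased lemmas that occur with both UPOS tags."""
    tags_by_lemma = defaultdict(set)
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        lemma, upos = cols[2].lower(), cols[3]
        tags_by_lemma[lemma].add(upos)
    return sorted(l for l, tags in tags_by_lemma.items() if {tag_a, tag_b} <= tags)

# Made-up sample rows, for illustration only (not taken from EWT).
SAMPLE = "\n".join([
    "# text = metal band",
    "1\tmetal\tmetal\tADJ\tJJ\t_\t2\tamod\t_\t_",
    "2\tband\tband\tNOUN\tNN\t_\t0\troot\t_\t_",
    "",
    "# text = Metal Works",
    "1\tMetal\tMetal\tPROPN\tNNP\t_\t2\tcompound\t_\t_",
    "2\tWorks\tWorks\tPROPN\tNNPS\t_\t0\troot\t_\t_",
])

print(lemmas_with_both_tags(SAMPLE))  # ['metal']
```

The output is only a candidate list; each lemma still needs manual inspection, as with the questionable cases listed above.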

amir-zeldes commented 3 years ago

RE: identifying amod in names, I ran a BERT-based classifier on GUM deprels, and I get 93.65% accuracy from text, or 94.17% with gold xpos as a feature. F-score on the amod class is 93.59, or 96.19 with xpos. However, looking more closely at only the NNP(S) cases for the best classifier, the amod scores are:

metric score
prec 0.815789
rec 0.704545
f1 0.756098

This is for confusion of amod with anything, not just compound, and conversely for confusion of anything with amod.
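As a quick consistency check on the table above, F1 is the harmonic mean of precision and recall. The sketch below only plugs in the two reported values and introduces no new results.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported above for amod on NNP(S) tokens.
f1 = f1_score(0.815789, 0.704545)
print(round(f1, 4))  # 0.7561
```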

I'm happy to post an auto-fixed version of EWT using this classifier, but I'm not sure whether we'd consider this an improvement (at worst, if it predicts non-amod, we'd just leave it as compound, and if it predicts amod it quite possibly is) or a step backwards, because things would then be inconsistent. I can also output prediction probabilities if that would be helpful (e.g. to review only the less certain cases).
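The review idea could look roughly like the sketch below. Everything in it is hypothetical: the `Prediction` record, the 0.9 cutoff, and the sample tokens are assumptions for illustration and do not describe the classifier's actual output format.

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    form: str         # token text
    gold_deprel: str  # relation currently in the treebank
    pred_deprel: str  # relation predicted by the classifier
    prob: float       # classifier probability for pred_deprel

def triage(predictions, threshold=0.9):
    """Split predicted compound -> amod changes into auto-apply and manual-review lists."""
    auto, review = [], []
    for p in predictions:
        if p.gold_deprel != "compound" or p.pred_deprel != "amod":
            continue  # non-amod predictions leave the existing label untouched
        (auto if p.prob >= threshold else review).append(p)
    return auto, review

# Invented examples, not real classifier output.
preds = [
    Prediction("Central", "compound", "amod", 0.97),
    Prediction("Phoenix", "compound", "compound", 0.97),
    Prediction("General", "compound", "amod", 0.62),
]
auto, review = triage(preds)
print([p.form for p in auto], [p.form for p in review])
```

The threshold would need tuning against a hand-checked sample before anything is applied automatically.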

nschneid commented 3 years ago

I think this classifier could be helpful in combination with lexical heuristics (e.g. whether the word ever appears as an ADJ in the corpus outside a name, or as an adjective in WordNet).

Looking at current lemmas, I see an intersection of 270 types between PROPN and ADJ, corresponding to 911/16885 ≈ 5.4% of PROPN tokens. In the version of GUM I have, 409/8700 ≈ 4.7% of PROPNs have amod. So maybe not too bad of a heuristic?
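A minimal sketch of this lemma-overlap heuristic, assuming tokens are available as (lemma, UPOS) pairs and that lemmas are compared case-insensitively. The helper name and toy data are invented for illustration; the real counts are the ones quoted above.

```python
def propn_adj_overlap(tokens):
    """tokens: iterable of (lemma, upos) pairs.

    Returns (overlapping lemma types, matching PROPN tokens, all PROPN tokens).
    """
    tokens = list(tokens)
    adj_lemmas = {lemma.lower() for lemma, upos in tokens if upos == "ADJ"}
    propn = [lemma.lower() for lemma, upos in tokens if upos == "PROPN"]
    hits = [lemma for lemma in propn if lemma in adj_lemmas]
    return set(hits), len(hits), len(propn)

# Toy data, for illustration only.
toy = [("new", "ADJ"), ("New", "PROPN"), ("York", "PROPN"),
       ("central", "ADJ"), ("Central", "PROPN"), ("Park", "PROPN")]
types, n_hits, n_propn = propn_adj_overlap(toy)
print(sorted(types), n_hits, n_propn)  # ['central', 'new'] 2 4

# Ratios for the counts quoted above.
print(f"{911 / 16885:.1%}", f"{409 / 8700:.1%}")
```

A PROPN token that passes this lexical check could then be cross-checked against the classifier's prediction before its relation is changed.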

amir-zeldes commented 3 years ago

Yes, this sounds good. Personally, I think adding amod NNPs would be beneficial even if we couldn't catch all of them, and the remaining tail of cases could then be caught gradually in future PRs. Would you like me to put up an auto-labeled version somewhere? As noted, I could also include probabilities if those are helpful.

nschneid commented 3 years ago

Sure, you could email it if that's easiest. Probabilities could help.

amir-zeldes commented 3 years ago

Sure, I'll send it over. Looks like for dev it predicts just 36 cases, so maybe it's not so many total.

nschneid commented 3 years ago

lab, meal, metal, phoenix, salary addressed in the above commit.

Leaving open for other difficulties noted in the original post.