Open nschneid opened 3 years ago
RE: identifying amod in names, I ran a BERT-based classifier on GUM deprels, and I get 93.65% accuracy from text alone, or 94.17% with gold xpos as a feature. F-score on the amod class is 93.59, or 96.19 with xpos. However, looking more closely at only the NNP(S) cases, the best classifier scores:
metric | score |
---|---|
prec | 0.815789 |
rec | 0.704545 |
f1 | 0.756098 |
This is for confusion of amod with anything, not just compound, and conversely for confusion of anything with amod.
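For anyone wanting to sanity-check the table, here is a minimal sketch of how the three scores relate. The counts (31 true positives, 7 false positives, 13 false negatives) are not reported anywhere in this thread; they are simply the smallest integers I could find that reproduce the ratios above, so treat them as an assumption.

```python
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall and F1 for one class, from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# ASSUMPTION: these counts are inferred from the ratios in the table,
# not taken from the classifier's actual output.
p, r, f = prf(tp=31, fp=7, fn=13)
print(f"prec={p:.6f} rec={r:.6f} f1={f:.6f}")
# prec=0.815789 rec=0.704545 f1=0.756098
```

If those counts are right, the NNP(S) evaluation covers only 44 gold amod tokens, so the scores are likely to be noisy.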
I'm happy to post an auto-fixed version of EWT using this classifier, but I'm not sure whether we'd consider it an improvement or a regression. It could be an improvement because, at worst, a non-amod prediction just leaves the existing compound in place, and an amod prediction is quite possibly correct. It could be worse because fixing only some cases would make the annotation inconsistent. I can also output prediction probabilities if that would be helpful (e.g. to review only the less certain cases).
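The probability-based review could look roughly like the sketch below. The `Prediction` record, the `triage` helper, the 0.9 threshold and the sample token IDs are all hypothetical, made up for illustration; the real classifier output may be shaped quite differently.

```python
from typing import NamedTuple


class Prediction(NamedTuple):
    token_id: str    # hypothetical "sent_id:token" identifier
    current: str     # deprel currently in the treebank
    predicted: str   # deprel predicted by the classifier
    prob: float      # classifier probability for `predicted`


def triage(preds, threshold=0.9):
    """Split compound->amod candidates into auto-fix and manual-review lists.

    Anything the classifier does not label amod is skipped, i.e. the
    existing compound is left alone.
    """
    auto, review = [], []
    for p in preds:
        if p.current != "compound" or p.predicted != "amod":
            continue
        (auto if p.prob >= threshold else review).append(p)
    return auto, review


# Invented sample rows, not real classifier output.
sample = [
    Prediction("dev-1:3", "compound", "amod", 0.97),
    Prediction("dev-2:5", "compound", "amod", 0.62),
    Prediction("dev-3:1", "compound", "compound", 0.88),
    Prediction("dev-4:2", "nmod", "amod", 0.95),
]
auto, review = triage(sample)
print([p.token_id for p in auto], [p.token_id for p in review])
```

Only the low-confidence list would then need a human pass, which keeps the manual work small if the number of predicted cases is modest.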
I think this classifier could be helpful in combination with lexical heuristics (e.g. whether the word ever appears as an ADJ in the corpus outside a name, or as an adjective in WordNet).
Looking at current lemmas, I see an intersection of 270 types between PROPN and ADJ, corresponding to 911/16885 ≈ 5.4% of PROPN tokens. In the version of GUM I have, 409/8700 ≈ 4.7% of PROPNs have amod. So maybe not too bad of a heuristic?
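A rough sketch of how such a lemma intersection can be computed from a CoNLL-U file, in case anyone wants to reproduce the count. The `lemma_overlap` helper and the toy sentence are mine, and lowercasing lemmas before comparing is an assumption; the numbers above may have been obtained with different matching.

```python
from collections import Counter, defaultdict


def lemma_overlap(conllu_text: str):
    """Return (lemmas seen as both ADJ and PROPN, PROPN tokens covered, PROPN tokens)."""
    lemmas = defaultdict(Counter)  # UPOS -> Counter of lowercased lemmas
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        lemma, upos = cols[2], cols[3]
        lemmas[upos][lemma.lower()] += 1  # ASSUMPTION: case-insensitive match
    shared = set(lemmas["ADJ"]) & set(lemmas["PROPN"])
    propn_total = sum(lemmas["PROPN"].values())
    covered = sum(n for lem, n in lemmas["PROPN"].items() if lem in shared)
    return shared, covered, propn_total


# Toy sentence for illustration only, not taken from EWT or GUM.
rows = [
    ("1", "New", "New", "PROPN", "NNP", "_", "2", "compound", "_", "_"),
    ("2", "York", "York", "PROPN", "NNP", "_", "3", "nsubj", "_", "_"),
    ("3", "has", "have", "VERB", "VBZ", "_", "0", "root", "_", "_"),
    ("4", "a", "a", "DET", "DT", "_", "6", "det", "_", "_"),
    ("5", "new", "new", "ADJ", "JJ", "_", "6", "amod", "_", "_"),
    ("6", "lab", "lab", "NOUN", "NN", "_", "3", "obj", "_", "_"),
]
toy = "# text = New York has a new lab\n" + "\n".join("\t".join(r) for r in rows)

shared, covered, total = lemma_overlap(toy)
print(sorted(shared), f"{covered}/{total} PROPN tokens")
```

On a real treebank the shared set would still need manual filtering, since it mixes genuine adjectives inside names with lemmas that are only tagged ADJ by mistake.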
Yes, this sounds good. Personally I think adding amod NNPs would be beneficial even if we couldn't catch all of them, and the remaining tail of cases could then be caught gradually in future PRs. Would you like me to put up an auto-labeled version somewhere? As noted, I could also include probabilities if those are helpful.
Sure, you could email it if that's easiest. Probabilities could help.
Sure, I'll send it over. Looks like for dev it predicts just 36 cases, so maybe it's not so many total.
lab, meal, metal, phoenix, salary addressed in the above commit.
Leaving open for other difficulties noted in the original post.
As a preface to trying to identify adjectives in proper names, I did a quick scan of lemmas appearing as both ADJ and PROPN in the corpus. The following ones look questionable:
Currently ADJ, should be NOUN
Typically in attributive position, so amod should be replaced with compound: lab, meal (worm), metal, phoenix, salary
Borderline:
Difficulties with compounds and particles
post/X election/ADJ (see also #152 on separated affixes)
up/ADJ: these look suspicious
spot/ADJ on/ADJ (amod); dead/ADJ on/ADJ (amod); run/ADJ down/ADP (compound:prt)
good/ADJ looking/ADJ (amod: implied by https://universaldependencies.org/u/pos/ADJ.html, but not sure why it’s not compound)
leave it on/ADJ
Lemma issues
b, w in B&W: should be lemmatized as black, white with Abbr=Yes